JP2012079266A - Information processing apparatus, fault portion discrimination method and fault portion discrimination program

Info

Publication number: JP2012079266A
Application number: JP2010226577A
Authority: JP
Inventors: Yuko Wakagi; 裕子若木
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2010-10-06
Filing date: 2010-10-06
Publication date: 2012-04-19
Anticipated expiration: 2030-10-06
Also published as: JP5541519B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing apparatus with a multiprocessor configuration that performs startup processing promptly and can accurately discriminate a fault portion even when an interface fault occurs. SOLUTION: When a BIOS 61 operating on, e.g., a CPU 11 detects a fault, it transmits to a BMC 3 a fault detection notice comprising an error code based on an analysis of a status register 21 in the CPU 11 and a request to investigate a CPU 12, which is the communication destination of an interface circuit 5. On the basis of the error code in the received fault detection notice, a BMCFW 7 operating on the BMC 3 determines, as a first suspicious ratio, a ratio indicating the likelihood that each suspect portion is faulty and, on the basis of a read of a status register in the CPU 12 of the communication destination, determines a corresponding second suspicious ratio. The first and second suspicious ratios are merged in accordance with predetermined rules to obtain a final suspicious ratio, from which the fault portion is discriminated.

Description

The present invention relates to an information processing apparatus, a failure part determination method, and a failure part determination program, and in particular to ones that are applied to servers and the like and are well suited to fields in which maintenance work must be performed accurately and quickly.

Recently, in the field of information processing apparatuses applied to servers and the like, a technique has become established in which, to improve performance, a plurality of components, namely CPUs (central processing units), are pipeline-connected through interface circuits to form a multiprocessor configuration, as typified by RISC (Reduced Instruction Set Computer) designs.

When a failure is detected in an information processing apparatus with such a multiprocessor configuration, it is important to provide a failure handling function that identifies the failure suspect component, that is, the CPU most likely to have failed, and separates it from the operational system so that subsequent processing can proceed smoothly. To this end, Patent Document 1 (JP 2006-178557 A, "Computer System and Error Processing Method") and Patent Document 2 (JP 2010-026677 A, "File Sharing Device and File Sharing System") propose techniques for determining the failed part in a multiprocessor information processing apparatus.

Patent Document 1: JP 2006-178557 A (pages 10-13)
Patent Document 2: JP 2010-026677 A (pages 7-9)

In a multiprocessor information processing apparatus as described above, suppose that a failure of the interface circuit between CPUs is detected while the basic input/output system (BIOS) is starting up the apparatus. The CPU that detected the interface failure (that is, a link failure) cannot access the component at the other end of the failed interface circuit (that is, a processor with a processing function, such as a CPU or a control device). Conventional failure part determination methods therefore generally analyze the failure and point out the suspect part using only information from the detecting CPU, without using any information about the communication destination component, even though the failure is an interface failure. As a result, the reliability of the diagnosis is low: the correct failed part may not be identified in a single round of failure processing, and failure processing may have to be repeated several times.

On the other hand, in information processing apparatuses such as those described in Patent Documents 1 and 2, each component of the multiprocessor (that is, a processor with a processing function, such as a CPU or a control device) is connected through an interface circuit to a baseboard management controller (BMC), which has functions for remotely managing and monitoring hardware operation. The BMC can therefore perform the startup processing of each component in place of the BIOS.

In such a configuration, even if an interface failure is detected while the BMC is starting up each component (CPU, control device, and so on), the BMC can collect state information not only from the CPU that detected the interface failure but also from the component at the other end of the failed interface circuit, and can analyze the failed part accordingly. However, the BMC starts up each component more slowly than the BIOS does, so startup takes longer.

The conventional problem in an information processing apparatus equipped with a BMC is explained further with reference to FIG. 9. FIG. 9 is a block diagram of a conventional multiprocessor information processing apparatus; it shows a schematic configuration for explaining the problem with the conventional failure part determination method when a BMC is used.

The information processing apparatus of FIG. 9 has a multiprocessor configuration of two CPUs 11 and 12. The CPUs 11 and 12 contain BIOS 61 and 62, respectively, as the basic input/output systems that perform startup processing, and status registers 21 and 22, respectively, which hold the state of each CPU. The two CPUs can communicate with each other through the interface circuit 5. In addition, a BMC 3 for remotely managing and monitoring the operating states of the CPUs 11 and 12 is connected to them through interface circuits 41 and 42, respectively. The BMC 3 contains BMCFW 7, the baseboard management controller firmware that executes the remote management and monitoring functions; when a failure occurs, it can read and output the states of the CPUs 11 and 12 on instruction from maintenance personnel.

In a multiprocessor information processing apparatus that is started up by the BIOS 61 and 62, if a failure occurs during initialization of the interface circuit 5 between the CPU 11 and the CPU 12, the interface circuit 5 becomes unusable. The BIOS 61 operating on the CPU 11 that detected the interface failure, for example, can then no longer directly access the CPU 12 at the other end. The conventional failure part determination method therefore does not investigate the state of the CPU 12 and determines the suspect part from the state of the detecting CPU 11 alone, which degrades the accuracy with which the suspect part is pointed out.

In the information processing apparatus of FIG. 9, BMCFW 7, the baseboard management controller firmware built into the BMC 3, operates on the BMC 3, which manages and monitors the entire apparatus by managing the operating states of the CPUs 11 and 12. BMCFW 7 can access every component connected to the interface circuits 41 and 42, that is, both CPUs 11 and 12. It would therefore be possible to have BMCFW 7 perform the entire startup processing of the apparatus and, if a failure occurs during startup, to have BMCFW 7 determine the failed part, which would enable more accurate failure processing. However, startup performed by BMCFW 7 of the BMC 3 is slower than startup performed by the BIOS 61 and 62.
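The access constraints described for FIG. 9 can be illustrated with a small model. The following Python sketch is illustrative only and is not part of the disclosed apparatus: the component labels, the `LINKS` table, and the function `directly_accessible` are hypothetical names introduced here, and the model assumes that a component can access only those components with which it shares a working interface circuit.

```python
# Hypothetical model of the FIG. 9 topology: (interface circuit number, endpoint, endpoint).
LINKS = [
    (5, "CPU11", "CPU12"),   # interface circuit 5 between the two CPUs
    (41, "BMC3", "CPU11"),   # interface circuit 41 between the BMC and CPU 11
    (42, "BMC3", "CPU12"),   # interface circuit 42 between the BMC and CPU 12
]


def directly_accessible(host, failed_circuits=frozenset()):
    """Return the components that `host` can access over one working interface circuit."""
    peers = set()
    for circuit, a, b in LINKS:
        if circuit in failed_circuits:
            continue  # a failed circuit provides no access in either direction
        if a == host:
            peers.add(b)
        elif b == host:
            peers.add(a)
    return peers


if __name__ == "__main__":
    # With interface circuit 5 failed, CPU 11 loses CPU 12 but the BMC keeps both CPUs.
    print(sorted(directly_accessible("CPU11", {5})))
    print(sorted(directly_accessible("BMC3", {5})))
```

Under this model, a failure of interface circuit 5 leaves the BIOS on CPU 11 with access to the BMC 3 only, while the BMC 3 retains access to both CPUs through interface circuits 41 and 42.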

The present invention has been made to solve these problems. Its object is to provide an information processing apparatus, a failure part determination method, and a failure part determination program that, in a multiprocessor information processing apparatus, perform startup processing quickly and can accurately determine the failed part even when an interface failure occurs.

To solve the problems described above, the information processing apparatus, failure part determination method, and failure part determination program according to the present invention mainly adopt the following characteristic configurations.

(1) An information processing apparatus according to the present invention has a multiprocessor configuration consisting of a plurality of processors connected to one another through an interface circuit, and includes a baseboard management controller (BMC) that is connected to each processor and manages and monitors each connected processor. The firmware operating on the BMC (BMCFW: BMC Firmware) has a failure processing function that, in cooperation with the basic input/output system (BIOS) operating on each processor, determines a failed part and separates it from the operational system. When the BIOS detects a failure while executing the startup operation of the apparatus, it transmits a failure detection notification to the BMC. The notification consists of an error code obtained by analyzing the operating state held in the status register of the processor on which that BIOS operates and, when the detected failure is determined to be a link failure of the interface circuit, a communication destination processor investigation request that asks for analysis of the state of the processor at the other end of that interface circuit. On receiving the failure detection notification, the BMCFW determines, for the processors connected through the interface circuit, a first suspect ratio indicating the likelihood that each failure suspect part is at fault, based on the error code contained in the notification. When the notification contains the communication destination processor investigation request, the BMCFW also reads and analyzes the operating state held in the status register of the processor at the other end of the failed interface circuit and, based on the result, determines a second suspect ratio indicating the likelihood that each failure suspect part is at fault. The BMCFW then merges the first and second suspect ratios according to a predetermined rule to obtain a final suspect ratio, determines the part with the highest final suspect ratio to be the failed part, and separates that part from the operational system.

(2) A failure part determination method according to the present invention determines a failed part in an information processing apparatus having a multiprocessor configuration consisting of a plurality of processors connected to one another through an interface circuit. The apparatus includes a baseboard management controller (BMC) that is connected to each processor and manages and monitors each connected processor. The firmware operating on the BMC (BMCFW: BMC Firmware) has a failure processing function that, in cooperation with the basic input/output system (BIOS) operating on each processor, determines a failed part and separates it from the operational system. When the BIOS detects a failure while executing the startup operation of the apparatus, it transmits a failure detection notification to the BMC. The notification consists of an error code obtained by analyzing the operating state held in the status register of the processor on which that BIOS operates and, when the detected failure is determined to be a link failure of the interface circuit, a communication destination processor investigation request that asks for analysis of the state of the processor at the other end of that interface circuit. On receiving the failure detection notification, the BMCFW determines, for the processors connected through the interface circuit, a first suspect ratio indicating the likelihood that each failure suspect part is at fault, based on the error code contained in the notification. When the notification contains the communication destination processor investigation request, the BMCFW also reads and analyzes the operating state held in the status register of the processor at the other end of the failed interface circuit and, based on the result, determines a second suspect ratio indicating the likelihood that each failure suspect part is at fault. The BMCFW then merges the first and second suspect ratios according to a predetermined rule to obtain a final suspect ratio, determines the part with the highest final suspect ratio to be the failed part, and separates that part from the operational system.

(3) A failure part determination program according to the present invention implements at least the failure part determination method described in (2) above as a program executable by a computer.
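The processing order recited in (1) and (2) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function and parameter names are hypothetical, the table that maps error codes to first suspect ratios and the analysis of the peer status register are passed in as placeholders, and the merge rule is supplied by the caller because the text above states only that it is predetermined.

```python
def determine_failed_part(error_code, investigation_target, first_ratio_table,
                          analyze_peer, merge):
    """Return (failed part, final suspect ratios) for one failure detection notification.

    All names are hypothetical. `investigation_target` is None when the notification
    carries no communication destination processor investigation request.
    """
    # First suspect ratio: looked up from the error code reported by the BIOS.
    first = first_ratio_table[error_code]
    if investigation_target is None:
        final = dict(first)
    else:
        # Second suspect ratio: derived from the peer processor's status register.
        second = analyze_peer(investigation_target)
        final = merge(first, second)
    # The part with the highest final suspect ratio is treated as the failed part.
    failed = max(final, key=final.get)
    return failed, final


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise both branches.
    table = {0x01: {"CPU11": 60, "CPU12": 40}}
    average = lambda a, b: {k: (a[k] + b[k]) / 2 for k in a}
    print(determine_failed_part(0x01, None, table, lambda t: {}, average))
    print(determine_failed_part(0x01, "CPU12", table,
                                lambda t: {"CPU11": 10, "CPU12": 90}, average))
```

Separating the returned part from the operational system is outside the scope of this sketch.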

The information processing apparatus, failure part determination method, and failure part determination program of the present invention provide the following effects.

The first effect is that the suspect part can be pointed out accurately even for a link failure of an interface circuit detected by the BIOS operating on a CPU during startup of the information processing apparatus. In the present invention, the BMC that receives the failure detection notification from the BIOS analyzes the failed part using not only the suspect component information produced by the BIOS analysis but also suspect component information obtained by directly accessing the component at the other end of the failed interface circuit and analyzing its state. This is what makes accurate identification of the suspect part possible.

The second effect is that the state of a component that the BIOS cannot access during startup of the information processing apparatus can still be ascertained. This is because the BIOS includes in the failure detection notification, as an investigation request, a code that identifies the component it wants accessed, and the BMC that receives the notification directly accesses the component named in the request.

The third effect is that the startup time of the information processing apparatus can be shortened. In the present invention, even when a link failure of an interface circuit is detected at startup, the BMC can point out the suspect part accurately. This accuracy prevents the separation process and the reboot operation from being repeated for the same failure, which shortens startup.
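The effect on startup time can be illustrated with a deliberately simplified model. The sketch below assumes, purely for illustration, that each failed boot is followed by separating one suspect part and rebooting, and that startup succeeds once the truly failed part has been separated. The function name `boot_attempts` and this model are assumptions introduced here, not part of the disclosure.

```python
def boot_attempts(isolation_order, truly_failed):
    """Count boots until startup succeeds, separating one suspect part per failed boot."""
    attempts = 1  # the boot on which the failure is first detected
    for part in isolation_order:
        attempts += 1  # separate `part`, then reboot
        if part == truly_failed:
            return attempts  # the fault has been removed, so this boot completes
    raise ValueError("the failed part was never separated")


if __name__ == "__main__":
    # Inaccurate diagnosis: the healthy CPU 11 is separated first.
    print(boot_attempts(["CPU11", "CPU12"], "CPU12"))
    # Accurate diagnosis: the failed CPU 12 is separated immediately.
    print(boot_attempts(["CPU12"], "CPU12"))
```

In this model an accurate first diagnosis saves one separation and reboot cycle for every wrongly separated part.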

FIG. 1 is a block diagram showing an example of the block configuration of an information processing apparatus according to the present invention.
FIG. 2 is a table for explaining an example of the error code table in which the BIOS holds the error codes used when a failure is detected in the information processing apparatus shown in FIG. 1.
FIG. 3 is a table for explaining an example of the state information held by the status register in the information processing apparatus shown in FIG. 1.
FIG. 4 is a table for explaining an example of the failure table in which the BMC holds information on failure suspect components in the information processing apparatus shown in FIG. 1.
FIG. 5 is a flowchart for explaining an example of the operation of the information processing apparatus shown in FIG. 1.
FIG. 6 is a table showing an example of the format of the failure detection notification transmitted to the BMC by a BIOS that has detected a failure in the information processing apparatus shown in FIG. 1.
FIG. 7 is a table showing another example of the format of the failure detection notification transmitted to the BMC by a BIOS that has detected a failure in the information processing apparatus shown in FIG. 1.
FIG. 8 is a block diagram showing another example, different from FIG. 1, of the block configuration of an information processing apparatus according to the present invention.
FIG. 9 is a block diagram showing the block configuration of a conventional multiprocessor information processing apparatus.

Preferred embodiments of the information processing apparatus, failure part determination method, and failure part determination program according to the present invention are described below with reference to the accompanying drawings. The description covers the information processing apparatus and the failure part determination method, but it goes without saying that the method may be implemented as a failure part determination program executable by a computer, and that the program may be recorded on a computer-readable recording medium.

(Features of the present invention)
Before the embodiments are described, the features of the present invention are outlined. The present invention relates to a technique for determining more accurately the failed part behind a link failure of an interface circuit, even when the failure is detected during initialization of the interface circuit, for example during apparatus startup, by the basic input/output system (hereinafter, BIOS) operating on a processor with a processing function, such as a CPU or a control device, in a multiprocessor information processing apparatus.

That is, the firmware (BMCFW: BMC Firmware; hereinafter, BMCFW) operating on the baseboard management controller (hereinafter, BMC), which manages and monitors the operating states of the components of the multiprocessor configuration (that is, processors with a processing function, such as CPUs or control devices), is made to cooperate with the BIOS that starts up the information processing apparatus. The main feature of the invention is that the BMCFW is assigned the failure processing function for determining the failed part and determines it accurately by reading and analyzing the status of a specified component among the components of the apparatus.

More specifically, when a link failure of an interface circuit occurs, communication is cut off between the component that detected the interface failure (hereinafter, CPU 11) and the component at the other end of that interface circuit (hereinafter, CPU 12), and the BIOS operating on CPU 11 cannot directly access CPU 12. Even in this situation, the BMCFW responsible for the failure processing function reads the state of CPU 12 in response to an investigation request from the BIOS of CPU 11, analyzes the states of both CPU 11 and CPU 12, and determines the failed part accurately. The main feature of the invention is that this in turn shortens the startup time of the information processing apparatus.

The failure part determination method according to the present invention is explained further using the block diagram of FIG. 9 described earlier. When the BIOS 61 operating on the CPU 11 detects a link failure of the interface circuit 5 during startup of the information processing apparatus, access between the components through the interface circuit 5 (between the CPU 11 and the CPU 12) is cut off. The BIOS 61 operating on the CPU 11 therefore cannot access the CPU 12, and conventional failure processing, as described above, determined the failed part using only the state inside the CPU 11.

In contrast, the present invention adds a mechanism in which the BIOS 61 of the detecting CPU 11 transmits a failure detection notification through the interface circuit 41 to the BMC 3, which is in charge of the failure processing function, and, in that notification, reports a code identifying the CPU 12 at the other end of the failed interface circuit 5, thereby requesting an investigation of the state of the CPU 12.
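The content of such a notification can be sketched as a small record. The Python below is an illustration under stated assumptions: the class `FailureDetectionNotification`, its field names, the helper `build_notification`, and the example error code value are all hypothetical, and the actual notification formats of FIG. 6 and FIG. 7 are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureDetectionNotification:
    """Hypothetical record of what the BIOS reports to the BMC."""
    source: str                        # component whose BIOS detected the failure
    error_code: int                    # result of analyzing the local status register
    investigate: Optional[str] = None  # peer to investigate; set only for link failures


def build_notification(source, error_code, is_link_failure, peer):
    """Attach the investigation request only when the failure is a link failure."""
    target = peer if is_link_failure else None
    return FailureDetectionNotification(source, error_code, target)


if __name__ == "__main__":
    # 0x21 is an arbitrary example value, not an error code from the specification.
    print(build_notification("CPU11", 0x21, True, "CPU12"))
    print(build_notification("CPU11", 0x21, False, "CPU12"))
```

Keeping the investigation target optional mirrors the text: the request to investigate the peer accompanies the error code only when the BIOS judges the failure to be a link failure.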

Furthermore, BMCFW 7, the firmware operating on the BMC 3, has a function that reads, through the interface circuit 42, the contents of the status register 22 of the communication destination CPU 12, which the BIOS 61 of the detecting CPU 11 cannot read, and analyzes the failed part for the failure detection notification received from the BIOS 61 of the CPU 11 through the interface circuit 41.

In addition, the BMCFW 7 is provided with a mechanism that merges, according to predetermined rules, the suspicion ratios of the failed parts that the BIOS 61 of the failure detection source CPU 11 and the BMCFW 7 running on the BMC 3 obtained by analyzing the states of the CPU 11 and the CPU 12, respectively, and thereby recomputes the final suspicion ratios.

With the mechanism described above, the states of both the CPU 11 and the CPU 12 can be analyzed more accurately even when a link failure occurs on the interface circuit 5, so the suspected failure part can be identified with higher accuracy than in the prior art.

(Configuration example of embodiment)
Next, an example of the block configuration of the information processing apparatus of the present invention will be described in detail with reference to FIG. 1. FIG. 1 is a block diagram showing an example of the block configuration of the information processing apparatus according to the present invention. It is substantially the same as the block configuration of the conventional information processing apparatus described with FIG. 9, but differs from FIG. 9 in one respect: the firmware BMCFW built into the BMC, which manages and monitors the state of each component of the multiprocessor configuration (that is, a processor having a processing function, such as a CPU or a control device), is provided with a failure processing function that operates in cooperation with the BIOS built into each component and determines the failure part accurately.

Like the conventional information processing apparatus of FIG. 9, the information processing apparatus of FIG. 1 comprises two components connected to each other via the interface circuit 5, for example the CPUs 11 and 12, and a baseboard management controller BMC 3 connected to the CPUs 11 and 12 via the interface circuits 41 and 42, respectively. On the CPUs 11 and 12, the BIOSes 61 and 62 run, respectively, as the basic input/output systems (firmware) that perform startup processing. On the BMC 3, the BMCFW 7 runs as the baseboard management controller firmware that remotely manages and monitors the CPUs 11 and 12. The CPUs 11 and 12 have status registers 21 and 22, respectively, which hold their internal state, including the operating state of the interface circuit 5.

The state information held in the status registers 21 and 22 can be read by the BIOSes 61 and 62 built into the respective CPUs 11 and 12, and can also be read directly by the BMCFW 7 running on the BMC 3 via the interface circuits 41 and 42, respectively.

The BIOS 61 running on the CPU 11 performs link-up of the interface circuit 5 between the CPU 11 and the CPU 12 during startup of the apparatus. Link-up of the interface circuit 5 refers to the operation that makes the interface provided by the interface circuit 5 usable. When the BIOS 61 of the CPU 11 detects a link failure on the interface circuit 5, failure processing for determining the failure part is started, and the BIOS 61 sends, via the interface circuit 41, a failure detection notification to the BMCFW 7 of the BMC 3, which is responsible for the failure processing function. The notification consists of an error code indicating the content of the detected failure and a code indicating the communication destination component (for example, the CPU 12) with which communication was being attempted when the failure occurred. Because the apparatus is still starting up and link-up of the interface circuit 5 between the CPU 11 and the CPU 12 has not succeeded, the BIOS 61 of the CPU 11 cannot directly access the CPU 12 on the other side of the interface circuit 5.

Here, the BIOS 61 running on the CPU 11 and the BIOS 62 running on the CPU 12 each hold a list of error codes such as the example shown in FIG. 2. FIG. 2 is a table illustrating an example of the error code table in which the BIOSes 61 and 62 of the information processing apparatus shown in FIG. 1 hold the error codes used at failure detection. The error code table lists error codes, each of which uniquely identifies a failure part, together with the component that was the processing target when the failure was detected (a CPU, a control device, or the like; the CPUs 11 and 12 in this embodiment) and related items. That is, the error code table of FIG. 2 includes at least an error code 81, a processing target component 82, a link failure 83, and a suspected failure component 84.

In the error code table of FIG. 2, for example, when the error code 81 is '0x0000_0001', the processing target component 82, which indicates the component that detected the failure, is the CPU 11; the link failure 83 shows that the failure indicated by the error code is a link failure of the interface circuit 5 that occurred during link-up; and the suspected failure component 84, which indicates the component most likely to have failed, is the CPU 11.

Likewise, when the error code 81 is '0x0000_0002', the processing target component 82, which indicates the component that detected the failure, is the CPU 11; the link failure 83 shows that the failure indicated by the error code is a link failure of the interface circuit 5 that occurred during link-up; and the suspected failure component 84 is the CPU 12.

When the error code 81 is '0x0000_1001', the processing target component 82 is again the CPU 11, but the link failure 83 shows that the failure indicated by the error code is not a link failure, and the suspected failure component 84 is the CPU 11.

FIG. 3 shows a configuration example of the status registers 21 and 22 of the CPUs 11 and 12. FIG. 3 is a table illustrating an example of the state information held in the status registers 21 and 22 of the information processing apparatus shown in FIG. 1, showing only the portion that concerns the transmission/reception state of the interface circuit 5 connecting the CPUs 11 and 12. As shown in the Bit column 91 of FIG. 3, the status registers 21 and 22 provide, for example, a 16-bit register area that indicates the transmission/reception state of the interface circuit 5.

As shown in the Bit column 91 of FIG. 3, one byte is provided for each of the transmitting side and the receiving side of the interface circuit 5. The register area that records the receiving-side state is bits [0:7] (bit 0 through bit 7), and the register area that records the transmitting-side state is bits [8:15] (bit 8 through bit 15). This embodiment illustrates the case in which the communication processing of the interface circuit 5 consists of six stages on both the transmitting side and the receiving side.

As shown in the description column 92 and the content column 93, on the receiving side, bit [0] indicates whether the CPU 11 or 12 has received anything: '0' means nothing was received and '1' means something was received. Bit [1] indicates whether the reception processing of the CPU 11 or 12 ended normally: '0' means normal termination and '1' means abnormal termination. Bits [2:7] indicate, for each of the six stages that represent the progress of the reception processing of the CPU 11 or 12, whether that stage ended normally: '0' means no abnormality and '1' means an abnormality.

Of bits [2:7], bit [2] indicates whether the first stage of the reception processing ended normally ('0' for no abnormality, '1' for an abnormality), and bits [3:7] indicate the same, in order, for the second, third, fourth, fifth, and sixth stages of the reception processing.

The same applies to the transmitting side. Bit [8] indicates whether the CPU 11 or 12 has transmitted anything: '0' means nothing was transmitted and '1' means something was transmitted. Bit [9] indicates whether the transmission processing of the CPU 11 or 12 ended normally: '0' means normal termination and '1' means abnormal termination. Bits [10:15] indicate, for each of the six stages that represent the progress of the transmission processing of the CPU 11 or 12, whether that stage ended normally: '0' means no abnormality and '1' means an abnormality.

Of bits [10:15], bit [10] indicates whether the first stage of the transmission processing ended normally ('0' for no abnormality, '1' for an abnormality), and bits [11:15] indicate the same, in order, for the second, third, fourth, fifth, and sixth stages of the transmission processing.
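
As an illustration only, the register layout of FIG. 3 described above can be decoded as in the following Python sketch. The names `HalfStatus`, `decode_half`, and `decode_status` are hypothetical, and the sketch assumes that bit 0 is the least significant bit, which the description does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HalfStatus:
    active: bool                    # bit [0] / [8]: '1' = received / transmitted
    abnormal_end: bool              # bit [1] / [9]: '1' = abnormal termination
    stage_errors: Tuple[bool, ...]  # bits [2:7] / [10:15]: stages 1..6, True = abnormality


def decode_half(byte: int) -> HalfStatus:
    """Decode one 8-bit half (receive or transmit) of the link status area."""
    return HalfStatus(
        active=bool(byte & 0x01),
        abnormal_end=bool(byte & 0x02),
        stage_errors=tuple(bool((byte >> (2 + i)) & 1) for i in range(6)),
    )


def decode_status(reg: int) -> Tuple[HalfStatus, HalfStatus]:
    """Return (receive, transmit) status from the 16-bit register area.

    Assumption: bit 0 is the least significant bit.
    """
    return decode_half(reg & 0xFF), decode_half((reg >> 8) & 0xFF)
```

For example, under this assumption a register value of 0x0B00 decodes to a transmitting side that did transmit, ended abnormally, and reported an abnormality in the second stage, with nothing recorded on the receiving side.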

The BMCFW 7, the baseboard management controller firmware running on the BMC 3, can directly access the status registers 21 and 22, which hold the states of the CPUs 11 and 12, via the interface circuits 41 and 42, respectively. The BMCFW 7 also has a failure processing function that refers to the status registers 21 and 22 to analyze failures of the CPUs 11 and 12, and a function that calculates the failure suspicion ratios by merging the failure analysis results obtained by the BIOSes 61 and 62 with the failure analysis results obtained by the BMCFW 7 itself.

FIG. 4 is a table illustrating an example of the failure table in which the BMC 3 of the information processing apparatus shown in FIG. 1 holds information on suspected failure components. The table lists the suspicion ratios of the suspected failure components, which are determined from the failure analysis results of the BIOSes 61 and 62.

The failure table of FIG. 4 includes at least an error code 101, a suspected part 1 102, a suspected part 2 103, a suspicion ratio 1 104, and a suspicion ratio 2 105. As a failure list, for each error code 101 uniquely determined by the BIOS 61 or 62 of the failure detection source CPU 11 or 12, it lists two suspected parts (suspected part 1 102 and suspected part 2 103) and two suspicion ratios (suspicion ratio 1 104 and suspicion ratio 2 105). The error codes in the error code 101 column of the failure table of FIG. 4 correspond one-to-one to those in the error code 81 column of the error code table of FIG. 2.

In other words, for each error code uniquely determined by the BIOS 61 or 62, the failure table of FIG. 4 holds in advance, as suspicion ratio 1 104 and suspicion ratio 2 105, suspicion ratios that quantify how likely each of the two components connected via the interface circuit 5 (the CPUs 11 and 12 in this embodiment) is to have failed.

For example, as explained for the error code table of FIG. 2, when the error code 81 is '0x0000_0001', the analysis by the BIOS 61 or 62 estimates that the component most likely to have failed is the CPU 11, and the failure indicated by the error code is a link failure that occurred during link-up. Accordingly, in the failure table of FIG. 4, when the error code 101 is '0x0000_0001', the suspicion ratio for the CPU 11 is set in advance to 70%, as shown in suspected part 1 102 and suspicion ratio 1 104, and the suspicion ratio for the CPU 12 is set in advance to 30%, as shown in suspected part 2 103 and suspicion ratio 2 105.

Likewise, as explained for the error code table of FIG. 2, when the error code 81 is '0x0000_0002', the analysis by the BIOS 61 or 62 estimates that the failure part is the CPU 12, and the failure indicated by the error code is a link failure that occurred during link-up. Accordingly, in the failure table of FIG. 4, when the error code 101 is '0x0000_0002', the suspicion ratio for the CPU 12 is set in advance to 70%, as shown in suspected part 1 102 and suspicion ratio 1 104, and the suspicion ratio for the CPU 11 is set in advance to 30%, as shown in suspected part 2 103 and suspicion ratio 2 105.

Finally, as explained for the error code table of FIG. 2, when the error code 81 is '0x0000_1001', the analysis by the BIOS 61 or 62 estimates that the failure part is the CPU 11, and, unlike the case of error code '0x0000_0001', the failure indicated by the error code is not a link failure but a failure inside the CPU 11. Accordingly, in the failure table of FIG. 4, when the error code 101 is '0x0000_1001', the suspicion ratio for the CPU 11 is set in advance to 100%, as shown in suspected part 1 102 and suspicion ratio 1 104, and the suspicion ratio for the CPU 12 is set in advance to 0%, as shown in suspected part 2 103 and suspicion ratio 2 105.
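
The three rows of the failure table described above can be written out as a simple lookup, as in the following Python sketch. It is illustrative only: `FAILURE_TABLE` and `lookup_suspicion_ratios` are hypothetical names, the part labels are plain strings chosen for the example, and only the three error codes named in the description are included.

```python
# Preset suspicion ratios keyed by error code, as (part, percent) pairs in the
# order suspected part 1, suspected part 2 (values taken from the description of FIG. 4).
FAILURE_TABLE = {
    0x0000_0001: (("CPU11", 70), ("CPU12", 30)),
    0x0000_0002: (("CPU12", 70), ("CPU11", 30)),
    0x0000_1001: (("CPU11", 100), ("CPU12", 0)),
}


def lookup_suspicion_ratios(error_code: int) -> dict:
    """Return {part: percent} for a known error code; raises KeyError otherwise."""
    return dict(FAILURE_TABLE[error_code])
```

In each row the two ratios sum to 100%, which matches the three examples given in the description.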

(Description of operation of embodiment)
Next, an example of the operation of the information processing apparatus shown in FIG. 1 will be described with reference to the flowchart of FIG. 5. FIG. 5 is a flowchart illustrating an example of that operation: FIG. 5(A) shows an example of the operation of the basic input/output systems BIOS 61 and 62 running on the two components, that is, the CPUs 11 and 12, and FIG. 5(B) shows an example of the operation of the baseboard management controller firmware BMCFW 7 running on the BMC 3. For clarity, the following description covers the case in which, of the two CPUs 11 and 12, the BIOS 61 running on the CPU 11 detects a failure during startup of the apparatus. The BIOS 62 running on the CPU 12 operates in exactly the same way, with the BIOS 61 and the BIOS 62 interchanged.

First, with reference to FIG. 5(A), the description focuses on the operation when the BIOS 61 running on the CPU 11 detects a failure while performing, as part of the startup processing of the CPU 11, the initial setting of the interface circuit 5 between the CPU 11 and the CPU 12.

In FIG. 5(A), suppose that the BIOS 61 of the CPU 11 is performing the initial setting of the interface circuit 5 between the CPU 11 and the CPU 12 during startup, that is, link initialization (link-up) of the interface circuit 5. If at this stage it detects an abnormality such as "no response from the CPU 12 side" or "an unexpected response was received from the CPU 12 side" (step A1), failure processing for identifying the failure part is started and the flow proceeds to step A2.

When the failure processing is started, the BIOS 61 accesses bits [8:15] of the status register 21 of the CPU 11, which is configured as shown in FIG. 3, and checks whether an abnormality has occurred in the transmission operation from the CPU 11 to the CPU 12. That is, the BIOS 61 reads bits [8:15] of the status register 21, analyzes whether an abnormality occurred in the transmission operation performed as the initial setting operation of the interface circuit 5 (the link), and, if an abnormality is detected, identifies the corresponding error code by referring to the error code table of FIG. 2 (step A2).

Specifically, in step A2, the BIOS 61 first examines the "transmission present/absent" state in bit [8] of the status register 21. If bit [8] is '0', the BIOS 61 detects that the transmission operation was not performed because of some abnormality, even though the BIOS 61 had instructed it. If bit [8] is '1', the BIOS 61 confirms that some transmission operation was performed toward the CPU 12 via the interface circuit 5.

If some transmission operation was performed, the BIOS 61 next examines the "normal termination of transmission" state in bit [9] of the status register 21. If bit [9] is '0', the transmission operation ended normally; if bit [9] is '1', the BIOS 61 detects that the transmission operation could not be carried out normally because of some abnormality and ended abnormally.

If the transmission operation ended abnormally, the BIOS 61 next examines bits [10:15] of the status register 21. As described above, bits [10:15] store, bit by bit, whether an abnormality occurred in each progress stage of the transmission operation. That is, bits [10] through [15] correspond to the first through sixth stages of the transmission operation, and each holds '0' if there was no abnormality and '1' if an abnormality occurred.

Thus, if bit [8] of the status register 21 is '0', some abnormality has occurred that prevented the transmission operation from being performed; and even if bit [8] is '1', the transmission operation ended abnormally if any of bits [9:15] is '1'. Conversely, if bit [8] is '1' and none of bits [9:15] is '1', the transmission operation ended normally, and it can be judged that the abnormality occurred somewhere other than in the transmission operation. In addition, the BIOS 61 can decide whether the failure is a link failure related to the interface circuit 5 from the processing it was performing when it detected the failure, for example from whether it was transmitting a link initialization command for the interface circuit 5.

When the failure is determined to be a link failure, the BIOS 61 determines the CPU 12, the communication destination of the interface circuit 5, to be the suspected failure component if the contents of bits [8:15] of the status register 21 confirm that the transmission operation was performed normally. If those bits show that some abnormality occurred in the transmission operation, the BIOS 61 determines the CPU 11 itself to be the suspected failure component.
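
The decision made in step A2 for a link failure can be sketched as follows. This Python fragment is a minimal illustration, not the firmware itself: `transmit_ok` and `suspect_for_link_failure` are hypothetical names, and bit 0 is assumed to be the least significant bit of the register value.

```python
def transmit_ok(status_reg: int) -> bool:
    """True if bits [8:15] show a transmission that ended without any abnormality."""
    tx = (status_reg >> 8) & 0xFF  # transmit-side byte, bits [8:15]
    sent = bool(tx & 0x01)         # bit [8]: '1' = a transmission took place
    any_error = bool(tx & 0xFE)    # bits [9:15]: any '1' = abnormality
    return sent and not any_error


def suspect_for_link_failure(status_reg: int, own: str, partner: str) -> str:
    """For a link failure: suspect the partner if our transmission was clean, otherwise ourselves."""
    return partner if transmit_ok(status_reg) else own
```

With the BIOS 61 as the caller, `own` would be the CPU 11 and `partner` the CPU 12; a register with only bit [8] set then points at the CPU 12, while a register with bit [8] clear or any of bits [9:15] set points at the CPU 11.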

Based on the result of this abnormality determination, the BIOS 61 refers to the error code table shown in FIG. 2. It uses three items: the processing target component on which the BIOS 61 runs, the presence or absence of a link failure as determined from the processing being performed when the failure was detected, and the suspected failure component for a link failure as determined from bits [8:15] of the status register 21. With these it searches the processing target component 82, the link failure 83, and the suspected failure component 84, and extracts the corresponding error code from the error code 81. For example, if the processing target component that detected the failure is the CPU 11 and the failure is a link failure for which the CPU 11 is the suspected failure component, the error code is '0x0000_0001', as shown in the error code table of FIG. 2. If the processing target component that detected the failure is the CPU 11 and the CPU 11 is the suspected failure component but the failure is not a link failure, the error code is '0x0000_1001', as shown in the error code table of FIG. 2.
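
The search of the error code table can be pictured as a lookup keyed by the three analysis results. The Python sketch below is illustrative only: `ERROR_CODES` and `find_error_code` are hypothetical names, and the table holds just the three rows of FIG. 2 that the description spells out.

```python
# Rows of FIG. 2 as described in the text:
# (processing target component, link failure?, suspected failure component) -> error code
ERROR_CODES = {
    ("CPU11", True, "CPU11"): 0x0000_0001,
    ("CPU11", True, "CPU12"): 0x0000_0002,
    ("CPU11", False, "CPU11"): 0x0000_1001,
}


def find_error_code(target: str, link_failure: bool, suspect: str) -> int:
    """Look up the error code for an analysis result; raises KeyError if no row matches."""
    return ERROR_CODES[(target, link_failure, suspect)]
```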

After the failure analysis of step A2 in FIG. 5(A), the BIOS 61 determines whether it detected a link failure during link initialization of the interface circuit 5, in order to set up the information with which it notifies the BMC 3 of the failure (step A3).

If a link failure was detected (yes in step A3), the BIOS 61 transmits to the BMC 3, via the interface circuit 41, a failure detection notification carrying the error code determined in step A2 and the code indicating the component that is the communication destination of the interface circuit 5 (step A4). For example, if the processing target component that detected the failure is the CPU 11 and the failure is a link failure for which the CPU 11 is the suspected failure component, the error code is '0x0000_0001' as described above and the communication destination component code is the code indicating the CPU 12. The notification thereby asks the BMC 3 to additionally examine the status register 22 of the CPU 12, the communication destination of the interface circuit 5, and to analyze the suspected failure part.

If no link failure was detected (no in step A3), the BIOS 61 transmits to the BMC 3, via the interface circuit 41, a failure detection notification that carries only the error code and no code identifying a communication destination component (step A5). For example, if the processing target component that detected the failure is the CPU 11 and the CPU 11 is the suspected failure component but the failure is not a link failure, the error code is '0x0000_1001', and the BIOS 61 transmits a failure detection notification in which the communication destination component code is set to all '0' (a code indicating that the communication destination component, that is, the CPU 12, need not be investigated). The notification thereby asks the BMC 3 to analyze the suspected failure part using the failure detection notification alone.
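
The branch of steps A3 to A5 amounts to attaching the communication destination code only when a link failure was detected. The following Python sketch shows that branch under stated assumptions: `FailureNotification`, `NO_PARTNER`, and `build_notification` are hypothetical names, and the actual transmission over the interface circuit 41 is not modeled.

```python
from typing import NamedTuple

NO_PARTNER = 0x0000  # all-'0' code: no investigation of the communication destination is needed


class FailureNotification(NamedTuple):
    error_code: int
    partner_code: int


def build_notification(error_code: int, link_failure: bool, partner_code: int) -> FailureNotification:
    """Step A3 branch: attach the partner code for a link failure (A4), otherwise all '0' (A5)."""
    if link_failure:
        return FailureNotification(error_code, partner_code)
    return FailureNotification(error_code, NO_PARTNER)
```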

FIG. 6 shows an example of the format of the failure detection notification transmitted from the BIOS 61 to the BMC 3. That is, FIG. 6 is a table showing an example of the failure detection notification format that the BIOS 61, having detected a failure, transmits to the BMC 3 in the information processing apparatus shown in FIG. 1. Needless to say, when the BIOS 62 running on the CPU 12 detects a failure, it can notify the BMC 3 using the same format as in FIG. 6.

As shown in the Bit column 111 of FIG. 6, the failure detection notification sent to the BMC 3 consists of, for example, 48 bits. As shown in the description column 112 and the content column 113, bits [0:31] hold the error code that the BIOS 61 determined as the result of its failure analysis (that is, the code identifying the failure), and bits [32:47] hold the destination component code, which identifies the component at the other end of the communication. Destination component codes are assigned to components on a one-to-one basis. As shown in the content column 113, when a link failure has been detected, this field holds a code other than '0x0000' that identifies the destination component to be investigated (for example, '0x0002' for the CPU 12 and '0x0001' for the CPU 11). When a failure other than a link failure has been detected, the field holds the all-'0' code '0x0000', which indicates that no destination has been designated for investigation.
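
The FIG. 6 layout can be illustrated with a minimal sketch. The function and constant names below are hypothetical, and the sketch assumes that bit 0 is the least significant bit of the packed value, which the description does not state.

```python
# Sketch of the 48-bit notification layout of FIG. 6.
# Assumption: bit 0 is the least significant bit of the packed integer.
NO_DESTINATION = 0x0000  # all-'0' code: no destination to investigate
CPU11_CODE = 0x0001
CPU12_CODE = 0x0002


def pack_notification(error_code: int, destination_code: int = NO_DESTINATION) -> int:
    """Place the error code in bits [0:31] and the destination code in bits [32:47]."""
    if not 0 <= error_code <= 0xFFFF_FFFF:
        raise ValueError("error code must fit in 32 bits")
    if not 0 <= destination_code <= 0xFFFF:
        raise ValueError("destination code must fit in 16 bits")
    return (destination_code << 32) | error_code


def unpack_notification(notification: int) -> tuple[int, int]:
    """Return (error_code, destination_code) from a packed 48-bit value."""
    return notification & 0xFFFF_FFFF, (notification >> 32) & 0xFFFF


def needs_destination_check(notification: int) -> bool:
    """A non-zero destination code means a link failure was reported."""
    return unpack_notification(notification)[1] != NO_DESTINATION
```

With these helpers, a link failure detected by the CPU 11 would be packed as `pack_notification(0x0000_0001, CPU12_CODE)`, and a non-link failure as `pack_notification(0x0000_1001)`.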

To reduce the amount of traffic over the interface circuit 41, a format such as that shown in FIG. 7 may be used instead of the format of FIG. 6 for the failure detection notification sent from the BIOS 61 to the BMC 3. FIG. 7 is a table showing another example of the failure detection notification format that the BIOS 61, having detected a failure in the information processing apparatus shown in FIG. 1, sends to the BMC 3. Needless to say, when the BIOS 62 running on the CPU 12 detects a failure, it can notify the BMC 3 using the same format as in FIG. 7.

As shown in the Bit column 121 of FIG. 7, this failure detection notification format carries less information than that of FIG. 6 and consists of, for example, 33 bits. As shown in the description column 122 and the content column 123, bits [0:31] hold the error code, as in FIG. 6. Unlike in FIG. 6, however, bit [32] is a destination investigation request bit that indicates whether investigation of the destination component is requested. When a link failure has been detected, the bit is set to '1' (investigation request identifier) to request investigation of the destination. When a failure other than a link failure has been detected, the bit is set to '0' (investigation-not-required identifier) to indicate that no investigation of the destination is requested.

Using a failure detection notification format such as that of FIG. 7 presupposes, however, that the BMCFW 7 running on the BMC 3 is provided with a destination list or the like from which, when the notification received from the BIOS 61 has the investigation request bit set to '1', the destination component to be investigated can be identified on the basis of the error code contained in the notification.
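
The following sketch illustrates how the compact FIG. 7 format and a destination list might work together. The table contents, the names, and the choice of bit 0 as the least significant bit are all assumptions made for illustration; only the single error code quoted in this description is listed.

```python
# Sketch of the 33-bit notification layout of FIG. 7.
# Assumption: bit 0 is the least significant bit of the packed integer.
CPU12_CODE = 0x0002

# Hypothetical destination list held by the BMCFW:
# error code -> destination component code.
DESTINATION_LIST = {0x0000_0001: CPU12_CODE}


def pack_compact(error_code: int, link_failure: bool) -> int:
    """Place the error code in bits [0:31] and the request bit in bit [32]."""
    return (int(link_failure) << 32) | (error_code & 0xFFFF_FFFF)


def resolve_destination(notification: int):
    """Return the destination component code, or None if no investigation is requested."""
    error_code = notification & 0xFFFF_FFFF
    requested = (notification >> 32) & 1
    if not requested:
        return None
    # Raises KeyError if the error code is missing from the list.
    return DESTINATION_LIST[error_code]
```

The trade-off is that the notification is 15 bits shorter, while the receiver must maintain the mapping from error codes to destinations.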

Next, an example of the operation of the BMCFW 7, the baseboard management controller firmware of the BMC 3, upon receiving a failure detection notification such as that shown in FIG. 6 or FIG. 7 from the BIOS 61 is described with reference to the flowchart of FIG. 5(B).

As shown in FIG. 5(B), when the BMCFW 7 running on the BMC 3 receives a failure detection notification from the BIOS 61 via the interface circuit 41 (step B1), failure processing for identifying the faulty part is started and the process proceeds to step B2.

When failure processing is started, the BMCFW 7 refers to the failure table shown in FIG. 4 on the basis of the error code contained in the received failure detection notification and extracts the suspected faulty components and suspect ratios that correspond to that error code (step B2). For example, when the error code contained in the received notification is '0x0000_0001', then, as shown in the suspect component 1 column 102, suspect component 2 column 103, suspect ratio 1 column 104, and suspect ratio 2 column 105 of the failure table of FIG. 4, the BMCFW 7 extracts a suspect ratio of 70% for the CPU 11 and a suspect ratio of 30% for the CPU 12 as the first suspect ratios, which indicate how likely each component is to be the faulty part.
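
Step B2 amounts to a table lookup, which can be sketched as follows. The dictionary is a partial, hypothetical stand-in for the failure table of FIG. 4 and contains only the row quoted in the text.

```python
# Partial sketch of the failure table of FIG. 4.
# Only the row quoted in the description is filled in; values are percentages.
FAILURE_TABLE = {
    0x0000_0001: {"CPU11": 70, "CPU12": 30},
}


def first_suspect_ratios(error_code: int) -> dict[str, int]:
    """Return a copy of the first suspect ratios for the given error code."""
    return dict(FAILURE_TABLE[error_code])
```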

Next, the BMCFW 7 refers to the information in the received failure detection notification that indicates whether the failure is a link failure (for example, bits [32:47] in the case of FIG. 6, or bit [32] in the case of FIG. 7), checks whether the detected failure was a link failure on the interface circuit 5, and determines whether the destination component of the interface circuit 5 needs to be investigated (step B3). If the notification has the format shown in FIG. 6, then, as described above, bits [32:47] hold a code other than '0x0000' identifying the destination component in the case of a link failure, and '0x0000' otherwise. If the notification has the format shown in FIG. 7, bit [32], the destination investigation request bit, indicates whether a link failure occurred.

If the BMCFW 7 determines that the detected failure is not a link failure and that the destination component need not be investigated (no in step B3), it proceeds directly to step B7 without investigating the destination component, in order to identify the suspected fault location and take action on the basis of the failure detection notification from the BIOS 61 alone.

If, on the other hand, the BMCFW 7 determines that the detected failure is a link failure on the interface circuit 5 and that the destination component needs to be investigated (yes in step B3), it proceeds to step B4, identifies the destination component, and analyzes the state of that component (step B4).

If the failure detection notification from the BIOS 61 has the format shown in FIG. 6, the BMCFW 7 determines the destination component from the destination component code set in bits [32:47] of the received notification, as described above. If the received notification has the format shown in FIG. 7, the BMCFW 7 determines the destination component by searching the destination list or the like on the basis of the error code contained in the notification, as described above. It then accesses the destination component so determined and analyzes its state.

For example, when the information processing apparatus shown in FIG. 1 uses the format of FIG. 6 for the failure detection notification, the destination component code in the notification from the BIOS 61 of the CPU 11, which detected the link failure on the interface circuit 5, is set to '0x0002', the code identifying the CPU 12. The BMCFW 7 therefore treats the CPU 12 as the destination component, directly accesses the status register 22 of the CPU 12 via the interface circuit 42, reads the state of the CPU 12, and analyzes the transmit and receive status of the CPU 12 with respect to the interface circuit 5.

The BMCFW 7 then extracts bits [0:7] of the status register 22 of the CPU 12, which is configured as shown in FIG. 3, and checks whether the CPU 12 has correctly performed the receive operation from the partner CPU 11.

First, the BMCFW 7 examines bit [0] ("received / not received") of the extracted status register 22 of the CPU 12. If bit [0] is '0', it detects that, although the partner CPU 11 performed a transmit operation, the CPU 12 did not perform a receive operation because of some abnormality. If bit [0] is '1', it confirms that some receive operation from the CPU 11 took place via the interface circuit 5.

If some receive operation took place, the BMCFW 7 next examines bit [1] ("receive completed normally or not") of the status register 22. If bit [1] is '0', the receive operation completed normally. If bit [1] is '1', the BMCFW 7 detects that the receive operation could not be carried out correctly because of some abnormality and ended abnormally.

If the receive operation ended abnormally, the BMCFW 7 next examines bits [2:7] of the status register 22. As described above, bits [2:7] of the status register 22 record, bit by bit, whether an abnormality occurred at each stage of progress of the receive operation. That is, bits [2] through [7] correspond to the first through sixth stages of the receive operation, respectively; each bit holds '0' if no abnormality occurred at that stage and '1' if an abnormality occurred.

In summary, if bit [0] of the status register 22 is '0', the BMCFW 7 judges that the receive operation could not be performed because of some abnormality even though the partner CPU 11 performed a transmit operation. Even if bit [0] is '1', the receive operation ended abnormally if any of bits [1:7] is '1'. If bit [0] is '1' and none of bits [1:7] is '1', the receive operation completed normally, and the BMCFW 7 can judge that no abnormality has occurred on the receive side of the destination CPU 12.

Next, the BMCFW 7 extracts bits [8:15] of the status register 22 of the CPU 12 and checks whether the CPU 12 has correctly performed the transmit operation to the partner CPU 11, that is, the operation of returning a response to the transmission from the CPU 11.

First, the BMCFW 7 examines bit [8] ("transmitted / not transmitted") of the extracted status register 22 of the CPU 12. If bit [8] is '0', it detects that, although the CPU 12 had instructed transmission of a response, the transmit operation was not performed because of some abnormality. If bit [8] is '1', it confirms that some transmit operation to the CPU 11 took place via the interface circuit 5.

If some transmit operation took place, the BMCFW 7 next examines bit [9] ("transmit completed normally or not") of the status register 22. If bit [9] is '0', the transmit operation completed normally. If bit [9] is '1', the BMCFW 7 detects that the transmit operation could not be carried out correctly because of some abnormality and ended abnormally.

If the transmit operation ended abnormally, the BMCFW 7 next examines bits [10:15] of the status register 22. As described above, bits [10:15] of the status register 22 record, bit by bit, whether an abnormality occurred at each stage of progress of the transmit operation. That is, bits [10] through [15] correspond to the first through sixth stages of the transmit operation, respectively; each bit holds '0' if no abnormality occurred at that stage and '1' if an abnormality occurred.

In summary, if bit [8] of the status register 22 is '0', the BMCFW 7 judges that the transmit operation to the partner CPU 11 could not be performed because of some abnormality. Even if bit [8] is '1', the transmit operation ended abnormally if any of bits [9:15] is '1'. If bit [8] is '1' and none of bits [9:15] is '1', the BMCFW 7 can judge that the transmit operation completed normally.
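
The receive-side and transmit-side checks described above follow the same pattern, so they can be sketched with one helper. The function name and return structure are hypothetical, and the sketch assumes that bit n of the status register is bit n of an integer counted from the least significant end.

```python
def analyze_direction(status: int, base: int) -> dict:
    """Analyze one direction of the status register.

    base=0 selects the receive field (bits [0:7]);
    base=8 selects the transmit field (bits [8:15]).
    Assumption: bit 0 is the least significant bit of `status`.
    """
    field = (status >> base) & 0xFF
    performed = bool(field & 0x01)      # bit [base]: '1' = operation took place
    abnormal_end = bool(field & 0x02)   # bit [base+1]: '1' = ended abnormally
    # Bits [base+2] .. [base+7] correspond to stages 1 .. 6.
    failed_stages = [s for s in range(1, 7) if field & (1 << (s + 1))]
    healthy = performed and not abnormal_end and not failed_stages
    return {
        "performed": performed,
        "abnormal_end": abnormal_end,
        "failed_stages": failed_stages,
        "healthy": healthy,
    }
```

For instance, `analyze_direction(status, 0)` inspects the receive side and `analyze_direction(status, 8)` the transmit side of the same register value.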

The BMCFW 7 then uses the result of analyzing, in step B4, the transmit and receive status of the CPU 12, the destination component, to determine suspect ratios for both the failure detection source CPU 11 and the destination CPU 12 as the second suspect ratios, which indicate how likely each is to be the faulty part (step B5).

That is, in step B5, suppose the failure detection notification from the BIOS 61 of the CPU 11 contains, for example, the error code '0x0000_0001' and thus reports a link failure. The BMCFW 7 reads the status register 22 of the destination CPU 12. If bits [0] and [8] of the status register 22 are '1', so that the destination CPU 12 performed some transmit and receive operations with the partner CPU 11 on the failure detection side, and no '1' is present in bits [1:7] or bits [9:15], the BMCFW 7 judges that the destination CPU 12 is operating essentially normally.

From this analysis of the transmit and receive status of the destination CPU 12, the BMCFW 7 judges that the suspected faulty component is most likely the CPU 11, and sets the suspect ratio for the failure detection source CPU 11 to, for example, 80% and the suspect ratio for the destination CPU 12 to, for example, 20%.

Suppose, on the other hand, that the failure detection notification from the BIOS 61 of the CPU 11 contains, for example, the error code '0x0000_0001' and thus reports a link failure, and that either of the following holds: bit [0] or bit [8] of the status register 22 of the destination CPU 12 is '0', so that the destination CPU 12 could not perform transmit and receive operations with the partner CPU 11 on the failure detection side; or bits [0] and [8] are '1', so that the destination CPU 12 could perform transmit and receive operations with the partner CPU 11, but a '1' is present in bits [1:7] or bits [9:15] of the status register 22. In either case, the BMCFW 7 judges that the destination CPU 12 is not operating normally.

From this analysis of the transmit and receive status of the destination CPU 12, the BMCFW 7 judges that the suspected faulty component is most likely the CPU 12, and sets the suspect ratio for the destination CPU 12 to, for example, 70% and the suspect ratio for the failure detection source CPU 11 to, for example, 30%.
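
The decision in step B5 can be sketched as a single function over the register value. The percentages are the example values given in the text; the function name and the least-significant-bit-first numbering are assumptions.

```python
def second_suspect_ratios(status: int) -> dict[str, int]:
    """Derive the second suspect ratios (percent) from status register 22.

    Assumption: bit 0 is the least significant bit of `status`.
    """
    receive_field = status & 0xFF          # bits [0:7]
    transmit_field = (status >> 8) & 0xFF  # bits [8:15]
    # Healthy means bits [0] and [8] are '1' and bits [1:7], [9:15] are all '0'.
    destination_healthy = receive_field == 0x01 and transmit_field == 0x01
    if destination_healthy:
        # Destination looks normal, so the detecting CPU 11 is the likelier fault.
        return {"CPU11": 80, "CPU12": 20}
    # Destination shows an abnormality, so the CPU 12 is the likelier fault.
    return {"CPU11": 30, "CPU12": 70}
```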

For the case in which the failure detection source CPU 11 is judged to be the suspected faulty component, the suspect ratio in step B5 (the second suspect ratio, 80% in the example) differs from the suspect ratio in step B2 (the first suspect ratio, 70% in the example) for the following reason. The result of step B2 reflects a judgment made by the BIOS 61 of the transmitting CPU 11 itself, at the time the link failure on the interface circuit 5 was detected, as to whether the transmit operation from the failure detection source CPU 11 to the destination CPU 12 via the interface circuit 5 was carried out correctly. The result of step B5, by contrast, reflects a judgment made by the BMCFW 7, based on the contents read from the status register 22 of the destination CPU 12, as to whether the transmit and receive operations of the destination CPU 12 with the failure detection source CPU 11 via the interface circuit 5 were carried out correctly. The analysis result of step B5 is therefore assumed to be more reliable for failure analysis than that of step B2.

The BMCFW 7 then merges, according to a predetermined rule, the suspect ratios determined in step B2 from the failure detection notification sent by the failure detection source CPU 11 (the first suspect ratios) with the suspect ratios determined in step B5 from the contents read from the status register 22 of the destination CPU 12 (the second suspect ratios), and calculates the final suspect ratios (step B6). As a simple example of the predetermined merging rule, the suspect ratios from the step B2 analysis and the step B5 analysis may be added and divided by 2, and the resulting simple average used as the final suspect ratio of each suspected faulty component.

As a first example, suppose the error code in the failure detection notification received from the CPU 11 is '0x0000_0001' and the component most likely to be faulty is the failure detection source CPU 11 in both step B2 and step B5. As described above, step B2 gives a suspect ratio of 70% for the CPU 11 and 30% for the CPU 12, and step B5 gives 80% for the CPU 11 and 20% for the CPU 12. The simple average of 70% and 80% gives a final suspect ratio of 75% for the CPU 11, and the simple average of 30% and 20% gives a final suspect ratio of 25% for the CPU 12.

As a second example, suppose that, contrary to the first example, the component most likely to be faulty is the destination CPU 12 in both step B2 and step B5, and that both analyses give a suspect ratio of 30% for the CPU 11 and 70% for the CPU 12. The simple average of 30% and 30% gives a final suspect ratio of 30% for the CPU 11, and the simple average of 70% and 70% gives a final suspect ratio of 70% for the CPU 12.

As a third example, suppose the component most likely to be faulty differs between step B2 and step B5, with each step judging that the component whose state it analyzed is the faulty one. Step B2 gives a suspect ratio of 70% for the CPU 11 and 30% for the CPU 12, whereas step B5 gives 30% for the CPU 11 and 70% for the CPU 12. The simple average of 70% and 30% gives a final suspect ratio of 50% for the CPU 11, and the simple average of 30% and 70% gives a final suspect ratio of 50% for the CPU 12.

As a fourth example, suppose the component most likely to be faulty differs between step B2 and step B5, with each step judging that the partner component, whose status register 21 or 22 it did not analyze, is the faulty one. Step B2 gives a suspect ratio of 30% for the CPU 11 and 70% for the CPU 12. Because the step B5 analysis is regarded as more reliable than that of step B2, step B5 is set to give 80% for the CPU 11 and 20% for the CPU 12. The simple average of 30% and 80% gives a final suspect ratio of 55% for the CPU 11, and the simple average of 70% and 20% gives a final suspect ratio of 45% for the CPU 12.

In other words, as the fourth example shows, even when the two sets of suspect ratios are merged by simple averaging, using different criteria for setting the suspect ratios in step B2 and step B5, so that the more reliable step B5 assigns ratios with a larger spread, yields final suspect ratios that agree with step B5. Here the merged result is 55% for the CPU 11 and 45% for the CPU 12, which preserves the step B5 conclusion that the CPU 11 is the more likely faulty component.
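
The simple-average rule and the four examples above can be checked with a short sketch. The function name and the dictionary representation are hypothetical; the percentages are those given in the text.

```python
def merge_simple(first: dict[str, float], second: dict[str, float]) -> dict[str, float]:
    """Merge first and second suspect ratios by simple averaging (step B6)."""
    return {part: (first[part] + second[part]) / 2 for part in first}


# (first suspect ratios from step B2, second suspect ratios from step B5)
examples = [
    ({"CPU11": 70, "CPU12": 30}, {"CPU11": 80, "CPU12": 20}),  # example 1
    ({"CPU11": 30, "CPU12": 70}, {"CPU11": 30, "CPU12": 70}),  # example 2
    ({"CPU11": 70, "CPU12": 30}, {"CPU11": 30, "CPU12": 70}),  # example 3
    ({"CPU11": 30, "CPU12": 70}, {"CPU11": 80, "CPU12": 20}),  # example 4
]

for number, (first, second) in enumerate(examples, start=1):
    print(number, merge_simple(first, second))
```

The four merged results are 75/25, 30/70, 50/50, and 55/45 for the CPU 11 and the CPU 12 respectively, matching the values stated in the examples.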

In conventional failure processing, even when a link failure on the interface circuit 5 was detected, the BIOS 61 analyzed only the state of the failure detection source component, namely the CPU 11, as shown in step A2 of FIG. 5(A), and could point to only one of the failure detection source CPU 11 and the destination CPU 12 as the suspected faulty component, as shown in the suspected faulty component column 84 of FIG. 2.

In the present embodiment, in which the BMC 3 is provided with a failure processing function, the analysis result obtained in step B5 by analyzing the state of the destination component, namely the CPU 12, is merged in step B6 with the analysis result of step B2 to determine the final suspect ratios. The suspected fault location can therefore be indicated more accurately, with a degree of likelihood of failure given for both the CPU 11 and the CPU 12.

Furthermore, as in the fourth example above, even when the step B2 analysis points to the CPU 12 as the suspected faulty component, merging it with the more reliable step B5 analysis raises the final suspect ratio of the CPU 11, which step B5 judged more likely to be faulty, and lowers the final suspect ratio of the CPU 12.

In addition, analyzing the states of both CPUs 11 and 12 allows suspected parts to be indicated with more variation in the suspect ratios than is possible when the judgment relies only on the detecting CPU 11, which improves both the reliability and the accuracy of the failure part determination.

The predetermined rule for merging the analysis results of step B2 and step B5 is not limited to the simple average described above. For example, a weighted average of the two suspect ratios may be used so that the more reliable analysis result of step B5 carries more weight.
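The merge rule described here can be illustrated with a minimal Python sketch. The function name `merge_suspect_ratios`, the `weight_second` parameter, and the sample ratios are hypothetical and do not come from the specification; a weight of 0.5 corresponds to the simple average, and a larger weight favors the step B5 result.

```python
def merge_suspect_ratios(first, second, weight_second=0.5):
    """Merge two {component: percent} maps by a weighted average.

    first         -- suspect ratios derived from the error code (step B2)
    second        -- suspect ratios derived from the peer's status register (step B5)
    weight_second -- hypothetical weight given to `second`; 0.5 is the simple average
    """
    if not 0.0 <= weight_second <= 1.0:
        raise ValueError("weight_second must be between 0 and 1")
    components = set(first) | set(second)
    return {
        c: (1.0 - weight_second) * first.get(c, 0.0)
        + weight_second * second.get(c, 0.0)
        for c in components
    }


# Illustrative values only: step B2 blames the CPU 12, step B5 blames the CPU 11.
step_b2 = {"CPU11": 0.0, "CPU12": 100.0}
step_b5 = {"CPU11": 100.0, "CPU12": 0.0}

simple = merge_suspect_ratios(step_b2, step_b5)
weighted = merge_suspect_ratios(step_b2, step_b5, weight_second=0.75)
```

With these sample inputs the simple average leaves the two CPUs tied, while the weighted variant shifts the final suspect ratio toward the CPU 11, which is the behavior the weighted rule is meant to provide.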

Returning to the flowchart of FIG. 5(B), the failure part is finally determined according to the final suspect ratios obtained by the merge in step B6; the determined failed component is indicated and a measure is taken against it (step B7). For example, when the merge yields a final suspect ratio for the detecting CPU 11 that is larger than that of the communication-destination CPU 12, the CPU 11 is judged to be the failed component, is disconnected from the operational system of the information processing apparatus, and the apparatus is restarted. Because the BIOS 61 of the failed CPU 11 then no longer runs during startup, it cannot detect the failure again, and startup proceeds smoothly.

Next, the operation when a failure other than a link failure relating to the interface circuit 5 is detected is described. Such a failure arises, for example, when the BIOS 61 detects an internal failure of the CPU 11 while initializing the CPU 11, during an internal initial-setting operation that involves no transmission to or reception from the communication-destination CPU 12 via the interface circuit 5.

First, in the flowchart of FIG. 5(A), when the BIOS 61 running on the CPU 11 detects such an internal failure of the CPU 11 during the startup processing of the CPU 11, failure processing is started as described above (step A1). Based on the component being processed when the failure was detected, whether the failure is a link failure, and the suspected component, the BIOS 61 refers to the error code table shown in FIG. 2 and extracts the corresponding error code (step A2). For example, when the CPU 11 is the component being processed and the failure is an internal failure detected during internal processing of the CPU 11, rather than a link failure in transmission or reception with the communication-destination CPU 12 via the interface circuit 5, the suspected component is the CPU 11 and the corresponding error code is '0x0000_1001', as shown in the error code table of FIG. 2.

Next, because the failure analysis in step A2 of FIG. 5(A) found that the failure is not a link failure (no in step A3), a failure detection notification consisting only of the error code, with no code indicating a communication-destination component, is transmitted to the BMC 3 via the interface circuit 41 (step A5). For example, when the component that detected the failure is the CPU 11, the suspected component is the CPU 11, and the failure is not a link failure, the error code in bits [0:31] is '0x0000_1001' as described above, and in the format of FIG. 6 the communication-destination component code in bits [32:47] is set to all zeros, '0x0000'. This notification is transmitted to request that the BMC 3 analyze the suspected failure part.
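As a rough illustration of the notification layout described for FIG. 6, the following Python sketch packs a 32-bit error code followed by a 16-bit communication-destination component code. The function names, the big-endian byte order, the assumption that bits [0:31] come first on the wire, and the sample peer code `0x0012` are all hypothetical; the specification only states which bit ranges hold which field and that `0x0000` means no link failure.

```python
import struct

NO_PEER = 0x0000  # component code used when the failure is not a link failure

# Assumed wire layout: 4-byte error code, then 2-byte component code, big-endian.
_LAYOUT = ">IH"


def build_notification(error_code, peer_code=NO_PEER):
    """Return the 6-byte failure detection notification."""
    return struct.pack(_LAYOUT, error_code, peer_code)


def parse_notification(payload):
    """Return (error_code, peer_code, investigation_requested)."""
    error_code, peer_code = struct.unpack(_LAYOUT, payload)
    return error_code, peer_code, peer_code != NO_PEER


# Internal failure of the CPU 11: no peer investigation is requested.
internal = build_notification(0x0000_1001)

# Link failure: the error code value and peer code here are placeholders.
link = build_notification(0x0000_1001, peer_code=0x0012)
```

Parsing `internal` yields a component code of `0x0000`, which is how the BMC side would recognize that no peer needs to be examined.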

Next, as shown in FIG. 5(B), when the BMCFW 7 running on the BMC 3 receives the failure detection notification from the BIOS 61 via the interface circuit 41 (step B1), failure processing for identifying the failure part is started. Based on the error code contained in the received notification, the BMCFW 7 refers to the failure table shown in FIG. 4 and extracts the suspected components and suspect ratios corresponding to that error code (step B2).

For example, when the error code in the received notification is '0x0000_1001', the BMCFW 7 extracts, as the first suspect ratios indicating the likelihood of each failure part, a suspect ratio of 100% for the CPU 11 and 0% for the CPU 12, as shown under suspected component 1 (102), suspected component 2 (103), suspect ratio 1 (104), and suspect ratio 2 (105) in the failure table of FIG. 4.
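The table lookup of step B2 might be sketched in Python as follows. Only the `0x0000_1001` row is taken from the description; the table name, the function name, and the error handling for unknown codes are assumptions made for illustration.

```python
# Failure table keyed by error code. Each row lists (suspected component, ratio in %).
# Only this one row is stated in the description of FIG. 4.
FAILURE_TABLE = {
    0x0000_1001: (("CPU11", 100), ("CPU12", 0)),
}


def first_suspect_ratios(error_code):
    """Return the first suspect ratios as a {component: percent} map."""
    try:
        row = FAILURE_TABLE[error_code]
    except KeyError:
        # How an unknown code is handled is not specified; raising is an assumption.
        raise ValueError(f"unknown error code: {error_code:#010x}") from None
    return dict(row)


ratios = first_suspect_ratios(0x0000_1001)
```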

Next, the BMCFW 7 refers to the information in the received notification that indicates whether the failure is a link failure, checks whether the detected failure was a link failure, and determines whether the communication-destination component needs to be investigated (step B3). If the notification has the format shown in FIG. 6, bits [32:47] are set to '0x0000' when the failure is not a link failure, as described above.

When the BMCFW 7 determines that the detected failure is not a link failure and that the communication-destination component need not be investigated (no in step B3), it proceeds directly to step B7 without investigating that component, so that the suspected part is indicated and handled based only on the failure detection notification from the BIOS 61. The failed component is determined according to the suspect ratios extracted from the failure table in step B2, the determined component is indicated, and a measure is taken against it (step B7).

That is, when the error code in the received notification is '0x0000_1001', the analysis result of step B2 gives the CPU 11 a suspect ratio of 100%, so the CPU 11 is judged to be the failed component, is disconnected from the operational system of the information processing apparatus, and the apparatus is restarted. Because the BIOS 61 of the failed CPU 11 then no longer runs during startup, it cannot detect the failure again, and startup proceeds smoothly.
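The overall BMC-side flow of FIG. 5(B), covering both the link-failure branch and the non-link-failure branch, can be summarized in a short Python sketch. The function name `diagnose`, the `analyze_peer` callback that stands in for reading and analyzing the peer's status register, the use of the simple average, and the sample table rows other than `0x0000_1001` are hypothetical. Tie-breaking between equal ratios is not specified in the description and is left arbitrary here.

```python
NO_PEER = 0x0000  # component code meaning "no link failure, no peer to examine"


def diagnose(error_code, peer_code, failure_table, analyze_peer=None):
    """Return (failed_component, final_ratios) for one failure notification."""
    first = failure_table[error_code]                 # step B2: table lookup
    if peer_code == NO_PEER:                          # step B3: no
        final = dict(first)                           # use first ratios only
    else:                                             # step B3: yes
        second = analyze_peer(peer_code)              # peer analysis (through step B5)
        components = set(first) | set(second)
        final = {                                     # step B6: simple average
            c: (first.get(c, 0) + second.get(c, 0)) / 2 for c in components
        }
    failed = max(final, key=final.get)                # step B7: highest ratio
    return failed, final


# Internal failure of the CPU 11, as in the example above.
table = {0x0000_1001: {"CPU11": 100, "CPU12": 0}}
failed, final = diagnose(0x0000_1001, NO_PEER, table)
```

In the non-link case the callback is never invoked, which mirrors the direct transition from step B3 to step B7.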

(Description of the effects of this embodiment)
As described in detail above, the present embodiment provides the following effects.

In the conventional technique, when the BIOS 61 of the CPU 11 detects a link failure relating to the interface circuit 5 during startup of the information processing apparatus, communication with the CPU 12 is cut off, so the failure part has to be analyzed based only on information from the CPU 11 side.

In the present embodiment, by contrast, when the BIOS 61 of the CPU 11 that detected the link failure sends the failure detection notification to the BMC 3, it not only reports the error code indicating its analysis of the suspected component but also requests an investigation of the state of the communication-destination CPU 12. On receiving this request, the BMC 3 accesses the CPU 12 and can analyze its state, which the BIOS 61 of the CPU 11 could not access. This improves the reliability of the suspected-component determination.

Also, in the present embodiment, the result of the BIOS 61 of the detecting CPU 11 analyzing the state of its own component, the CPU 11, and the result of the BMC 3 analyzing the state of the communication-destination component, the CPU 12, can both be referenced and merged as appropriate, so the failure part can be indicated with higher reliability than in the conventional technique. Suspected parts can thus be indicated in finer detail for the detected failure, which improves the accuracy of failure indication.

The effects of this embodiment are summarized as follows.

The first effect is that the suspected failure part can be indicated accurately even for a link failure relating to the interface circuit 5 that the BIOS 61 running on the CPU 11 detects during startup of the information processing apparatus. This is because the BMC 3, on receiving the failure detection notification from the BIOS 61, can analyze the failure part using not only the suspected-component information analyzed by the BIOS 61 but also the suspected-component information obtained by directly accessing the communication-destination component of the failed interface circuit 5 (for example, the CPU 12) and analyzing its state.

The second effect is that the state of a component that the BIOS 61 cannot access during startup of the information processing apparatus (for example, the CPU 12) can be ascertained. This is because the BIOS 61 includes in the failure detection notification, as an investigation request, a code identifying the component it wants examined (for example, the CPU 12), and the BMC 3 that receives the notification directly accesses the requested component.

The third effect is that the startup time of the information processing apparatus can be shortened. This is because, even when a link failure relating to an interface circuit is detected at startup, the BMC 3 can indicate the suspected failure part accurately, which prevents the disconnection processing and the reboot from being repeated for the same failure.

(Other embodiments of the present invention)
Next, a configuration of the information processing apparatus according to the present invention that differs from the embodiment shown in FIG. 1 is described with reference to FIG. 8. FIG. 8 is a block diagram showing an example of the block configuration of the information processing apparatus according to the present invention that differs from FIG. 1. Unlike the apparatus of FIG. 1, which has only the two CPUs 11 and 12, the information processing apparatus shown in FIG. 8 is composed of a plurality of CPUs (four in FIG. 8), a plurality of IO Hubs (control devices for connecting IO equipment; two in FIG. 8), and a plurality of NCs (Network Controllers; two in FIG. 8). Each component is connected to the baseboard management controller BMC via its own interface circuit, and the baseboard management controller BMC can directly access the status register provided in each component.

That is, in the information processing apparatus shown in FIG. 8, the CPU 11 and the CPU 12 are connected by the interface circuit 51, and the CPUs 11 and 12 are connected to the same IO Hub 201 by the interface circuits 221 and 222, respectively, and to the same NC 211 by the interface circuits 231 and 232, respectively. Similarly, the CPU 13 and the CPU 14 are connected by the interface circuit 52, and the CPUs 13 and 14 are connected to the same IO Hub 202 by the interface circuits 223 and 224, respectively, and to the same NC 212 by the interface circuits 233 and 234, respectively.

In addition, the IO Hub 201 and the NC 211 are connected by the interface circuit 53, the IO Hub 202 and the NC 212 are connected by the interface circuit 54, and the NC 211 and the NC 212 are connected by the interface circuit 55.
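The connections of FIG. 8 described above can be written down as a small lookup table, which is one way a BMC could find the communication-destination component for a failed interface circuit. This is only a sketch: the names `LINKS` and `peer_of` and the component labels are hypothetical, and the pairing of circuits 221 to 224 and 231 to 234 with individual CPUs assumes the "respectively" reading of the description.

```python
# Interface circuit number -> the two components it connects (per FIG. 8).
LINKS = {
    51: ("CPU11", "CPU12"),
    52: ("CPU13", "CPU14"),
    53: ("IOHub201", "NC211"),
    54: ("IOHub202", "NC212"),
    55: ("NC211", "NC212"),
    221: ("CPU11", "IOHub201"),
    222: ("CPU12", "IOHub201"),
    223: ("CPU13", "IOHub202"),
    224: ("CPU14", "IOHub202"),
    231: ("CPU11", "NC211"),
    232: ("CPU12", "NC211"),
    233: ("CPU13", "NC212"),
    234: ("CPU14", "NC212"),
}


def peer_of(circuit, component):
    """Return the component at the other end of `circuit` from `component`."""
    a, b = LINKS[circuit]
    if component == a:
        return b
    if component == b:
        return a
    raise ValueError(f"{component} is not attached to interface circuit {circuit}")


peer = peer_of(51, "CPU11")
```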

The CPUs 11, 12, 13, and 14 incorporate the BIOSes 61, 62, 63, and 64, respectively. The CPUs 11, 12, 13, and 14, the IO Hubs 201 and 202, and the NCs 211 and 212 are provided with the status registers 21, 22, 23, 24, 25, 26, 27, and 28, respectively. These same components are connected to the BMC 3 by the interface circuits 41, 42, 43, 44, 45, 46, 47, and 48, respectively.

In an information processing apparatus configured as in FIG. 8, the BIOS 61 running on the CPU 11, for example, links up the interface circuits 51 to 55, 221 to 224, and 231 to 234 during startup of the apparatus. Even in this configuration, when an interface failure occurs during link-up of any of the interface circuits 51 to 55, 221 to 224, or 231 to 234, exactly the same failure processing as in the embodiment described above determines the suspected components and the final suspect ratios, so the failure part can be identified accurately.

The information processing apparatus according to the present invention is not limited to an apparatus composed of the two components of FIG. 1, the CPUs 11 and 12, or the eight components of FIG. 8, the CPUs 11, 12, 13, and 14, the IO Hubs 201 and 202, and the NCs 211 and 212; it can be an apparatus composed of any desired number of components.

The configuration of the information processing apparatus is likewise not limited to that of FIG. 1 or FIG. 8. Any configuration may be used in which each component, such as the CPUs 11, 12, 13, and 14, the IO Hubs 201 and 202, and the NCs 211 and 212, has an interface circuit for connecting to the BMC 3, and in which the BIOSes 61, 62, 63, 64, 65, 66, 67, and 68 incorporated in the components cooperate with the BMCFW 7 incorporated in the BMC 3 so that each component can transmit to the BMC 3 a failure detection notification as described in the above embodiment.

In the above description of the embodiments, the BIOS 61 running on the CPU 11 starts up the information processing apparatus, but the apparatus may also be started up by a BIOS running on another CPU.

The configurations of preferred embodiments of the present invention have been described above. Note, however, that these embodiments are merely examples of the present invention and do not limit it in any way. Those skilled in the art will readily understand that various modifications and changes can be made according to the specific application without departing from the gist of the present invention.

3 BMC
5 Interface circuit
7 BMCFW
11 CPU
12 CPU
13 CPU
14 CPU
21 Status register
22 Status register
23 Status register
24 Status register
25 Status register
26 Status register
27 Status register
28 Status register
41 Interface circuit
42 Interface circuit
43 Interface circuit
44 Interface circuit
45 Interface circuit
46 Interface circuit
47 Interface circuit
48 Interface circuit
51 Interface circuit
52 Interface circuit
53 Interface circuit
54 Interface circuit
55 Interface circuit
61 BIOS
62 BIOS
63 BIOS
64 BIOS
81 Error code
82 Processing target component
83 Link failure
84 Suspected failure component
91 Bit column
92 Description column
93 Content column
101 Error code
102 Suspected component 1
103 Suspected component 2
104 Suspect ratio 1
105 Suspect ratio 2
111 Bit column
112 Description column
113 Content column
121 Bit column
122 Description column
123 Content column
201 IO Hub
202 IO Hub
211 NC
212 NC
221 Interface circuit
222 Interface circuit
223 Interface circuit
224 Interface circuit
231 Interface circuit
232 Interface circuit
233 Interface circuit
234 Interface circuit

Claims

An information processing apparatus having a multiprocessor configuration composed of a plurality of processors connected to one another via an interface circuit, and comprising a baseboard management controller (BMC: Baseboard Management Controller) that is connected to each processor and manages and monitors each connected processor, wherein firmware (BMCFW: BMC Firmware) operating on the baseboard management controller (BMC) has, in cooperation with a basic input/output system (BIOS: Basic Input/Output System) operating on each processor, a failure processing function of determining a failure part and separating the failure part from the operational system; wherein, when the basic input/output system (BIOS) detects a failure while executing a startup operation of the apparatus, it transmits to the baseboard management controller (BMC) a failure detection notification consisting of an error code obtained by analyzing the operating state held in a status register provided in the processor on which that basic input/output system (BIOS) operates and, when the detected failure is determined to be a link failure relating to the interface circuit, a communication-destination processor investigation request that requests analysis of the state of the counterpart processor at the communication destination of the interface circuit; and wherein the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification determines, for the processors connected via the interface circuit, a ratio indicating the likelihood of each suspected failure part as a first suspect ratio based on the error code included in the failure detection notification, determines, when the communication-destination processor investigation request is included in the failure detection notification, a ratio indicating the likelihood of failure of each suspected failure part as a second suspect ratio based on the result of reading and analyzing the operating state held in the status register provided in the counterpart processor at the communication destination of the interface circuit in which the failure was detected, merges the determined first suspect ratio and second suspect ratio according to a predetermined rule to obtain a final suspect ratio, determines the part having the highest final suspect ratio to be the failure part, and separates the failure part from the operational system.

The information processing apparatus according to claim 1, wherein, when the basic input/output system (BIOS) that detected the failure determines that the detected failure is a link failure relating to the interface circuit, it sets, in the communication-destination processor investigation request included in the failure detection notification, a component code identifying the counterpart processor at the communication destination of the interface circuit in which the link failure was detected, and the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification reads the operating state held in the status register of the counterpart processor identified by the component code included in the failure detection notification, determines the second suspect ratio, and determines the failure part based on the first and second suspect ratios; whereas, when the basic input/output system (BIOS) that detected the failure determines that the detected failure is not a link failure relating to the interface circuit but a failure within the processor on which that basic input/output system (BIOS) operates, it sets, in the communication-destination processor investigation request included in the failure detection notification, a code different from the component codes that identify processors, and the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification treats the notification as containing no request to investigate the counterpart processor and determines the failure part using only the first suspect ratio, without reading the status register of the counterpart processor.

The basic input / output system (BIOS) that detects the failure sets an identifier indicating the investigation request in the communication destination processor investigation request included in the failure detection notification when the detected failure is determined to be a link failure relating to the interface circuit. Then, the firmware (BMCFW) that operates on the baseboard management controller (BMC) that has received the failure detection notification refers to the communication destination partner processor identified by referring to the communication destination list provided in advance. By reading the operation state held by the status register and determining the second suspect ratio, the operation of discriminating the failed part is performed based on the first and second suspect ratios, while a failure is detected. In the basic input / output system (BIOS), the detected fault is If it is determined that the failure is not a link failure related to the circuit but a failure in the processor in which the basic input / output system (BIOS) operates, an identifier indicating that no investigation is required is set in the communication destination processor investigation request included in the failure detection notification. The firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification assumes that there is no investigation request for the other party processor of the communication destination, and the status of the other party processor of the communication destination 2. The information processing apparatus according to claim 1, wherein the operation of determining the failed part is performed using only the first suspect ratio without performing a register reading operation.

As the rule for merging the first suspicion rate and the second suspicion rate, a simple average of the first suspicion rate and the second suspicion rate is obtained, or the second 4. The information processing apparatus according to claim 1, wherein any one of rules for obtaining a weighted average obtained by assigning a predetermined appropriate weight to the suspect ratio is used.

A failure part determination method for determining a failure part in an information processing apparatus having a multiprocessor configuration composed of a plurality of processors connected to one another via an interface circuit, the apparatus comprising a baseboard management controller (BMC: Baseboard Management Controller) that is connected to each processor and manages and monitors each connected processor, wherein firmware (BMCFW: BMC Firmware) operating on the baseboard management controller (BMC) has, in cooperation with a basic input/output system (BIOS: Basic Input/Output System) operating on each processor, a failure processing function of determining a failure part and separating the failure part from the operational system; wherein, when the basic input/output system (BIOS) detects a failure while executing a startup operation of the apparatus, it transmits to the baseboard management controller (BMC) a failure detection notification consisting of an error code obtained by analyzing the operating state held in a status register provided in the processor on which that basic input/output system (BIOS) operates and, when the detected failure is determined to be a link failure relating to the interface circuit, a communication-destination processor investigation request that requests analysis of the state of the counterpart processor at the communication destination of the interface circuit; and wherein the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification determines, for the processors connected via the interface circuit, a ratio indicating the likelihood of each suspected failure part as a first suspect ratio based on the error code included in the failure detection notification, determines, when the communication-destination processor investigation request is included in the failure detection notification, a ratio indicating the likelihood of failure of each suspected failure part as a second suspect ratio based on the result of reading and analyzing the operating state held in the status register provided in the counterpart processor at the communication destination of the interface circuit in which the failure was detected, merges the determined first suspect ratio and second suspect ratio according to a predetermined rule to obtain a final suspect ratio, determines the part having the highest final suspect ratio to be the failure part, and separates the failure part from the operational system.

The failure part determination method according to claim 5, wherein, when the basic input/output system (BIOS) that detected the failure determines that the detected failure is a link failure relating to the interface circuit, it sets, in the communication-destination processor investigation request included in the failure detection notification, a component code identifying the counterpart processor at the communication destination of the interface circuit in which the link failure was detected, and the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification reads the operating state held in the status register of the counterpart processor identified by the component code included in the failure detection notification, determines the second suspect ratio, and determines the failure part based on the first and second suspect ratios; whereas, when the basic input/output system (BIOS) that detected the failure determines that the detected failure is not a link failure relating to the interface circuit but a failure within the processor on which that basic input/output system (BIOS) operates, it sets, in the communication-destination processor investigation request included in the failure detection notification, a code different from the component codes that identify processors, and the firmware (BMCFW) operating on the baseboard management controller (BMC) that has received the failure detection notification treats the notification as containing no request to investigate the counterpart processor and determines the failure part using only the first suspect ratio, without reading the status register of the counterpart processor.

6. The failure site determination method according to claim 5, wherein: when the basic input/output system (BIOS) that detected the failure determines that the detected failure is a link failure related to the interface circuit, the BIOS sets, in the communication destination processor investigation request included in the failure detection notification, an identifier indicating that an investigation is requested, and the firmware (BMCFW) operating on the baseboard management controller (BMC) that received the failure detection notification identifies the communication destination processor by referring to a communication destination list provided in advance, reads the operation state held in the status register of that processor, determines the second suspect ratio, and discriminates the failed part based on the first and second suspect ratios; whereas, when the BIOS that detected the failure determines that the detected failure is not a link failure related to the interface circuit but a failure within the processor on which the BIOS operates, the BIOS sets, in the communication destination processor investigation request included in the failure detection notification, an identifier indicating that no investigation is required, and the BMCFW operating on the BMC that received the failure detection notification treats the communication destination processor as not subject to an investigation request and discriminates the failed part using only the first suspect ratio, without reading the status register of the communication destination processor.
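
In this variant the notification carries only a yes/no identifier, and the BMCFW resolves the partner processor from a list prepared in advance. The sketch below assumes a two-processor configuration and a list keyed by the reporting processor; the flag values, the `DESTINATION_LIST` layout, and the function name are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical identifier values for the investigation request field.
INVESTIGATION_REQUESTED = 1
NO_INVESTIGATION_REQUIRED = 0

# Hypothetical communication destination list held by the BMCFW:
# reporting processor -> processor at the other end of the interface.
DESTINATION_LIST: Dict[str, str] = {"CPU11": "CPU12", "CPU12": "CPU11"}


def resolve_investigation_target(
    reporting_processor: str, request_identifier: int
) -> Optional[str]:
    """Return the processor whose status register should be read.

    None means no status register is read and only the first suspect
    ratio is used to discriminate the failed part.
    """
    if request_identifier != INVESTIGATION_REQUESTED:
        return None
    return DESTINATION_LIST.get(reporting_processor)
```

Compared with carrying a component code in the notification, this design keeps the topology knowledge in the BMCFW, so the BIOS only has to report whether the failure is a link failure.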

8. The failure site determination method according to claim 5, wherein the rule for merging the first suspect ratio and the second suspect ratio is either a rule that takes the simple average of the first suspect ratio and the second suspect ratio, or a rule that takes a weighted average in which a predetermined, appropriate weight is assigned to the second suspect ratio.
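
The two merge rules named in this claim could look like the sketch below. The claim does not say how the weight is normalized, so the form used here (weight 1 on the first ratio, weight w on the second, divided by 1 + w) is an assumption, as are the function names.

```python
def simple_average(first: float, second: float) -> float:
    """Simple average of the first and second suspect ratios."""
    return (first + second) / 2.0


def weighted_average(first: float, second: float, second_weight: float) -> float:
    """Weighted average with a predetermined weight on the second ratio.

    Assumed normalization: (first + w * second) / (1 + w). With w == 1
    this reduces to the simple average.
    """
    if second_weight < 0:
        raise ValueError("second_weight must be non-negative")
    return (first + second_weight * second) / (1.0 + second_weight)


if __name__ == "__main__":
    # Illustrative values only: first ratio 60 %, second ratio 20 %.
    print(simple_average(60.0, 20.0))         # 40.0
    print(weighted_average(60.0, 20.0, 3.0))  # 30.0
```

A weight above 1 lets the reading from the partner processor's status register count for more than the error code alone, which is one plausible reason to prefer the weighted rule.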

9. A failure site determination program that implements the failure site determination method according to claim 5 as a program executable by a computer.