JP2017207903A - Processor, method and program

Info

Publication number: JP2017207903A
Application number: JP2016099624A
Authority: JP
Inventors: Kengo Suzuki (鈴木 健吾)
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2016-05-18
Filing date: 2016-05-18
Publication date: 2017-11-24
Anticipated expiration: 2036-05-18
Also published as: JP6504610B2

Abstract

PROBLEM TO BE SOLVED: To provide a processor, a method and a program capable of easily analyzing a failure in a main processing system when the main processing system fails.

SOLUTION: The disclosed processor 10 includes: a main processing part 11 that performs main processing; a failure detection part 13 that is disposed outside the main processing part 11, detects the occurrence of a failure in the main processing part 11, and acquires state information of the main processing part 11 that is notified irrespective of the occurrence of a failure; and a monitoring control part 12 that selects, from the state information acquired by the failure detection part 13, the state information corresponding to the time at which the failure occurred and stores it in a state accessible from the outside.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a processing apparatus, a method, and a program, and more particularly to a processing apparatus, a method, and a program that include a monitoring control unit that outputs information for analyzing a failure.

In a processing apparatus such as a server, the operation of a main processing system having a CPU (Central Processing Unit) is commonly monitored by a monitoring control system having a separate CPU, in order to improve reliability and availability. That is, when a failure occurs in the main processing system of the processing apparatus, the monitoring control system handles it. For example, when a failure such as a deadlock occurs in the main processing system, the monitoring control system detects the deadlock and deals with it. However, although the monitoring control system can recognize that a deadlock has occurred, it cannot determine the detailed state of the main processing system, which hinders analysis of the failure.

Various techniques for detecting failures in processing apparatuses have been proposed. One of them is disclosed in Patent Document 1. Patent Document 1 discloses that, when a failure occurs in an information processing apparatus having an internal watchdog timer, the watchdog timer is used to recognize that the failure has occurred. That is, a failure in the information processing apparatus is detected using the timeout signal of the internal watchdog timer. However, Patent Document 1 does not address the case where the internal watchdog timer itself fails.

Patent Document 1 also discloses that, when a failure occurs in the information processing apparatus, BMC (Baseboard Management Controller) firmware of the information processing apparatus detects a stall of the BIOS (Basic Input Output System), generates an SMI (System Management Interrupt), and collects the CPU log information gathered by the BIOS. However, Patent Document 1 does not address the case where a failure occurs in the SMI of the processing apparatus.

Patent Document 2 discloses a monitoring control device having a central processing unit and a memory that stores a processing program for the central processing unit. The device includes a port that outputs operation information of the central processing unit to an external processing device when a predetermined checkpoint in the program is processed, and storage means that stores the operation information output to the port. However, Patent Document 2 does not disclose conveying the detailed state of the main processing system to the monitoring control system when a failure occurs in the processing apparatus.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2015-130023
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2000-293407

As described above, when a deadlock occurs in the main processing system of the processing apparatus, the monitoring control system can recognize that the deadlock has occurred but cannot determine the detailed state of the main processing system, which hinders analysis of the failure.

The present invention has been made to solve this problem, and its object is to provide a processing apparatus, a method, and a program that make it easy to analyze a failure of the main processing system when such a failure occurs.

A processing apparatus according to the present invention includes: a main processing unit that performs main processing; a failure detection unit that is provided outside the main processing unit, detects the occurrence of a failure in the main processing unit, and acquires state information of the main processing unit that is notified regardless of whether a failure has occurred; and a monitoring control unit that selects, from the state information acquired by the failure detection unit, the state information corresponding to the time of the failure and stores it in an externally accessible state.

A method according to the present invention includes the steps of: detecting a failure of a main processing unit; acquiring state information of the main processing unit that is notified regardless of whether a failure has occurred; selecting, from the acquired state information, the state information corresponding to the time of the failure; and storing the selected state information in an externally accessible state.

A program according to the present invention causes a computer to execute the steps of: detecting a failure of a main processing unit; acquiring state information of the main processing unit that is notified regardless of whether a failure has occurred; selecting, from the acquired state information, the state information corresponding to the time of the failure; and storing the selected state information in an externally accessible state.

According to the present invention, it is possible to provide a processing apparatus, a method, and a program that make it easy to analyze a failure of the main processing system when such a failure occurs.

FIG. 1 is a block diagram illustrating a processing apparatus according to Embodiment 1.
FIG. 2 is a block diagram illustrating the processing apparatus according to Embodiment 1.
FIG. 3 is a sequence diagram illustrating the operation of the processing apparatus according to Embodiment 1.
FIG. 4 is a diagram illustrating event information generated by the platform event generation unit.
FIG. 5 is a block diagram illustrating a processing apparatus according to Comparative Example 1 of Embodiment 1.
FIG. 6 is a sequence diagram illustrating the operation of the processing apparatus according to Comparative Example 1 of Embodiment 1.
FIG. 7 is a block diagram illustrating a processing apparatus according to Embodiment 2.
FIG. 8 is a block diagram illustrating a processing apparatus according to Embodiment 3.
FIG. 9 is a block diagram illustrating a part of the processing apparatus according to Embodiment 3.

[Embodiment 1]
Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a block diagram illustrating a processing apparatus according to Embodiment 1.

As shown in FIG. 1, the processing apparatus 10 according to Embodiment 1 includes a main processing unit 11, a monitoring control unit 12, and a failure detection unit 13. The main processing unit 11 performs main processing, such as processing the information provided to the outside. The failure detection unit 13 is provided outside the main processing unit 11 and detects a failure of the main processing unit 11. The failure detection unit 13 also acquires the state information of the main processing unit 11 that the main processing unit 11 notifies regardless of whether a failure has occurred. The monitoring control unit 12 selects, from the state information acquired by the failure detection unit 13, the state information corresponding to the time of the failure and stores it in an externally accessible state.

The processing apparatus 10 according to Embodiment 1 will now be described in detail.
FIG. 2 is a block diagram illustrating the processing apparatus according to Embodiment 1.

The processing apparatus 10 further includes a display unit 115. The main processing unit 11 includes a CPU 111 and storage units 112 to 114.

The processing apparatus 10 is, for example, a web server or a file server. The CPU 111 includes a central processing unit (CPU), implements the main functions of the processing apparatus 10, and processes the information to be output and provided to the outside. The storage unit 112 is, for example, a ROM (Read-Only Memory) and stores a boot loader such as a BIOS (Basic Input Output System). The storage unit 113 is, for example, a RAM (Random Access Memory) and serves as the main storage of the processing apparatus 10. The storage unit 114 stores an OS (Operating System), file systems, and applications.

The BIOS is a kind of firmware: among the programs installed in a processing apparatus such as a computer, it is the one that performs the lowest-level input and output with the hardware. The BIOS is executed when the processing apparatus is powered on. Its functions include initializing the hardware of the processing apparatus and calling the boot loader from a storage unit.

Booting means starting up the processing apparatus and covers the processing from power-on until the operating environment of the processing apparatus, such as the OS (Operating System), has been brought up. A boot loader is a program that, at boot time, loads the programs needed to bring up the operating environment of the processing apparatus.

The failure detection unit 13 includes a watchdog timer (WDT) control unit 131, a CPU state holding unit 134, an alarm generation unit 132, and an alarm additional information generation unit 133.

The failure detection unit 13 is provided inside the processing apparatus 10 but outside the main processing unit 11. The CPU state holding unit 134 of the failure detection unit 13 acquires the state information of the main processing unit 11, which the main processing unit 11 transmits, for example, periodically.

Failures that occur in the main processing unit 11 are described next. One example is a deadlock. A deadlock can be detected using a watchdog timer. The watchdog timer may be one built into an SoC (System-on-a-chip) or an external one. An external watchdog timer may be chosen in view of the resources available in the SoC and ease of future maintenance.

This example uses an external watchdog timer. The processing apparatus 10 has a watchdog timer 131a in the watchdog timer control unit 131 of the failure detection unit 13 and uses it as the external watchdog timer. The watchdog timer control unit 131 uses the watchdog timer 131a to detect a deadlock of the CPU 111 of the main processing unit 11. Specifically, the main processing unit 11 transmits to the failure detection unit 13 a detection signal used to detect the occurrence of a failure in the main processing unit 11; this detection signal is called a reload request. While the CPU 111 is operating normally, it transmits the detection signal within a predetermined period α, and it may do so periodically. If the watchdog timer 131a receives no detection signal from the CPU 111 for the predetermined period α, the watchdog timer control unit 131 determines that a failure such as a deadlock has occurred in the CPU 111. In this way, the failure detection unit 13 acquires failure information such as timeout information.

When the watchdog timer control unit 131 determines that a failure has occurred, it notifies the alarm additional information generation unit 133 and the alarm generation unit 132 of failure information such as timeout information.
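The reload-and-timeout behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `WatchdogTimer`, the `subscribe` callback mechanism, the dictionary layout of the timeout information, and the use of explicit `now` timestamps (instead of a hardware counter) are all assumptions made for the example.

```python
from typing import Callable, List


class WatchdogTimer:
    """External watchdog: expires if no reload request arrives within alpha."""

    def __init__(self, alpha: float, now: float = 0.0) -> None:
        self.alpha = alpha          # predetermined period (alpha in the text)
        self.last_reload = now      # time of the most recent reload request
        self.expired = False
        # Hypothetical listeners standing in for units 132 and 133.
        self._listeners: List[Callable[[dict], None]] = []

    def subscribe(self, listener: Callable[[dict], None]) -> None:
        self._listeners.append(listener)

    def reload(self, now: float) -> None:
        """Reload request (detection signal) sent by the CPU while healthy."""
        if not self.expired:
            self.last_reload = now

    def tick(self, now: float) -> bool:
        """Checked by the failure detection side; True once timed out."""
        if not self.expired and now - self.last_reload >= self.alpha:
            self.expired = True
            info = {"kind": "watchdog_timeout", "detected_at": now}
            for listener in self._listeners:
                listener(info)      # notify each listener exactly once
        return self.expired
```

Because the timer lives outside the monitored CPU, `tick` keeps working even when the CPU has stopped issuing `reload` calls, which is the point of placing the watchdog in the failure detection unit.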

A deadlock is a failure in which multiple processes each wait for the other to release a resource it holds, so that processing stops. The term "watchdog timer" may also refer to a watchdog timer function.

The CPU state holding unit 134 holds the state information of the CPU 111 notified by the CPU 111, such as the BIOS status at startup of the processing apparatus 10 (the POST (Power-On Self-Test) status). The CPU state holding unit 134 also displays the state information of the CPU 111 on the display unit 115.
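As a rough sketch of this holding behavior, the following keeps only the most recently notified status and forwards each notification to a display callback. The class name `CpuStateHolder`, the integer status codes, and the `display` callback are hypothetical; the source does not specify how the holding unit is built.

```python
from typing import Callable, Optional


class CpuStateHolder:
    """Holds the latest status notified by the CPU and mirrors it to a display."""

    def __init__(self, display: Callable[[int], None]) -> None:
        self._display = display              # stand-in for display unit 115
        self._latest: Optional[int] = None   # no status received yet

    def notify(self, status_code: int) -> None:
        """Status notification from the CPU (sent regardless of failures)."""
        self._latest = status_code
        self._display(status_code)

    def current(self) -> Optional[int]:
        """Status as of the last notification; still readable if the CPU hangs."""
        return self._latest
```

Since the value is stored outside the CPU, it remains available after the CPU stops responding, and the last stored code indicates how far processing had progressed.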

The alarm additional information generation unit 133 converts the failure information, such as the timeout information notified by the watchdog timer control unit 131, and the state information of the CPU 111 held in the CPU state holding unit 134 into a format for notifying the BMC 122. The converted failure information and state information are called alarm additional information. The alarm additional information generation unit 133 outputs the alarm additional information to the alarm additional information acquisition unit 122b of the BMC 122.

The alarm generation unit 132 outputs the failure information, such as the timeout information notified by the watchdog timer control unit 131, to the alarm acquisition unit 122c of the BMC 122. The information output from the alarm generation unit 132 is called alarm information.

The monitoring control unit 12 includes a BMC 122 and a storage unit 121. The BMC 122 has a CPU separate from the one in the CPU 111. The storage unit 121 is, for example, an EEPROM (Electrically Erasable Programmable Read-Only Memory).

The BMC 122 includes a platform event (PlatformEvent) generation unit 122a, an alarm additional information acquisition unit 122b, and an alarm acquisition unit 122c. The platform event generation unit 122a converts the alarm information and the alarm additional information into event information, described later. The alarm additional information acquisition unit 122b acquires the alarm additional information output from the alarm additional information generation unit 133 and, from the acquired state information, selects and stores the state information corresponding to the time of the failure. The alarm acquisition unit 122c acquires the failure information output from the alarm generation unit 132. State information (status information) is notified to the alarm additional information acquisition unit 122b via paths L11 to L13 shown in FIG. 2, and alarm information is notified to the alarm acquisition unit 122c via paths L21 to L23.

The event information produced by the platform event generation unit 122a is stored in the storage unit 121 in an externally accessible state, or is output to an external device 20 as event information. Event information is output from the monitoring control unit 12 to the external device 20 via, for example, an IPMB (Intelligent Platform Management Bus) 3.
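The two destinations for event information can be pictured with a small sketch. The class `EventSink`, its in-memory list standing in for the storage unit 121, and the `send_external` callback standing in for transmission over the IPMB are assumptions for illustration; the source says the event is stored or output, and this sketch simply does both when a sender is supplied.

```python
from typing import Callable, List, Optional


class EventSink:
    """Stores events for later external reads and optionally forwards them."""

    def __init__(self, send_external: Optional[Callable[[bytes], None]] = None) -> None:
        self._log: List[bytes] = []            # stand-in for storage unit 121
        self._send_external = send_external    # stand-in for output over the IPMB

    def publish(self, event: bytes) -> int:
        """Record the event and return its index in the log."""
        self._log.append(event)
        if self._send_external is not None:
            self._send_external(event)
        return len(self._log) - 1

    def read(self, index: int) -> bytes:
        """External access to a stored event."""
        return self._log[index]
```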

The display unit 115 is composed of, for example, a seven-segment LED (Light Emitting Diode) display and shows information such as the BIOS status at startup of the processing apparatus 10. The display unit 115 also displays the state information and the failure information of the CPU 111.

In Embodiment 1, the watchdog timer 131a for detecting a failure of the CPU 111 is provided in the failure detection unit 13, which is independent of the CPU 111. The failure detection unit 13 performs watchdog-based failure detection autonomously. As a result, even when a failure has occurred in the CPU 111 and it is operating abnormally, the failure can be detected and the failure information can be notified to the BMC 122 via the failure detection unit 13.

In addition, the state information of the CPU 111 is notified to, and held in, the CPU state holding unit 134 of the failure detection unit 13, which is independent of the CPU 111. The state information is then passed to the BMC 122 via the CPU state holding unit 134. That is, the CPU 111 notifies the failure detection unit 13 of its status, and that state information (status) is notified to the BMC 122. The failure detection unit 13 and the BMC 122 perform these operations autonomously. As a result, even when a failure has occurred in the CPU 111 and it is operating abnormally, the state information of the CPU 111 can be output to the BMC 122 via the failure detection unit 13.

Next, the operation of the processing apparatus according to Embodiment 1 will be described.
FIG. 3 is a sequence diagram illustrating the operation of the processing apparatus according to Embodiment 1.

As shown in FIG. 3, the CPU 111 of the main processing unit 11 notifies the CPU state holding unit 134 of the failure detection unit 13 of status information (state information of the CPU 111), for example at startup of the processing apparatus 10 (step S101). This status information is a POST status code that reflects how far the BIOS POST has progressed.

Together with step S101, the CPU 111 issues a watchdog timer reload request (that is, transmits the detection signal) to the watchdog timer control unit 131 (step S102).

Through step S101, the failure detection unit 13 acquires the state information of the main processing unit 11, which the main processing unit 11 notifies regardless of whether a failure has occurred. Steps S101 and S102 may be performed periodically. The notification in step S101 is referred to as a status notification.

The CPU state holding unit 134 causes the display unit 115 to display the content of the status notification of step S101. The watchdog timer control unit 131 starts the watchdog timer for the CPU 111 in advance, before the status notification of step S101 (step S103).

As long as the watchdog timer reload request of step S102 is issued periodically, the main processing unit 11 is regarded as operating normally.

When a deadlock occurs in the main processing unit 11 (step S105), the watchdog timer reload requests stop arriving. If no reload request arrives for the predetermined period α, the watchdog timer control unit 131 of the failure detection unit 13 determines that a failure has occurred in the CPU 111 and times out the watchdog timer (step S106). Through step S106, the failure detection unit 13 detects the failure of the main processing unit 11 and acquires failure information such as the timeout. The watchdog timer control unit 131 then notifies the alarm additional information generation unit 133 and the alarm generation unit 132 of timeout information about this timeout (steps S107a and S107c). This notification of timeout information is referred to as a timeout notification.

On receiving the timeout notification of step S107a, the alarm additional information generation unit 133 queries the CPU state holding unit 134 for the current status information of the CPU 111 (state information of the CPU 111) and reads it from the CPU state holding unit 134 (step S107b). The alarm additional information generation unit 133 then converts the status information and the timeout information into the format used to notify the BMC 122, thereby generating the alarm additional information (step S108a).
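Steps S107a, S107b, and S108a can be sketched as a single handler that reacts to the timeout notification. The names `AlarmAdditionalInfo` and `AlarmAdditionalInfoGenerator`, the `read_current_status` callback, and the timestamp field are hypothetical; the actual format sent to the BMC is the one shown in FIG. 4, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AlarmAdditionalInfo:
    status_code: Optional[int]   # CPU status held when the timeout was reported
    timeout_at: float            # time carried by the timeout notification


class AlarmAdditionalInfoGenerator:
    """Combines the timeout notification with the held CPU status."""

    def __init__(self, read_current_status: Callable[[], Optional[int]]) -> None:
        # Stand-in for querying the CPU state holding unit.
        self._read_current_status = read_current_status
        self.pending: Optional[AlarmAdditionalInfo] = None

    def on_timeout(self, timeout_at: float) -> None:
        """Timeout notification received (S107a)."""
        status = self._read_current_status()                     # S107b
        self.pending = AlarmAdditionalInfo(status, timeout_at)   # S108a
```

The `pending` attribute is what a polling BMC would later read; it stays `None` until a timeout has been reported.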

On receiving the timeout information, the alarm generation unit 132 raises a watchdog timer alarm flag indicating that a watchdog timer alarm has occurred (step S108b). Raising the watchdog timer alarm flag means, for example, providing an alarm bit for the watchdog timer and turning that bit on. The on/off state of the alarm bit is called alarm information. The alarm generation unit 132 generates the alarm information based on the timeout information (failure information) of the CPU 111 of the main processing unit 11.
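A flag of this kind is commonly a single bit in a register. The sketch below assumes an 8-bit register and places the alarm in bit 0; the class name `AlarmRegister`, the bit position, and the register width are illustrative assumptions, not details from the source.

```python
WDT_ALARM_BIT = 0x01  # hypothetical bit position for the watchdog timer alarm


class AlarmRegister:
    """8-bit register whose bit 0 carries the watchdog timer alarm flag."""

    def __init__(self) -> None:
        self.value = 0

    def raise_wdt_alarm(self) -> None:
        """Turn the alarm bit on (step S108b in the text)."""
        self.value |= WDT_ALARM_BIT

    def clear_wdt_alarm(self) -> None:
        """Turn the alarm bit off, leaving other bits untouched."""
        self.value &= ~WDT_ALARM_BIT & 0xFF

    def wdt_alarm_raised(self) -> bool:
        """Alarm information: whether the alarm bit is currently on."""
        return bool(self.value & WDT_ALARM_BIT)
```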

The BMC 122 of the monitoring control unit 12 monitors the watchdog timer alarm flag (step S104a) and also monitors whether alarm additional information has been generated (step S104b). In step S104a, the BMC 122 monitors whether the watchdog timer alarm flag has been raised, for example by polling the alarm generation unit 132. In step S104b, the BMC 122 monitors whether alarm additional information has been generated by polling the alarm additional information generation unit 133. In other words, the monitoring control unit 12 issues a failure confirmation request to the alarm additional information generation unit 133 of the failure detection unit 13 and acquires the state information of the CPU 111 from it, and likewise issues a failure confirmation request to the alarm generation unit 132 and acquires the failure information from it. The failure confirmation requests of steps S104a and S104b are issued periodically.
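The polling of steps S104a and S104b can be sketched as one pass that checks both sources and records what it finds. The class `BmcPoller`, the two reader callbacks, the event dictionaries, and the "report each condition once" behavior are assumptions made for this illustration.

```python
from typing import Callable, List, Optional


class BmcPoller:
    """Polls the alarm flag and the alarm additional information."""

    def __init__(
        self,
        read_alarm_flag: Callable[[], bool],
        read_additional_info: Callable[[], Optional[dict]],
    ) -> None:
        self._read_alarm_flag = read_alarm_flag
        self._read_additional_info = read_additional_info
        self._alarm_seen = False
        self._info_seen = False
        self.events: List[dict] = []

    def poll_once(self) -> None:
        """One periodic failure confirmation pass."""
        if not self._alarm_seen and self._read_alarm_flag():      # S104a
            self._alarm_seen = True
            self.events.append({"type": "wdt_alarm"})
        if not self._info_seen:
            info = self._read_additional_info()                   # S104b
            if info is not None:
                self._info_seen = True
                self.events.append({"type": "alarm_additional_info", **info})
```

Calling `poll_once` on a timer driven by the BMC's own CPU keeps the check independent of the monitored CPU.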

When the BMC 122 recognizes that the watchdog timer alarm flag has been raised and a watchdog timer alarm has occurred (step S109a), it formats the information in the platform event generation unit 122a and stores and registers the watchdog timer alarm information as a SEL (System Event Log) entry in the storage unit 121 or the like (step S110a).

Also, when the BMC 122 recognizes that a watchdog timer alarm has occurred (step S109a), it sends an event notification to the external device 20 via the IPMB 3 (step S111a).

When the BMC 122 recognizes that alarm additional information has been generated (step S109b), it formats the information in the platform event generation unit 122a and stores and registers the alarm additional information as a SEL (System Event Log) entry in the storage unit 121 or the like (step S110b). In step S110b, the BMC 122 selects, from the state information of the CPU 111 acquired from the alarm additional information generation unit 133, the state information corresponding to the time of the failure and stores it in the storage unit 121 in an externally accessible state.
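The source does not spell out how the state information "corresponding to the time of the failure" is selected. One plausible reading, sketched below purely as an assumption, is to keep timestamped status records and pick the latest one at or before the failure time. The function name `select_status_at_failure` and the `(timestamp, status_code)` record layout are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def select_status_at_failure(
    history: List[Tuple[float, int]], failure_time: float
) -> Optional[int]:
    """Return the last status notified at or before failure_time.

    history must be sorted by timestamp in ascending order.
    Returns None if no status had been notified by failure_time.
    """
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, failure_time)
    return history[idx - 1][1] if idx > 0 else None
```

In the flow described in the text, the holding unit is queried at the moment of the timeout, so the simplest case reduces to taking the most recent status.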

Also, when the BMC 122 recognizes that alarm additional information has been generated (step S109b), it sends an event notification to the external device 20 via the IPMB 3 (step S111b). That is, the BMC 122 associates the state information with the failure information and outputs them to the outside.

Next, the event information generated by the platform event generation unit 122a will be described.
FIG. 4 is a diagram illustrating event information generated by the platform event generation unit.
FIG. 4 shows the information exchanged between the alarm additional information generation unit 133 of the failure detection unit 13 and the alarm additional information acquisition unit 122b of the BMC 122.

Information D401 (alarm additional information) shown in FIG. 4 is the information exchanged between the alarm additional information generation unit 133 and the alarm additional information acquisition unit 122b. It is generated by the alarm additional information generation unit 133 and output to the alarm additional information acquisition unit 122b. Information D401 contains a status issuer, a status code, and a status code (extended). The status issuer and the status code (extended) are reserved for future expansion. The status code indicates the status information (state information) of the CPU 111 at the time a deadlock (failure) occurs in the main processing unit 11.

Information D402 shown in FIG. 4 is the event information generated by the platform event (PlatformEvent) generation unit 122a of the BMC 122. The platform event generation unit 122a converts information D401 (alarm additional information), which the alarm additional information acquisition unit 122b acquired from the alarm additional information generation unit 133, into the PlatformEvent format defined by IPMI (Intelligent Platform Management Interface). The event information generated by the platform event generation unit 122a is stored in the storage unit 121 or the external device 20.
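The conversion from information D401 to information D402 can be pictured with a minimal Python sketch. The three D401 fields come from the description; everything else is an assumption made for illustration: the names `AlarmAdditionalInfo`, `PlatformEventRecord` and `to_platform_event`, the one-byte width of each field, and the placeholder sensor type value. The record below is not the IPMI wire format, which the description does not spell out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AlarmAdditionalInfo:
    """Information D401: the three fields named in the description."""
    status_issuer: int    # reserved for future expansion
    status_code: int      # CPU status at the time of the failure
    status_code_ext: int  # reserved for future expansion


@dataclass(frozen=True)
class PlatformEventRecord:
    """Hypothetical stand-in for information D402 (not the IPMI layout)."""
    sensor_type: int
    event_data: Tuple[int, int, int]


# Placeholder value chosen for illustration only.
HYPOTHETICAL_SENSOR_TYPE = 0x23


def to_platform_event(info: AlarmAdditionalInfo) -> PlatformEventRecord:
    """Format D401 into an event record, assuming one byte per field."""
    fields = (info.status_issuer, info.status_code, info.status_code_ext)
    for value in fields:
        if not 0 <= value <= 0xFF:
            raise ValueError("each field is assumed to fit in one byte")
    return PlatformEventRecord(HYPOTHETICAL_SENSOR_TYPE, fields)


event = to_platform_event(AlarmAdditionalInfo(0x00, 0xB4, 0x00))
```

Keeping the status code in the event record is what lets a reader of the log see the CPU state alongside the alarm itself.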

In this way, not only the deadlock of the CPU 111 of the main processing unit 11 (failure information) but also the status information (state information) of the main processing unit 11 can be conveyed to the external device 20 or the like as event information.

A feature of the processing apparatus according to the first embodiment is that an abnormality of the CPU 111 is detected by the external watchdog timer 131a, and the state information of the CPU 111 at the time of the failure, for example at the time of a watchdog timer alarm, is notified to the external device 20 or the like as event information.

The effects of the first embodiment will now be described.
In the first embodiment, the CPU 111 of the main processing unit 11 notifies the failure detection unit 13 of status information as needed, and the watchdog timer alarm is detected and reported to the BMC 122 autonomously. A failure of the CPU 111 is detected with an external watchdog timer rather than with the watchdog timer built into the CPU 111. As a result, even when the CPU 111 itself has failed, the failure can be detected and the failure status of the CPU 111 can be reported to the BMC 122.
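The following Python sketch simulates this arrangement under stated assumptions. The class `ExternalWatchdog`, the function `bmc_poll`, the explicit time arguments and the timeout value are all hypothetical; the description only says that the CPU reports status as needed, reloads the external watchdog periodically, and that the BMC finds the alarm together with the status held at that moment.

```python
class ExternalWatchdog:
    """Hypothetical model of the failure detection unit 13."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reload = 0.0
        self.latest_status = None  # role of the CPU state holding unit 134
        self.alarm_flag = False
        self.alarm_status = None
        self._expired = False      # fire only once per missed reload

    def notify_status(self, status_code):
        # The CPU reports its status whether or not a failure occurs.
        self.latest_status = status_code

    def reload(self, now):
        self.last_reload = now
        self._expired = False

    def tick(self, now):
        # On timeout, latch the alarm with the status held at that moment.
        if not self._expired and now - self.last_reload >= self.timeout:
            self._expired = True
            self.alarm_flag = True
            self.alarm_status = self.latest_status


def bmc_poll(watchdog, event_log):
    """Hypothetical BMC polling step: log alarm plus status, then clear."""
    if not watchdog.alarm_flag:
        return False
    event_log.append({"alarm": "watchdog_timeout",
                      "status_code": watchdog.alarm_status})
    watchdog.alarm_flag = False
    return True


log = []
wd = ExternalWatchdog(timeout=3.0)
wd.notify_status(0x12)
wd.reload(1.0)
wd.notify_status(0x34)  # last status reported before the CPU stops
wd.tick(4.0)            # no reload since t=1.0, so the timer expires
bmc_poll(wd, log)
```

Because the timer and the held status live outside the CPU, the log entry can still be produced when the CPU has stopped responding.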

Also in the first embodiment, the status information of the CPU 111 and the watchdog timer alarm information are sent to the monitoring control unit 12 by the failure detection unit 13, which is located outside the CPU 111, without using the SMI (System Management Interrupt) function of the CPU 111. As a result, even when the CPU 111 has failed, its status can be reported to the BMC 122.

Furthermore, when the main processing unit 11 deadlocks, the processing apparatus 10 according to the first embodiment can convey to the external device 20 or the like the status information and alarm information of the CPU 111 that are needed to analyze the cause. This makes it easy to analyze a failure of the processing apparatus 10. Consequently, it is possible to provide a processing apparatus, method, and program with which a failure of the main processing unit can be analyzed easily when it occurs.

Although this example has been described using the BIOS status code, it may also be applied to process numbers of an OS (Operating System) or an application.

Although this example has been described using an external watchdog timer, the external watchdog timer may instead be integrated into the SoC (System-on-a-Chip) used for monitoring control, so that the arrangement is realized as a single chip.

[Comparative Example 1 of Embodiment 1]
FIG. 5 is a block diagram illustrating a processing apparatus according to Comparative Example 1 of the first embodiment.

As shown in FIG. 5, the processing apparatus 10a according to Comparative Example 1 has a main processing unit 11a, a monitoring control unit 12a, and a failure detection unit 13a.

The main processing unit 11a has a simple display unit 115, such as a 7-segment LED (Light Emitting Diode) display, and shows on it the BIOS status (POST (Power On Self Test) status) while the processing apparatus 10a starts up.

Unlike the monitoring control unit 12 of the first embodiment, the monitoring control unit 12a has no alarm additional information acquisition unit 122b.

Unlike the failure detection unit 13 of the first embodiment, the failure detection unit 13a has neither the CPU state holding unit 134 nor the alarm additional information generation unit 133.

FIG. 6 is a sequence diagram illustrating the operation of the processing apparatus according to Comparative Example 1 of the first embodiment.

As shown in FIG. 6, the CPU 111 of the main processing unit 11a sends the display unit 115 a POST status code that reflects the progress of the BIOS POST, for example while the processing apparatus 10a starts up (step S101), and periodically issues a watchdog timer reload request to the watchdog timer control unit 131 (step S102).

When a deadlock occurs in the main processing unit 11a (step S105) and the watchdog timer reload requests stop arriving, the watchdog timer control unit 131 times out (step S106) and informs the alarm generation unit 132 (step S107).

The alarm generation unit 132 sets a flag indicating that a watchdog timer alarm has occurred (step S108).

Meanwhile, the BMC 122 of the monitoring control unit 12a periodically monitors the watchdog timer alarm flag (step S104). When it recognizes the alarm (step S109), it stores and registers the alarm information as an SEL (System Event Log) entry in the storage unit 121 or the like (step S110) and sends an event notification to the external device 20 via the IPMB 3 (step S111).

In Comparative Example 1 of the first embodiment, however, when a deadlock occurs in the main processing unit 11a, only the alarm information indicating that a failure has occurred can be recognized. The state information of the CPU 111 cannot be recognized. It is therefore difficult to analyze the failure of the processing apparatus 10.

[Comparative Example 2 of Embodiment 1]
Comparative Example 2 of the first embodiment differs from the first embodiment in that a failure of the CPU 111 is detected using the watchdog timer function built into the CPU 111. In Comparative Example 2, if the watchdog timer function itself fails while the CPU 111 is operating abnormally, it is difficult to detect the failure of the CPU 111.

[Comparative Example 3 of Embodiment 1]
Comparative Example 3 of the first embodiment differs from the first embodiment in that the occurrence of a CPU failure is reported using the SMI (System Management Interrupt) function of the CPU 111. In Comparative Example 3, if the SMI function itself fails while the CPU 111 is operating abnormally, it is difficult to report the failure of the CPU 111.

[Embodiment 2]
Next, a second embodiment will be described.
FIG. 7 is a block diagram illustrating a processing apparatus according to the second embodiment.

As shown in FIG. 7, the second embodiment differs from the first embodiment in that the failure detection unit 13 has an additional information control unit 135. The additional information control unit 135 acquires the status code multiple times after a deadlock occurs and controls how many of the acquired status codes are output to the alarm additional information acquisition unit 122b in each fixed interval. The length of the interval and the number of status codes output per interval are set to desired values by the additional information control unit 135.

In the second embodiment, the additional information control unit 135 sets the length of the fixed interval and the number of status codes output per interval. As a result, the length of the watchdog timer reload interval can be used to judge whether a failure such as runaway software has occurred in the CPU 111, or whether the software is still running more or less properly.

For example, the monitoring control unit 12 can judge the degree of failure of the main processing unit 11 from the time between the last detection signal before the failure and the first detection signal after the failure.

The monitoring control unit 12 may also judge the degree of failure of the main processing unit 11 from the time between the first and second detection signals after the failure.

The monitoring control unit 12 may also judge the degree of failure of the main processing unit 11 from the time between the last detection signal before the failure and the detection signal immediately preceding it.
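The three intervals described above can be computed as in the following Python sketch. The function name `reload_intervals`, the dictionary keys, and the convention that `failure_time` is the moment the failure was detected are assumptions made for illustration. The description gives no thresholds for grading the failure, so the sketch only extracts the intervals.

```python
def reload_intervals(signal_times, failure_time):
    """Return the three intervals used to judge the degree of failure.

    signal_times: times at which detection signals (reloads) arrived.
    failure_time: time at which the failure was detected (assumed input).
    An interval is None when too few signals exist to compute it.
    """
    before = sorted(t for t in signal_times if t <= failure_time)
    after = sorted(t for t in signal_times if t > failure_time)
    return {
        # last signal before the failure -> first signal after it
        "across_failure": after[0] - before[-1] if before and after else None,
        # first -> second signal after the failure
        "after_failure": after[1] - after[0] if len(after) >= 2 else None,
        # second-to-last -> last signal before the failure
        "before_failure": before[-1] - before[-2] if len(before) >= 2 else None,
    }


# Signals at t=0,1,2, a gap, then t=9,10; failure detected at t=5.
intervals = reload_intervals([0, 1, 2, 9, 10], failure_time=5)
# intervals == {"across_failure": 7, "after_failure": 1, "before_failure": 1}
```

In this example the reload period returns to 1 after the gap, which would suggest that the software resumed roughly normal operation; how such values map to a degree of failure is left to the implementer.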

[Embodiment 3]
Next, a third embodiment will be described.
FIG. 8 is a block diagram illustrating a processing apparatus according to the third embodiment.
FIG. 9 is a block diagram illustrating part of the processing apparatus according to the third embodiment.

As shown in FIG. 8, the failure detection unit 13 according to the third embodiment differs from that of the first embodiment in that it further has an abnormality detection unit 136 and a storage unit 137. The abnormality detection unit 136 has the same kind of function and mechanism as the watchdog timer control unit 131, but serves to detect failures in parts other than the CPU 111. To detect failures in several parts individually, the abnormality detection unit 136 has a plurality of detection functions, such as detection function Fa, detection function Fb, and detection function Fc.

When the failure detection unit 13 is implemented as, for example, an FPGA (Field-Programmable Gate Array), the configuration file used to configure the FPGA is stored in the storage unit 137, which consists of a flash ROM (Read Only Memory) or the like.

When the FPGA is configured, the storage unit 137 holding the configuration file is accessed. At this time, the abnormality detection unit 136 detects, for example, an access error to the storage unit 137 and reports it to the BMC 122 as alarm additional information. The kind of alarm additional information is identified using, for example, the status issuer, status code, and status code (extended) of information D401 shown in FIG. 4. In this way, the abnormality detection unit 136 detects a failure.

In addition to access errors to the storage unit 137, the abnormality detection unit 136 detects abnormalities in other parts using detection function Fa, detection function Fb, detection function Fc, and so on.

When the abnormality detection unit 136 and the watchdog timer control unit 131 detect abnormalities in several parts at the same time and report them to the alarm additional information generation unit 133, the reported items may collide and some of the information may be lost. To avoid such loss, a FIFO (First In First Out) buffer is provided in the failure detection unit 13, for example as shown in FIG. 9.

Next, the operation of the alarm additional information generation unit 133 according to the third embodiment will be described.

As shown in FIG. 9, whenever the watchdog timer control unit 131 or the abnormality detection unit 136 issues an alarm write request, the alarm additional information generation unit 133 writes the information into the FIFO in order of arrival.

When information has been written to the FIFO, the alarm additional information generation unit 133 checks register R401, which indicates whether alarm additional information is present. A set flag in register R401 indicates that information has been written to the FIFO. A cleared flag in register R401 indicates that no information has been written to the FIFO.

If the flag in register R401 is set, the alarm additional information generation unit 133 does nothing. If the flag in register R401 is not set, it takes one item of information out of the FIFO, writes it to register R402, and sets the flag in register R401.

Meanwhile, the BMC 122 monitors register R401 and does nothing while its flag is not set. When the flag in register R401 is set, the BMC 122 reads the contents of register R402 and clears the flag in register R401.

Register R402 holds the status issuer, the status code, and the status code (extended).
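The handshake between the FIFO, register R401 and register R402 can be sketched in Python as follows. The class `AlarmChannel` and its method names are hypothetical, and the sketch reads the R401 flag as "register R402 holds an entry the BMC has not read yet", which is an interpretation of the behavior described above rather than a statement from the description.

```python
from collections import deque


class AlarmChannel:
    """Hypothetical model of the FIFO, R401 flag and R402 data register."""

    def __init__(self):
        self.fifo = deque()
        self.r401 = False  # flag: R402 holds an unread entry (interpretation)
        self.r402 = None   # (status issuer, status code, status code ext)

    def write_alarm(self, issuer, code, ext=0):
        # Write request from the watchdog timer control unit 131 or the
        # abnormality detection unit 136: queue it, never overwrite.
        self.fifo.append((issuer, code, ext))

    def generator_step(self):
        # Alarm additional information generation unit 133: move one entry
        # to R402 only when the previous one has been read.
        if self.fifo and not self.r401:
            self.r402 = self.fifo.popleft()
            self.r401 = True

    def bmc_step(self):
        # BMC 122: read R402 when the flag is set, then clear the flag.
        if not self.r401:
            return None
        entry = self.r402
        self.r401 = False
        return entry


channel = AlarmChannel()
channel.write_alarm(issuer=1, code=0x34)  # e.g. watchdog timeout
channel.write_alarm(issuer=2, code=0x01)  # e.g. storage access error
received = []
for _ in range(3):
    channel.generator_step()
    entry = channel.bmc_step()
    if entry is not None:
        received.append(entry)
# received == [(1, 0x34, 0), (2, 0x01, 0)]
```

Two alarms raised at the same moment reach the BMC one at a time and in arrival order, which is the information loss the FIFO is meant to prevent.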

Although the above embodiments describe the present invention mainly as a hardware configuration, the present invention is not limited to this. The processing of each component can also be realized by causing a CPU (Central Processing Unit) to execute a computer program.

In the above example, the program can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, semiconductor memory (for example, mask ROM, PROM (Programmable ROM), and EPROM (Erasable PROM)), flash ROM, and RAM (Random Access Memory). The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or optical fiber, or via a wireless communication path.

The present invention is not limited to the above embodiments and can be modified as appropriate without departing from its spirit.

Description of symbols
10, 10a: processing apparatus
11, 11a: main processing unit
12, 12a: monitoring control unit
13, 13a: failure detection unit
111: CPU
112 to 114: storage unit
115: display unit
121: storage unit
122: BMC
122a: platform event generation unit
122b: alarm additional information acquisition unit
122c: alarm acquisition unit
131: watchdog timer control unit
131a: watchdog timer
132: alarm generation unit
133: alarm additional information generation unit
134: CPU state holding unit
135: additional information control unit
136: abnormality detection unit
137: storage unit
D401, D402: information
R401, R402: register
Fa, Fb, Fc: detection function
L11, L12, L13, L21, L22, L23: path
α: predetermined period

Claims

A processing apparatus comprising:
a main processing unit that performs main processing;
a failure detection unit that is provided outside the main processing unit, detects the occurrence of a failure in the main processing unit, and acquires state information of the main processing unit that is notified regardless of whether the failure has occurred; and
a monitoring control unit that selects, from the state information acquired by the failure detection unit, the state information corresponding to the time of occurrence of the failure and saves it in an externally accessible state.

The main processing unit transmits a detection signal for detecting the occurrence of the failure to the failure detection unit,
The failure detection unit determines that the occurrence of the failure has been detected when there is no detection signal for a predetermined period.
The processing apparatus according to claim 1.

The monitoring control unit determines the degree of failure of the main processing unit based on the time from the last detection signal before the occurrence of the failure to the first detection signal after the occurrence of the failure.
The processing apparatus according to claim 2.

The monitoring control unit determines a degree of failure of the main processing unit based on a time from the first detection signal to the second detection signal after the occurrence of the failure;
The processing apparatus according to claim 2 or 3.

The monitoring control unit determines the degree of failure of the main processing unit based on the time between the last detection signal before the occurrence of the failure and the previous detection signal.
The processing apparatus as described in any one of Claims 2-4.

The processing apparatus according to claim 2, wherein the detection signal is transmitted within the predetermined period.

The monitoring control unit makes a failure confirmation request to the failure detection unit, and acquires failure information indicating whether or not the failure has occurred and the state information from the failure detection unit,
The processing apparatus as described in any one of Claims 1-6.

The processing apparatus according to claim 7, further comprising a display unit that displays the state information and the failure information.

A method comprising:
a step of detecting a failure of a main processing unit;
a step of acquiring state information of the main processing unit that is notified regardless of whether the failure has occurred;
a step of selecting, from the acquired state information, the state information corresponding to the time of occurrence of the failure; and
a step of saving the selected state information in an externally accessible state.

A program that causes a computer to execute:
a step of detecting a failure of a main processing unit;
a step of acquiring state information of the main processing unit that is notified regardless of whether the failure has occurred;
a step of selecting, from the acquired state information, the state information corresponding to the time of occurrence of the failure; and
a step of saving the selected state information in an externally accessible state.