JPWO2014112039A1

JPWO2014112039A1 - Information processing apparatus, information processing apparatus control method, and information processing apparatus control program

Info

Publication number: JPWO2014112039A1
Application number: JP2014557215A
Authority: JP
Inventors: 正信古越
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-01-15
Filing date: 2013-01-15
Publication date: 2017-01-19
Also published as: WO2014112039A1

Abstract

A signal fluctuation determination unit (142) detects an output abnormality based on the output data of a hard disk drive (15). When the signal fluctuation determination unit (142) detects an output abnormality, an HD controller (13) performs reset processing, in which it transmits a reset signal to the hard disk drive (15) to restart the hard disk drive (15). When the number of reset processes performed by the HD controller (13) exceeds a threshold, a recovery possibility determination unit (144) turns the power of the hard disk drive (15) off and on. When the hard disk drive (15) starts up as a result of this power cycling by the recovery possibility determination unit (144), a CPU performs failure record collection processing, in which it stores a failure record in the hard disk drive (15).

Description

The present invention relates to an information processing apparatus, an information processing apparatus control method, and an information processing apparatus control program.

In an information processing apparatus such as a server, the signal terminals of a hard disk drive (HDD) are connected to a hard disk controller through an interface signal bus. The power supply terminal of the hard disk drive is connected to a power supply circuit through an HDD power supply line, from which the drive obtains its operating power. The OS (Operating System) and other software are read from the hard disk drive, loaded into memory, and executed by a CPU (Central Processing Unit).

When a failure that affects the OS or other software occurs and the OS hangs, the following processing takes place. First, a BMC (Baseboard Management Controller), which is a microcontroller, detects the hang and issues a forced dump command to the OS. Here, the BMC is a management controller that operates independently of the server's built-in CPU and memory and monitors and controls them. Next, the crash dump function of the OS temporarily saves the data in memory to the swap area of the hard disk drive. The crash dump function of the OS then resets the server. After the OS restarts, the crash dump function of the OS saves the data that was placed in the swap area to the crash dump storage directory on the hard disk drive. By collecting data with the crash dump function of the OS in this way, the information processing apparatus can leave a failure record. The administrator of the information processing apparatus can then analyze the failure record to investigate the cause of the failure.

There is a conventional technique that monitors the state of the watchdog timer of a hard disk control device and, when the watchdog timer is detected to have operated a plurality of times, attempts to recover the hard disk control device by a signal-based reset and by turning the power on and off (see, for example, Patent Document 1). There is also a conventional technique that restarts a hard disk drive when there is no response from the hard disk drive or when it returns an error response (see, for example, Patent Document 2).

Patent Document 1: JP 2003-9192 A
Patent Document 2: JP 2011-76662 A

However, a hang may occur because the hard disk drive has stopped operating and, owing to a cause such as a bug in the hard disk drive firmware, does not recover even when a reset signal or the like is used. In such a case, even if the crash dump function of the OS attempts to run, the hard disk drive does not operate, so operations such as the data collection described above cannot be performed.

In some systems, the operational health of a server is checked from another server on the network by confirming whether it responds to ping or the like. However, in the case of a failure such as the hard disk stopping, the server can often still respond, so the failure is hard to detect. It is therefore difficult to detect a hard disk stoppage before the hang occurs.

In addition, when the hard disk drive has stopped operating but the rest of the system is running normally, the drive can often be recovered by turning its power off and on again. However, if there is no means of power-cycling the hard disk drive appropriately, it is difficult to recover the hard disk drive appropriately.

For these reasons, it is difficult for the system to detect that a server failure has been caused by the hard disk drive stopping, and it is difficult to reduce the occurrence of failure records going uncollected when the hard disk drive operates abnormally.

Furthermore, with the conventional technique that attempts to recover the hard disk control device based on the operation of the watchdog timer, it is difficult to distinguish whether the hard disk drive is idle or has developed an abnormality, and therefore difficult to detect abnormal operation of the hard disk drive appropriately. The same is true of the conventional technique that performs recovery based on the state of the response from the hard disk drive: it is difficult to distinguish an idle drive from an abnormal one, and therefore difficult to detect abnormal operation appropriately. Consequently, even with these conventional techniques, it is difficult to reduce the occurrence of failure records going uncollected when the hard disk drive operates abnormally.

The disclosed technology has been made in view of the above, and its object is to provide an information processing apparatus, an information processing apparatus control method, and an information processing apparatus control program that reduce the occurrence of failure records going uncollected because of abnormal operation of a hard disk drive.

In one aspect of the information processing apparatus, the information processing apparatus control method, and the information processing apparatus control program disclosed in the present application, an output abnormality detection unit detects an output abnormality based on the output data of a hard disk drive. When the output abnormality detection unit detects an output abnormality, a reset unit performs reset processing, in which it transmits a reset signal to the hard disk drive to restart the hard disk drive. When the number of reset processes performed by the reset unit exceeds a threshold, an HDD power supply control unit turns the power of the hard disk drive off and on. When the hard disk drive starts up as a result of the power cycling by the HDD power supply control unit, a failure record collection unit collects a failure record.

According to one aspect of the information processing apparatus, the information processing apparatus control method, and the information processing apparatus control program disclosed in the present application, the occurrence of failure records going uncollected because of abnormal operation of the hard disk drive can be reduced.

FIG. 1 is a block diagram of a server according to the first embodiment.
FIG. 2 is a block diagram showing details of the signal monitoring unit.
FIG. 3 is a flowchart of hard disk drive failure detection processing in the information processing apparatus according to the first embodiment.
FIG. 4 is a flowchart of dump processing in the information processing apparatus according to the second embodiment.
FIG. 5 is a diagram illustrating an example of the hardware configuration of a server according to each embodiment.

Embodiments of the information processing apparatus, the information processing apparatus control method, and the information processing apparatus control program disclosed in the present application are described below in detail with reference to the drawings. The information processing apparatus, the information processing apparatus control method, and the information processing apparatus control program disclosed in the present application are not limited by the following embodiments.

FIG. 1 is a block diagram of a server according to the first embodiment. As shown in FIG. 1, the server 1 according to this embodiment includes a CPU 11, a memory 12, an HD controller 13, a signal monitoring unit 14, a hard disk drive 15, a counter reset timer 16, a power switch 17, a BMC 18, a server power supply 19, and an HDD power supply 20.

Here, the server 1 according to this embodiment is a DAS (Direct Attached Storage) information processing apparatus in which RAID or the like is not configured. For example, the server 1 is an information processing apparatus for communications that is equipped with only one hard disk drive.

The HDD power supply 20 is the source of the power supplied to the hard disk drive 15. In FIG. 1, the power supply path from the HDD power supply 20 to the hard disk drive 15 is indicated by a one-dot chain line.

The power switch 17 is, for example, an FET (field-effect transistor) switch. When the power switch 17 is on, power from the HDD power supply 20 is supplied to the hard disk drive 15. When the power switch 17 is off, the supply of power from the HDD power supply 20 to the hard disk drive 15 is stopped.

The server power supply 19 is the source of power for the units mounted on the server 1, such as the CPU 11 and the memory 12. For example, the server power supply 19 supplies power to each unit located inside the area enclosed by the dotted line in FIG. 1.

The CPU 11 instructs the HD controller 13 to write data to and read data from the hard disk drive 15. The CPU 11 thus actually reads and writes data on the hard disk drive 15 via the HD controller 13, but for convenience the following description sometimes states that the CPU 11 reads and writes data on the hard disk drive 15 directly. For example, the CPU 11 reads the OS and other programs stored in the hard disk drive 15 via the HD controller 13 and loads them into the memory 12 or the like. The CPU 11 then performs various kinds of processing, such as arithmetic processing, using the memory 12 and the like.

Further, when the response of the hard disk drive 15 is abnormal, the CPU 11 instructs the HD controller 13 to transmit a reset signal to the hard disk drive 15. Here, an abnormal response of the hard disk drive 15 includes, for example, a state in which there is no response from the hard disk drive 15.

When the OS hangs, the CPU 11 receives from the BMC 18 a forced dump interrupt, which forces the data in the memory 12 to be saved. On receiving the forced dump interrupt, the CPU 11 executes the crash dump function of the OS and reads the data held in the memory 12. The CPU 11 then uses the crash dump function of the OS to store the read data in the swap area of the hard disk drive 15.

Next, the crash dump function of the OS restarts the server 1. After that, the CPU 11 uses the crash dump function of the OS to store the data that was saved in the swap area of the hard disk drive 15 in the crash dump storage directory of the hard disk drive 15.
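The two-phase dump described above (save to the swap area, restart, then move the data into the crash dump storage directory) can be sketched roughly as follows. This is only an illustrative model: the `Disk` class and both function names are hypothetical, and clearing the swap area after the second phase is an assumption that the description does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Disk:
    """Hypothetical stand-in for hard disk drive 15."""
    swap_area: bytes = b""
    dump_directory: List[bytes] = field(default_factory=list)


def forced_dump(memory: bytes, disk: Disk) -> None:
    """Phase 1: on the forced dump interrupt, save the memory contents to the swap area."""
    disk.swap_area = memory


def save_after_restart(disk: Disk) -> None:
    """Phase 2: after the restart, store the saved data in the crash dump directory."""
    if disk.swap_area:
        disk.dump_directory.append(disk.swap_area)
        # Assumption: the swap area is released once the dump has been stored.
        disk.swap_area = b""
```

Both phases write to the same drive, which is why a drive that has stopped responding prevents any failure record from being collected.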

The CPU 11 loads the OS and other programs into the memory 12. When the crash dump function of the OS is executed, the data in the memory 12 is read out and stored in the hard disk drive 15.

The CPU 11 and the memory 12 are an example of a "failure record collection unit".

The HD controller 13 receives instructions from the CPU 11 and writes data to and reads data from the hard disk drive 15. The HD controller 13 outputs the data read from the hard disk drive 15 to the CPU 11. Specifically, the HD controller 13 reads and writes data by, for example, exchanging HDD interface signals with the hard disk drive 15.

Further, when the response of the hard disk drive 15 is abnormal, the HD controller 13 receives an instruction from the CPU 11 and transmits a reset signal to the signal monitoring unit 14. The HD controller 13 keeps transmitting reset signals until the response abnormality is resolved.

The signal monitoring unit 14 is provided between the HD controller 13 and the hard disk drive 15. FIG. 2 is a block diagram showing details of the signal monitoring unit. As shown in FIG. 2, the signal monitoring unit 14 includes a data fluctuation measurement timer 141, a signal fluctuation determination unit 142, a reset counter 143, and a recovery possibility determination unit 144.

The data fluctuation measurement timer 141 interrupts the signal fluctuation determination unit 142 every n seconds, where n seconds is a predetermined time. The predetermined time of n seconds is preferably set according to the operating conditions of the server 1, that is, according to factors such as which programs are in use. In this embodiment, for example, a single data read usually finishes within one minute, so the predetermined time of n seconds is set to one minute.

When data is written, the signal fluctuation determination unit 142 receives the write data from the HD controller 13. The signal fluctuation determination unit 142 then stores the received write data in the hard disk drive 15.

When data is read, the signal fluctuation determination unit 142 receives the data read from the hard disk drive 15 as an HD interface signal. The signal fluctuation determination unit 142 then outputs the received HD interface signal to the HD controller 13. In addition, the signal fluctuation determination unit 142 receives an interrupt from the data fluctuation measurement timer 141 every n seconds. Triggered by that interrupt, the signal fluctuation determination unit 142 determines whether the HD interface signal received during the predetermined time has changed. Here, the HD interface signal having no fluctuation means that the same signal continues without change. Examples of such a continuous signal include a signal representing idle, a signal representing Low, such as 0, and a signal representing High, such as 1.
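As a rough illustration of the fluctuation check, the sketch below treats the HD interface signal observed during one n-second window as a sequence of sampled symbols and reports whether they are all identical. The sampling model and the name `has_fluctuation` are assumptions made for illustration; the description does not specify how the signal is captured.

```python
from typing import Hashable, Sequence


def has_fluctuation(samples: Sequence[Hashable]) -> bool:
    """Return True if the samples from one n-second window are not all identical.

    A window that holds only one repeated symbol (for example only idle,
    only 0, or only 1) counts as "no fluctuation". An empty window is
    also treated as "no fluctuation" in this sketch.
    """
    return len(set(samples)) > 1
```

For example, `has_fluctuation([0, 1, 0])` is true, while a window of repeated idle symbols such as `has_fluctuation(["IDLE", "IDLE"])` is false.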

When it determines that the HD interface signal has not changed, the signal fluctuation determination unit 142 instructs the reset counter 143 to count the reset signals that the CPU 11 outputs under instruction from the OS.

The signal fluctuation determination unit 142 monitors the reset counter 143, and when the counter value of the reset counter 143 is reset to its initial value, it resumes determining, every n seconds, whether the HDD interface signal has changed.

Further, when the recovery possibility determination unit 144, described later, turns the power of the hard disk drive 15 off and on and the hard disk drive 15 starts up, the signal fluctuation determination unit 142 receives a startup interrupt from the hard disk drive 15. In that case, the signal fluctuation determination unit 142 outputs the startup interrupt of the hard disk drive 15 to the HD controller 13 and the reset counter 143. The signal fluctuation determination unit 142 is an example of an "output abnormality detection unit".

The reset counter 143 is given an initial value and a threshold in advance. In this embodiment, the initial value of the reset counter 143 is 0. The threshold of the reset counter 143 and the counter reset interval are preferably set according to how frequently programs request responses from the hard disk drive 15. For example, a program that requests responses from the hard disk drive 15 frequently may issue 100 to 200 response requests in 5 minutes. In such a case, if the counter reset timer 16, described later, issues counter reset instructions at 5-minute intervals, it is preferable to set the threshold to, for example, 100. In the following, the interval between counter reset instructions from the counter reset timer 16 is m seconds, and the threshold is M.

The reset counter 143 receives from the HD controller 13 the reset signal that the CPU 11 outputs under instruction from the OS. The reset counter 143 then outputs the received reset signal to the hard disk drive 15.

When the HDD interface data has not changed, the reset counter 143 receives from the signal fluctuation determination unit 142 an instruction to count the reset signals that the CPU 11 outputs under instruction from the OS. After that, the reset counter 143 increments its counter by 1 each time it receives a reset signal from the HD controller 13, thereby counting the number of reset signals received.

Further, the reset counter 143 notifies the counter reset timer 16 that counting has started. After that, the reset counter 143 receives a counter reset instruction from the counter reset timer 16 every m seconds. On receiving the counter reset instruction, the reset counter 143 resets its counter by returning it to the initial value.

In contrast, if the counter exceeds the threshold M before the counter reset instruction is received from the counter reset timer 16, the reset counter 143 determines that the hard disk drive 15 is unresponsive. Here, unresponsive means a state in which the hard disk drive 15 cannot return a response because, for example, a failure has occurred. In other words, the server 1 according to this embodiment determines that the hard disk drive 15 is unresponsive when the HD interface signal has not changed for a predetermined period and the number of reset signals is at or above a predetermined value. This allows the server 1 according to this embodiment to distinguish the case where the hard disk drive 15 is merely idle (a state in which the hard disk drive 15 is not being accessed) from the case where it is unresponsive.

The reset counter 143 then stops counting the reset signals that the CPU 11 outputs under instruction from the OS. The reset counter 143 also instructs the recovery possibility determination unit 144 to carry out the recovery possibility determination process, which determines whether the hard disk drive 15 can be recovered.
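The counting behavior of the reset counter can be sketched as a small state machine. The class and method names below are hypothetical, and the sketch models only the counting logic described above: counting starts on request, each reset signal increments the counter, a counter reset instruction restores the initial value, and exceeding the threshold M stops counting and reports the drive as unresponsive.

```python
class ResetCounter:
    """Hypothetical model of the counting logic of reset counter 143."""

    def __init__(self, threshold_m: int, initial_value: int = 0) -> None:
        self.threshold_m = threshold_m
        self.initial_value = initial_value
        self.count = initial_value
        self.counting = False

    def start_counting(self) -> None:
        """Called when the interface signal shows no fluctuation."""
        self.counting = True

    def on_reset_signal(self) -> bool:
        """Count one reset signal; return True if the drive is judged unresponsive."""
        if not self.counting:
            return False
        self.count += 1
        if self.count > self.threshold_m:
            # Threshold M exceeded before a counter reset arrived:
            # stop counting and hand over to the recovery determination.
            self.counting = False
            return True
        return False

    def on_counter_reset(self) -> None:
        """Called every m seconds by the counter reset timer."""
        self.count = self.initial_value
```

With `threshold_m=3`, the first three reset signals return `False` and the fourth returns `True`, provided no counter reset arrives in between.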

When the reset counter 143 receives the startup interrupt of the hard disk drive 15 from the signal fluctuation determination unit 142, it outputs the received startup interrupt to the recovery possibility determination unit 144.

The recovery possibility determination unit 144 has a counter that counts the number of times the recovery possibility determination has been performed. The recovery possibility determination unit 144 also stores a threshold for that number, which is used to determine whether the hard disk drive 15 is unrecoverable. This threshold is preferably set according to the state of the hard disk drive 15. Normally, if the hard disk drive 15 does not recover after its power has been turned off and on about 5 to 10 times, there is considered to be no prospect of recovery. The recovery possibility determination unit 144 may therefore store, for example, 10 as the threshold for the number of recovery possibility determinations. In the following description, this threshold is N.

When the counter of the reset counter 143 exceeds the threshold, the recovery possibility determination unit 144 receives from the reset counter 143 an instruction to carry out the recovery possibility determination process. The recovery possibility determination unit 144 then issues a switch control signal to the power switch 17. For example, the recovery possibility determination unit 144 transmits to the power switch 17, as the switch control signal, a pulse signal that instructs the switch to turn the power off and then on, thereby turning the power switch 17 off and on. By turning the power switch 17 off and on, the recovery possibility determination unit 144 temporarily stops the supply of power to the hard disk drive 15 and then supplies power again. The recovery possibility determination unit 144 thereby restarts the hard disk drive 15. After turning the power switch 17 off and on, the recovery possibility determination unit 144 waits for a predetermined time for the hard disk drive 15 to restart. This waiting time is preferably set according to factors such as the type of the hard disk drive 15. A hard disk drive generally finishes starting up within 30 seconds, so when a typical hard disk drive is used, the waiting time of the recovery possibility determination unit 144 can be set to, for example, 30 seconds. In the following, the waiting time of the recovery possibility determination unit 144 is t seconds.

The recovery possibility determination unit 144 determines whether a startup interrupt of the hard disk drive 15 occurred during the t-second wait, based on whether it received the startup interrupt of the hard disk drive 15 from the reset counter 143 during that time.

If the startup interrupt of the hard disk drive 15 is received within t seconds after the power switch 17 is turned off and on, the recovery possibility determination unit 144 ends the determination of whether the hard disk drive 15 can be recovered and cancels the recovery possibility determination. The recovery possibility determination unit 144 then transmits to the BMC 18 a determination signal instructing it to initiate forced dump processing.

In contrast, if no startup interrupt of the hard disk drive 15 occurs within t seconds after the power switch 17 is turned off and on, the recovery possibility determination unit 144 increments by one the counter of the number of recovery possibility determinations performed. Using the counter value, the recovery possibility determination unit 144 then determines whether the number of recovery possibility determinations performed has reached the predetermined threshold N.

If the number of recovery possibility determinations performed is less than the threshold N, the recovery possibility determination unit 144 turns the power switch 17 off and on again and repeats the recovery possibility determination.

In contrast, if the number of recovery possibility determinations performed is N or more, the recovery possibility determination unit 144 determines that the hard disk drive 15 cannot be recovered. The recovery possibility determination unit 144 then instructs the BMC 18 to turn off the server power. The recovery possibility determination unit 144 is an example of an "HDD power supply control unit".
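The recovery possibility determination described above amounts to a bounded retry loop. The following is a minimal sketch under stated assumptions: `power_cycle` and `wait_for_startup` are hypothetical callbacks standing in for the power switch 17 pulse and the t-second wait for the startup interrupt, and the returned strings are placeholder names for the two instructions sent to the BMC 18.

```python
from typing import Callable


def recovery_determination(
    power_cycle: Callable[[], None],
    wait_for_startup: Callable[[float], bool],
    retry_limit_n: int,
    wait_t_seconds: float,
) -> str:
    """Power-cycle the drive until it starts up or N attempts have failed.

    Returns "forced_dump" if a startup interrupt arrives within t seconds
    of a power cycle, or "server_power_off" once the number of failed
    determinations reaches N.
    """
    failed_attempts = 0
    while True:
        power_cycle()                         # turn power switch 17 off, then on
        if wait_for_startup(wait_t_seconds):  # startup interrupt within t seconds?
            return "forced_dump"
        failed_attempts += 1
        if failed_attempts >= retry_limit_n:
            return "server_power_off"         # drive judged unrecoverable
```

With the example values from the description (N = 10, t = 30 seconds), the loop gives up after ten power cycles, roughly five minutes of waiting in the worst case.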

The hard disk drive 15 is, for example, a magnetic disk drive. The hard disk drive 15 receives the data sent from the HD controller 13 via the signal fluctuation determination unit 142 and stores it at the specified address. The hard disk drive 15 also transmits the data requested by the HD controller 13 to the HD controller 13 via the signal fluctuation determination unit 142. Specifically, the hard disk drive 15 transmits its response using an HD interface signal.

The counter reset timer 16 interrupts the signal fluctuation determination unit 142 every m seconds, which is a predetermined time. The predetermined time of m seconds is preferably set according to operating conditions, such as the allowable time until the hard disk drive 15 is recovered. For example, an interval of up to about five minutes can be regarded as a period during which programs do not read from or write to the hard disk drive 15, so m seconds can be set to five minutes or less.

The counter reset timer 16 receives a count start notification from the reset counter 143 of the signal monitoring unit 14. On receiving the notification, the counter reset timer 16 starts measuring time with its timer. When the timer reaches the predetermined time of m seconds, the counter reset timer 16 instructs the reset counter 143 to reset its counter. The counter reset timer 16 then resets the timer and repeats the measurement of m seconds.
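As an illustration only, the repeating m-second measurement of the counter reset timer 16 can be modeled in software as follows. The class name `CounterResetTimer`, the explicit `now` argument used in place of a hardware clock, and the callback `instruct_counter_reset` are assumptions made for this sketch and do not appear in the embodiment.

```python
from typing import Callable, Optional


class CounterResetTimer:
    """Software stand-in for the counter reset timer 16 (hypothetical model)."""

    def __init__(self, period_m: float, instruct_counter_reset: Callable[[], None]) -> None:
        self.period_m = period_m  # the interval "m seconds" from the text
        self.instruct_counter_reset = instruct_counter_reset
        self.started_at: Optional[float] = None  # None until a count start notification arrives

    def notify_count_start(self, now: float) -> None:
        """Count start notification from the reset counter 143: begin measuring."""
        self.started_at = now

    def tick(self, now: float) -> bool:
        """Advance the model to time `now`; return True if a counter reset was instructed."""
        if self.started_at is None:
            return False
        if now - self.started_at < self.period_m:
            return False
        self.instruct_counter_reset()  # tell the reset counter to clear its count
        self.started_at = now          # reset the timer and measure m seconds again
        return True
```

Passing the time in as an argument keeps the sketch deterministic; an actual implementation would be driven by the timer hardware rather than by polling.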

The BMC 18 includes a processor, registers, and the like. The BMC 18 performs various server management tasks, such as monitoring the operation of the CPU 11 and the memory 12, monitoring the state of sensors such as temperature sensors, and controlling the power supply of the server 1.

In response to an instruction entered by an administrator through an input device or the like, the BMC 18 can also transmit a control signal to the signal monitoring unit 14 to change the threshold N for the number of executions of the recovery possibility determination, stored in the recovery possibility determination unit 144, and the threshold M stored in the reset counter 143. Likewise, in response to such an instruction, the BMC 18 can transmit a timer control signal to the signal monitoring unit 14 to change the waiting time of n seconds stored in the data fluctuation measurement timer 141. The BMC 18 can further transmit a timer control signal to the counter reset timer 16 to change the interval of m seconds, stored in the counter reset timer 16, at which the counter reset signal is transmitted.
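The four administrator-adjustable values (the thresholds N and M and the times n and m) can be pictured as one settings record that the BMC 18 updates. The following is a minimal sketch; the names `MonitorSettings` and `apply_bmc_update`, the field names, and the rule that every value must be positive are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MonitorSettings:
    recovery_limit_n: int            # threshold N (recovery possibility determination unit 144)
    reset_limit_m: int               # threshold M (reset counter 143)
    poll_wait_s: float               # waiting time of n seconds (data fluctuation measurement timer 141)
    counter_reset_interval_s: float  # interval of m seconds (counter reset timer 16)


def apply_bmc_update(settings: MonitorSettings, **changes: float) -> MonitorSettings:
    """Return a copy of the settings with the requested fields changed.

    Rejecting non-positive values is an assumption of this sketch.
    """
    for name, value in changes.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    # dataclasses.replace raises TypeError for a field name that does not exist.
    return replace(settings, **changes)
```

Returning a new record instead of mutating the old one makes it easy to keep the previous values if an update is rejected.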

When the hard disk drive 15 is determined to be unrecoverable, the BMC 18 receives a determination signal from the recovery possibility determination unit 144. The BMC 18 then instructs the CPU 11 to execute the forced dump process.

The BMC 18 also instructs the CPU 11 to execute the forced dump process when the OS hangs up.

On receiving an instruction from the recovery possibility determination unit 144 to turn off the server power, the BMC 18 controls the server power supply 19 so that the power is turned off.

Next, the failure detection process for the hard disk drive 15 in the information processing apparatus according to the present embodiment will be described with reference to FIG. 3. FIG. 3 is a flowchart of the hard disk drive failure detection process in the information processing apparatus according to the first embodiment. The operation of the signal monitoring unit 14 and the operation of the CPU 11 executing the OS are described in parallel. Processing described below as being executed by the OS is actually performed by the CPU 11 that executes the OS.

The signal monitoring unit 14 starts monitoring the HDD interface signal output from the hard disk drive 15 (step S101). Specifically, the signal monitoring unit 14 starts monitoring when the server 1 starts up and power is supplied to the hard disk drive 15. At this time, the OS is performing normal processing (step S201).

The signal fluctuation determination unit 142 determines whether the HDD interface signal output from the hard disk drive 15 has fluctuated during a predetermined period (step S102). If the HDD interface signal has fluctuated during the predetermined period (step S102: Yes), the signal fluctuation determination unit 142 waits n seconds (step S103) and then repeats step S102.
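The check in steps S102 and S103 can be sketched as follows, assuming the HDD interface signal is sampled as a sequence of logic levels during each observation period. The function names, the sampled representation, and the reading of "fluctuated" as "at least two samples differ" are assumptions of this sketch; the embodiment does not specify how fluctuation is measured, and the n-second wait between periods is left out.

```python
from typing import Optional, Sequence


def signal_fluctuated(samples: Sequence[int]) -> bool:
    """Step S102: True if the sampled signal took at least two different levels."""
    return len(set(samples)) > 1


def first_stalled_window(windows: Sequence[Sequence[int]]) -> Optional[int]:
    """Repeat S102 over successive observation periods.

    Returns the index of the first period without fluctuation (S102: No),
    or None if the signal fluctuated in every period.
    """
    for index, samples in enumerate(windows):
        if not signal_fluctuated(samples):
            return index
    return None
```

For example, `first_stalled_window([[0, 1, 0], [1, 1, 1]])` identifies the second period, at which point the reset counter 143 would be told to start counting.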

If the HDD interface signal has not fluctuated during the predetermined period (step S102: No), the signal fluctuation determination unit 142 instructs the reset counter 143 to start counting reset signals. On receiving this instruction, the reset counter 143 starts counting the reset signals issued by the OS (step S104). At this time, the reset counter 143 notifies the counter reset timer 16 that counting has started.

On receiving the count start notification, the counter reset timer 16 measures the elapse of m seconds. When m seconds have elapsed, it instructs the reset counter 143 to reset its counter. The reset counter 143 waits during these m seconds (step S105). Meanwhile, the OS issues reset signals in response to abnormal responses from the hard disk drive 15 (step S202). Specifically, the OS instructs the HD controller 13 to issue a reset signal. On receiving the instruction from the OS, the HD controller 13 transmits the reset signal to the hard disk drive 15 via the reset counter 143.

The reset counter 143 determines whether the number of reset signals issued during the m seconds (denoted "c") exceeds the counter threshold M, that is, whether c > M (step S106). If the threshold M is not exceeded (step S106: No), the reset counter 143 resets its counter and returns to step S102.

If the threshold M is exceeded (step S106: Yes), the reset counter 143 stops counting reset signals (step S107). The reset counter 143 then instructs the recovery possibility determination unit 144 to execute the recovery possibility determination.
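The decision in step S106 reduces to counting the reset signals that fall inside the m-second window and applying a strict comparison c > M. A minimal sketch follows; representing reset signals as timestamps and the function names are assumptions of this sketch.

```python
from typing import Iterable


def count_resets(reset_times: Iterable[float], window_start: float, window_m: float) -> int:
    """Number c of reset signals observed in [window_start, window_start + window_m)."""
    window_end = window_start + window_m
    return sum(1 for t in reset_times if window_start <= t < window_end)


def should_start_recovery(
    reset_times: Iterable[float], window_start: float, window_m: float, threshold_m: int
) -> bool:
    """Step S106: True only when c strictly exceeds the threshold M."""
    return count_resets(reset_times, window_start, window_m) > threshold_m
```

Because the comparison is strict, exactly M resets within the window do not trigger the recovery possibility determination; M + 1 do.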

On receiving the instruction from the reset counter 143, the recovery possibility determination unit 144 starts the recovery possibility determination (step S108). At this time, the recovery possibility determination unit 144 initializes the counter of the number of executions of the recovery possibility determination (the counter value is denoted "i"; here, i = 0).

The recovery possibility determination unit 144 determines whether the number of executions of the recovery possibility determination is less than the threshold N (i < N) (step S109).

If the number of executions is less than the threshold N (step S109: Yes), the recovery possibility determination unit 144 transmits a switch control signal, which is a pulse signal for turning the power off and on, to the power switch 17 (step S110).

When the power switch 17 is turned off and on, the hard disk drive 15 restarts (step S111).

The recovery possibility determination unit 144 increments the number of executions by one (i = i + 1) (step S112).

The recovery possibility determination unit 144 determines whether a startup interrupt from the hard disk drive 15 has occurred (step S113). If no startup interrupt has occurred (step S113: No), the recovery possibility determination unit 144 returns to step S109.

If a startup interrupt has occurred (step S113: Yes), the recovery possibility determination unit 144 cancels the recovery possibility determination (step S114).

The recovery possibility determination unit 144 then transmits to the BMC 18 a determination signal notifying it that the hard disk drive 15 has started (step S115). The BMC 18 instructs the CPU 11 to invoke the forced dump process. When the CPU 11 receives this instruction, the OS starts the forced dump process (step S203).

If the number of executions is equal to or greater than the threshold N (step S109: No), the recovery possibility determination unit 144 determines that the hard disk drive 15 cannot be recovered and instructs the CPU 11, via the BMC 18, to turn off the power of the server 1. On receiving the instruction from the recovery possibility determination unit 144, the CPU 11 turns off the power of the server 1 (step S116).
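Steps S108 to S116 form a bounded retry loop, which the following sketch restates in software. The function and enum names are hypothetical, the two callbacks stand in for the power switch 17 and for the startup interrupt, and the t-second wait for the interrupt is assumed to happen inside `startup_interrupt_seen`.

```python
from enum import Enum
from typing import Callable, Tuple


class Outcome(Enum):
    FORCED_DUMP = "forced_dump"            # drive restarted (steps S114, S115)
    SERVER_POWER_OFF = "server_power_off"  # drive judged unrecoverable (step S116)


def run_recovery_determination(
    power_cycle_drive: Callable[[], None],
    startup_interrupt_seen: Callable[[], bool],
    threshold_n: int,
) -> Tuple[Outcome, int]:
    """Return the outcome and the number of power cycles that were performed."""
    i = 0                             # S108: initialize the execution counter
    while i < threshold_n:            # S109: i < N ?
        power_cycle_drive()           # S110, S111: switch control pulse, drive restarts
        i += 1                        # S112
        if startup_interrupt_seen():  # S113: startup interrupt occurred?
            return Outcome.FORCED_DUMP, i
    return Outcome.SERVER_POWER_OFF, i
```

With N = 3 and a drive that answers on the second power cycle, the loop ends after two cycles with `Outcome.FORCED_DUMP`; a drive that never answers is power-cycled exactly three times before the server is powered off.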

As described above, the information processing apparatus according to the present embodiment determines that the hard disk drive is unresponsive when the output data of the hard disk drive does not change and the number of reset signals issued exceeds a predetermined number. When the hard disk drive is unresponsive, the information processing apparatus turns the power of the hard disk drive off and on and, if the drive restarts, executes the forced dump process. A failure of the hard disk drive can thus be detected before the OS hangs up. This avoids an OS hang-up caused by an unresponsive hard disk drive and reduces the cases in which the failure history is not collected. The information processing apparatus according to the present embodiment can therefore contribute to investigating the cause of a failure using the failure history.

Next, a second embodiment will be described. In addition to the processing described in the first embodiment, the information processing apparatus according to the present embodiment detects unresponsiveness of the hard disk drive and restarts the drive during the forced dump process as well. The following description therefore focuses on the operation during the forced dump process. The block diagrams of FIG. 1 and FIG. 2 also apply to the information processing apparatus according to the present embodiment. In the following, description of units having the same functions as those of the information processing apparatus of the first embodiment is omitted.

While the data in the memory 12 is being written to the swap area of the hard disk drive 15 in the forced dump process, the signal fluctuation determination unit 142 of the signal monitoring unit 14 monitors the write responses from the hard disk drive 15. The signal fluctuation determination unit 142 determines whether the HD interface signal carrying the write response fluctuates within a predetermined period. If the HD interface signal does not fluctuate, the signal fluctuation determination unit 142 instructs the reset counter 143 to start counting reset signals.

On receiving the instruction from the signal fluctuation determination unit 142, the reset counter 143 starts counting the reset signals sent from the HD controller 13. The reset counter 143 also notifies the counter reset timer 16 that counting of reset signals has started. The reset counter 143 then determines whether the number of reset signals issued during the m seconds measured by the counter reset timer 16 has exceeded the threshold M. If it has, the reset counter 143 transmits an instruction to stop the forced dump process to the BMC 18 via the recovery possibility determination unit 144. The reset counter 143 also notifies the recovery possibility determination unit 144 that the recovery possibility determination is to be executed.

The recovery possibility determination unit 144 transmits a switch control signal to the power switch 17 to turn the power to the hard disk drive 15 off and on. The recovery possibility determination unit 144 then determines whether the hard disk drive 15 has restarted based on whether a startup interrupt from the hard disk drive 15 has occurred. If the hard disk drive 15 does not restart even after the power has been turned off and on N times, N being the threshold, the recovery possibility determination unit 144 determines that the hard disk drive 15 cannot be recovered and instructs the BMC 18 to turn off the server power. If the hard disk drive 15 has restarted, the recovery possibility determination unit 144 transmits to the BMC 18 a determination signal instructing it to invoke the forced dump process.

If the BMC 18 receives an instruction to stop the forced dump process from the reset counter 143 while the forced dump process is in progress, the BMC 18 instructs the CPU 11 to abort the forced dump process. The BMC 18 then cancels the forced dump interrupt.

If, after canceling the forced dump interrupt, the BMC 18 receives an instruction to invoke the forced dump process from the recovery possibility determination unit 144, the BMC 18 issues the forced dump interrupt to the OS again and causes the CPU 11 to execute the forced dump process again.

Next, the flow of the dump process in the information processing apparatus according to the present embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart of the dump process in the information processing apparatus according to the second embodiment.

The OS receives a forced dump interrupt from the BMC 18 (step S301) and starts the forced dump process.

The CPU 11 runs the crash dump function of the OS and writes the data in the memory 12 to the swap area of the hard disk drive 15 (step S302).

The signal fluctuation determination unit 142 determines whether the HDD interface signal carrying the write response from the hard disk drive 15 has fluctuated during a predetermined period (step S303). If it has (step S303: Yes), the CPU 11 determines whether all the data in the memory 12 has been written to the swap area of the hard disk drive 15 (step S304). If writing is not complete (step S304: No), the CPU 11 returns to step S302.

If writing is complete (step S304: Yes), the CPU 11 resets the server 1 (step S305).

After the server 1 restarts, the CPU 11 stores the data in the swap area in the crash dump storage directory of the hard disk drive 15 (step S306). The CPU 11 then shuts down the server 1 and ends the process.

If the HDD interface signal has not fluctuated during the predetermined period (step S303: No), the signal fluctuation determination unit 142 instructs the reset counter 143 to start counting reset signals. On receiving this instruction, the reset counter 143 starts counting the reset signals issued by the OS (step S307). At this time, the reset counter 143 notifies the counter reset timer 16 that counting has started.

On receiving the count start notification, the counter reset timer 16 measures the elapse of m seconds. When m seconds have elapsed, it instructs the reset counter 143 to reset its counter. The reset counter 143 waits during these m seconds (step S308).

The reset counter 143 determines whether the number c of reset signals issued during the m seconds exceeds the counter threshold M, that is, whether c > M (step S309). If the threshold M is not exceeded (step S309: No), the reset counter 143 resets its counter and returns to step S302.

If the threshold M is exceeded (step S309: Yes), the reset counter 143 stops counting reset signals (step S310). The reset counter 143 then notifies the BMC 18 that the crash dump process is to be stopped, and instructs the recovery possibility determination unit 144 to execute the recovery possibility determination for the hard disk drive 15.

On receiving the instruction to stop the crash dump process from the reset counter 143, the BMC 18 stops the crash dump process of the CPU 11 (step S311).

The BMC 18 further cancels the forced dump interrupt to the OS (step S312).

On receiving the instruction from the reset counter 143, the recovery possibility determination unit 144 starts the recovery possibility determination (step S313). At this time, the recovery possibility determination unit 144 initializes the counter of the number of executions of the recovery possibility determination (i = 0).

The recovery possibility determination unit 144 determines whether the number of executions of the recovery possibility determination is less than the threshold N (i < N) (step S314).

If the number of executions is less than the threshold N (step S314: Yes), the recovery possibility determination unit 144 transmits a switch control signal, which is a pulse signal for turning the power off and on, to the power switch 17 (step S315).

When the power switch 17 is turned off and on, the hard disk drive 15 restarts (step S316).

The recovery possibility determination unit 144 increments the number of executions by one (i = i + 1) (step S317).

The recovery possibility determination unit 144 determines whether a startup interrupt from the hard disk drive 15 has occurred (step S318). If no startup interrupt has occurred (step S318: No), the recovery possibility determination unit 144 returns to step S314.

If a startup interrupt has occurred (step S318: Yes), the recovery possibility determination unit 144 cancels the recovery possibility determination (step S319).

The recovery possibility determination unit 144 then transmits to the BMC 18 a determination signal notifying it that the hard disk drive 15 has started (step S320). The BMC 18 then returns to step S301.

If the number of executions is equal to or greater than the threshold N (step S314: No), the recovery possibility determination unit 144 determines that the hard disk drive 15 cannot be recovered and instructs the CPU 11, via the BMC 18, to turn off the power of the server 1. On receiving the instruction from the recovery possibility determination unit 144, the CPU 11 turns off the power of the server 1 (step S321) and ends the process.
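The overall shape of FIG. 4 (write, detect a stall, recover the drive, start the dump again) can be summarized in the following sketch. It is a simplification under stated assumptions: `write_page` returns False when the drive has stalled, which collapses the fluctuation check and the reset counting (steps S303 and S307 to S310) into one result, and `recover_drive` stands in for the whole recovery possibility determination (steps S313 to S319). Both callbacks and the function name are hypothetical.

```python
from typing import Callable, Sequence


def forced_dump_with_recovery(
    pages: Sequence[int],
    write_page: Callable[[int], bool],
    recover_drive: Callable[[], bool],
) -> bool:
    """Return True if the dump completed, False if the server has to be powered off."""
    while True:                         # S301: forced dump interrupt (re)issued
        stalled = False
        for page in pages:              # S302 to S304: write memory contents to swap
            if not write_page(page):    # drive stopped responding during the dump
                stalled = True          # S311, S312: dump stopped, interrupt canceled
                break
        if not stalled:
            return True                 # S305, S306: reset, then store the dump
        if not recover_drive():         # S313 to S319: power-cycle up to N times
            return False                # S321: drive unrecoverable, power off
```

After a successful recovery the sketch starts the dump over from the first page, mirroring the return to step S301 in the flowchart.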

As described above, the information processing apparatus according to the present embodiment detects unresponsiveness of the hard disk drive and restarts the drive even during the dump process performed by the crash dump function of the OS. Even if the hard disk drive becomes unresponsive while the OS is performing the dump process, the drive can be recovered and the failure history can be stored in the hard disk drive. In other words, the information processing apparatus according to the present embodiment can detect a hard disk failure before the OS hangs up and can also work around a hard disk failure during the dump process, so that omissions in collecting the failure history are avoided more reliably.

(Hardware configuration)
FIG. 5 is a diagram illustrating an example of the hardware configuration of the server according to each embodiment. As shown in FIG. 5, the server 1 includes, for example, a board 800 on which the CPU 11, the memory 12, the BMC 18, and the like illustrated in FIG. 1 are mounted, and a board 900 on which the HD controller 13, the signal monitoring unit 14, the hard disk drive 15, and the like are mounted.

The board 800 and the board 900 are connected by a connector 810, so that the CPU 11 and other components mounted on the board 800 can communicate with the HD controller 13 and other components mounted on the board 900.

The board 800 further carries a DC/DC converter 801, a UDBIF 802, a serial IF 803, and the like.

The DC/DC converter 801 steps down the voltage of the power supplied from an external power source to a voltage usable by the CPU 11 and the memory 12, and supplies power to each unit. For convenience of explanation, FIG. 5 does not show the power supply lines from the DC/DC converter 801 to each unit, but power supply lines actually connect the DC/DC converter 801 to each unit on the board 800.

The BMC 18 turns off the power of the server 1 by, for example, stopping the supply of power from the DC/DC converter 801.

The board 900 further carries a timer 901, a power supply circuit 902, an FET switch 903, and the like. The timer 901 implements the functions of, for example, the counter reset timer 16 illustrated in FIG. 1. The power supply circuit 902 implements the functions of, for example, the HDD power supply 20 illustrated in FIG. 1. The FET switch 903 implements the functions of, for example, the power switch 17 illustrated in FIG. 1.

The mounted signal monitoring unit 14 implements the functions of determining unresponsiveness of the hard disk drive 15 and determining whether it can be recovered.

1 Server
11 CPU
12 Memory
13 HD controller
14 Signal monitoring unit
15 Hard disk drive
16 Counter reset timer
17 Power switch
18 BMC
19 Server power supply
20 HDD power supply
141 Data fluctuation measurement timer
142 Signal fluctuation determination unit
143 Reset counter
144 Recovery possibility determination unit

Claims

An output abnormality detection unit that detects an output abnormality based on the output data of the hard disk drive;
When an output abnormality is detected by the output abnormality detection unit, a reset unit that performs a reset process that transmits a reset signal to the hard disk drive to restart the hard disk drive;
An HDD power control unit that turns the power of the hard disk drive on and off when the number of reset processes performed by the reset unit exceeds a threshold;
An information processing apparatus comprising: a failure record collection unit that performs a failure record collection process for storing a failure record in the hard disk drive when the hard disk drive is started by the power being turned on and off by the HDD power control unit.

The information processing apparatus according to claim 1, wherein the output abnormality detection unit determines that there is an output abnormality if the output data from the hard disk drive does not change for a predetermined time.

The HDD power control unit determines that the hard disk drive has started when a startup interrupt is generated by the hard disk drive,
The information processing apparatus according to claim 1, wherein the failure record collection unit performs the failure record collection process when the HDD power control unit determines that the hard disk drive has started.

The information processing apparatus according to claim 1, further comprising: a power control unit that turns off the power of the information processing apparatus when the number of times of turning on / off the power by the HDD power control unit exceeds a predetermined number.

The output abnormality detection unit detects an output abnormality of the hard disk drive during the failure record collection process performed by the failure record collection unit,
The reset unit performs the reset process when the output abnormality detection unit detects an output abnormality during the failure record collection process,
The HDD power control unit turns the power of the hard disk drive on and off when the number of reset processes performed by the reset unit exceeds the threshold during the failure record collection process,
The information processing apparatus according to claim 1, wherein the failure record collection unit performs the failure record collection process again when the hard disk drive is started by the power being turned on and off by the HDD power control unit during the failure record collection process.

An output abnormality is detected based on the output data of the hard disk drive,
When the output abnormality is detected, a reset signal is transmitted to the hard disk drive to restart the hard disk drive,
When the number of reset processes exceeds a threshold, the hard disk drive is turned on and off,
When the hard disk drive is activated by turning on and off the power, a failure record collecting process for storing a failure record in the hard disk drive is performed.

An output abnormality is detected based on the output data of the hard disk drive,
When the output abnormality is detected, a reset signal is sent to the hard disk drive to restart the hard disk drive, and
When the number of reset processes exceeds a threshold, the hard disk drive is turned on and off,
An information processing apparatus control program for causing a computer to execute a process of storing a failure record in the hard disk drive when the hard disk drive is normally started by turning the power on and off.