JP2014021577A

JP2014021577A - Apparatus, system, method, and program for failure prediction

Info

Publication number: JP2014021577A
Application number: JP2012157099A
Authority: JP
Inventors: Koichi Kisogawa; 晃一木曽川
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2012-07-13
Filing date: 2012-07-13
Publication date: 2014-02-03

Abstract

PROBLEM TO BE SOLVED: To provide a failure prediction device capable of predicting fault locations in the hardware of an electronic device in advance from the execution time of a diagnostic program run on the electronic device. SOLUTION: A failure prediction device of the present invention includes: measurement means which measures, for each hardware component of an electronic device, the execution time taken to diagnose that component in association with the identification information of the component; and reporting means which reports an anomaly when the execution time for any one of the hardware components exceeds the reference value set for that component.

Description

The present invention relates to a failure prediction device, a failure prediction system, a failure prediction method, and a failure prediction program for detecting, from the execution time of a diagnostic program of an electronic device, a fault that is latent in the hardware of the electronic device and has not yet become apparent.

In the maintenance of an electronic device, a fault is usually diagnosed only after a failure has occurred in the device. The cause is then investigated by analyzing the log recorded at the time of the failure, and the failure is typically handled by replacing parts. However, when this kind of maintenance is applied to an electronic device such as a magnetic disk in particular, the system may become unable to start depending on the severity of the fault, and important data may be lost. To avoid this problem, hardware failures of an electronic device need to be predicted before they become apparent.

As a related technique for predicting such hardware failures in advance, Patent Document 1 discloses an abnormality inspection method that measures and stores the startup time from power-on of an electronic device to completion of the OS startup process, and determines that an abnormality exists when the startup time exceeds a predetermined time or when the deviation of the accumulated startup times exceeds a predetermined range.

Patent Document 2 discloses an apparatus that measures the execution time of a diagnostic program after an electronic device is powered on, and sets the measured time plus a predetermined margin as the timeout time for the next system startup.

Patent Document 1: JP 2012-58782 A
Patent Document 2: JP 2005-165415 A

The techniques of Patent Documents 1 and 2 detect abnormalities on the basis of the startup time of the apparatus as a whole or the execution time of the diagnostic program as a whole. In that case, even if the diagnosis of one hardware part takes longer than usual, the diagnosis execution time of the apparatus as a whole stays within the normal range when the diagnoses of most other parts finish sooner than their reference values, and the fault may be masked. In other words, the techniques of Patent Documents 1 and 2 cannot be expected to predict failures with high accuracy.
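The masking effect described above can be illustrated with a small numeric sketch. All module names and millisecond values below are hypothetical, and the per-module limit of 1.5 times the reference is an assumption chosen only for illustration.

```python
# Hypothetical reference and measured diagnosis times in milliseconds.
reference_ms = {"CPU0": 20, "MEM0": 100, "DISK0": 4000}
measured_ms = {"CPU0": 45, "MEM0": 100, "DISK0": 3900}

# Whole-device check: compare only the totals.
total_reference = sum(reference_ms.values())   # 4120
total_measured = sum(measured_ms.values())     # 4045
whole_device_abnormal = total_measured > total_reference

# Per-module check: compare each module against its own limit
# (assumed here to be 1.5 times its reference time).
slow_modules = [
    module_id
    for module_id, elapsed in measured_ms.items()
    if elapsed > reference_ms[module_id] * 1.5
]

print(whole_device_abnormal)  # False: the slow CPU0 is hidden in the total
print(slow_modules)           # ['CPU0']
```

CPU0 takes more than twice its reference time, yet the total is lower than the reference total because DISK0 happened to finish early.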

Moreover, the techniques of Patent Documents 1 and 2 only indicate that a hardware fault may exist in the electronic device. Identifying which part of the hardware may be faulty has to wait for maintenance personnel to analyze the device log. If locating the suspect part takes so long that the opportunity to replace it is missed, in the worst case an actual failure may occur before the part is replaced.

To avoid these problems, the challenge is to perform failure prediction that can identify, with high accuracy, the location where a fault may exist.

An object of the present invention is to provide a failure prediction device, a failure prediction system, a failure prediction method, and a failure prediction program that solve the above-described problem.

A failure prediction device according to one embodiment of the present invention comprises: measuring means for measuring, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component; and reporting means for reporting an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

A failure prediction method according to one embodiment of the present invention measures, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component, and reports an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

A failure prediction program according to one embodiment of the present invention causes a computer to execute: a measurement process of measuring, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component; and a reporting process of reporting an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

The present invention makes it possible to predict in advance, from the diagnosis execution time of an electronic device, the location of a fault that is latent in the hardware of the electronic device and has not yet become apparent.

FIG. 1 is a block diagram showing the configuration of the failure prediction system according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first embodiment of the present invention.
FIG. 3 is a configuration example of the information stored in the reference execution time storage unit in the first embodiment of the present invention.
FIG. 4 is a configuration example of the allowable time information in the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the failure prediction system according to the second embodiment of the present invention.

A first embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the failure prediction system of the present embodiment.

The failure prediction system 1 of the present embodiment includes an electronic device 2 and a maintenance server 3. The maintenance server 3 is a server through which maintenance personnel input and output information in order to perform maintenance work on the electronic device 2.

The electronic device 2 includes a failure prediction device 10, a power input unit 20, a CPU unit 30, a memory unit 40, a disk unit 50, and a management unit 60.

The power input unit 20 detects power-on of the electronic device 2 and transmits a power-on signal to the failure prediction device 10. The CPU unit 30, the memory unit 40, and the disk unit 50 include a plurality of CPU modules, memory modules, and disk modules, respectively, and are replaced on a per-module basis when a failure occurs. The management unit 60 includes a log 61, manages the failure history of the electronic device 2, and reports the failure status to the maintenance server 3.

The failure prediction device 10 includes a POST (Power On Self Test) execution unit 11, a measurement unit 12, a configuration change detection unit 15, a reference execution time storage unit 16, and a reporting unit 17. The measurement unit 12 includes a POST execution detection unit 13 and a count unit 14.

The POST execution unit 11 stores the POST program, which is the diagnostic program run when the electronic device 2 is powered on. On receiving the power-on signal from the power input unit 20, the POST execution unit 11 starts diagnosing the CPU unit 30, the memory unit 40, and the disk unit 50 in sequence.

The POST execution detection unit 13 detects the start and end of POST execution by the POST execution unit 11 for each module in the CPU unit 30, the memory unit 40, and the disk unit 50. The POST execution detection unit 13 then transmits the detection result to the count unit 14 together with the module ID of the module on which POST is being executed.

On receiving the start of POST execution and the module ID of the module for which POST has started from the POST execution detection unit 13, the count unit 14 starts counting. On receiving the end of POST execution for that module from the POST execution detection unit 13, the count unit 14 stops counting and transmits the count value, together with the module ID of the module, to the reference execution time storage unit 16 and the reporting unit 17.
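The start/stop behavior of the count unit can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and method names are hypothetical, and a software monotonic clock stands in for whatever counter the device actually uses.

```python
import time
from typing import Optional, Tuple


class CountUnit:
    """Sketch of count unit 14: times the diagnosis of one module at a time."""

    def __init__(self) -> None:
        self._module_id: Optional[str] = None
        self._started_at: float = 0.0

    def on_post_start(self, module_id: str) -> None:
        # Notification from the POST execution detection unit: diagnosis began.
        self._module_id = module_id
        self._started_at = time.monotonic()

    def on_post_end(self) -> Tuple[str, float]:
        # Notification that diagnosis ended: return (module ID, elapsed ms),
        # i.e. the pair forwarded to the storage unit and the reporting unit.
        if self._module_id is None:
            raise RuntimeError("on_post_end called before on_post_start")
        elapsed_ms = (time.monotonic() - self._started_at) * 1000.0
        module_id, self._module_id = self._module_id, None
        return module_id, elapsed_ms


counter = CountUnit()
counter.on_post_start("CPU0")
measured_id, measured_ms = counter.on_post_end()
```

A monotonic clock is used so that wall-clock adjustments during startup cannot produce negative or inflated durations.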

At the first power-on of the electronic device 2 after a configuration change caused by the addition or removal of a module in the CPU unit 30, the memory unit 40, or the disk unit 50, the configuration change detection unit 15 detects the content of the configuration change and transmits the detection result to the reference execution time storage unit 16.

The reference execution time storage unit 16 stores, for each module of the CPU unit 30, the memory unit 40, and the disk unit 50, the reference execution time of POST and information indicating whether the module is newly added. FIG. 3 shows a configuration example of the information stored in the reference execution time storage unit 16.

In the example of FIG. 3, the CPU unit 30 includes four CPU modules with module IDs CPU0 to CPU3. The memory unit 40 includes eight memory modules with module IDs MEM0 to MEM7. The disk unit 50 includes four disk modules with module IDs DISK0 to DISK3.

The reference execution time indicates the standard POST execution time of each module. At the first power-on after a module is newly added, the reference execution time storage unit 16 stores the count value received from the count unit 14 as the reference execution time of that module.

The "newly added" item is a flag indicating that the module has been newly added. Based on the configuration change detection result received from the configuration change detection unit 15, the reference execution time storage unit 16 adds a record for each newly added module and sets the newly added flag of that record. In the example of FIG. 3, modules whose newly added flag is 1 are newly added: CPU3, MEM6, MEM7, and DISK3. For a newly added module the reference execution time has not yet been set, and the reference execution time storage unit 16 stores it from the module ID and count value received from the count unit 14.
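The record handling described above can be sketched as a small in-memory table. The class, field, and method names are hypothetical. Clearing the flag after the first measurement is an assumption: the text states only that the first count value after the addition becomes the reference.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    reference_ms: Optional[float]  # None until the first measurement arrives
    newly_added: bool              # the "newly added" flag of FIG. 3


class ReferenceStore:
    """Sketch of reference execution time storage unit 16."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def on_module_added(self, module_id: str) -> None:
        # Detection result from the configuration change detection unit.
        self.records[module_id] = Record(reference_ms=None, newly_added=True)

    def on_count(self, module_id: str, count_ms: float) -> None:
        # Count value from the count unit: adopt it as the reference only
        # for a newly added module.
        record = self.records[module_id]
        if record.newly_added:
            record.reference_ms = count_ms
            record.newly_added = False  # assumption: flag is cleared here


store = ReferenceStore()
store.on_module_added("CPU3")
store.on_count("CPU3", 21.0)   # first power-on after the addition
store.on_count("CPU3", 40.0)   # later power-on: reference is unchanged
print(store.records["CPU3"].reference_ms)  # 21.0
```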

On receiving a module ID and a count value from the count unit 14, the reporting unit 17 reads, from the reference execution time storage unit 16, the reference execution time of the record whose module ID matches the received module ID. The reporting unit 17 then adds the value of the allowable time 170 corresponding to the received module to the reference execution time read from the reference execution time storage unit 16.

The allowable time 170 indicates the permissible difference between the count value and the reference execution time when the reporting unit 17 determines whether a module is abnormal from the count value received from the count unit 14 and the reference execution time read from the reference execution time storage unit 16. FIG. 4 shows a configuration example of the allowable time 170 information.

In the example of FIG. 4, the allowable time 170 contains the values 10 milliseconds, 50 milliseconds, and 2000 milliseconds as the allowable times for the CPU unit 30, the memory unit 40, and the disk unit 50, respectively. As shown in FIG. 3, the average reference execution times of the modules of the CPU unit 30, the memory unit 40, and the disk unit 50 are about 20 milliseconds, 100 milliseconds, and 4000 milliseconds, respectively. In the present embodiment, therefore, the allowable time 170 is set to about half of the standard reference execution time. This is only an example, and the allowable time 170 may be set according to a different criterion. The value of the allowable time 170 may be set manually by the administrator of the failure prediction system 1, or the failure prediction device 10 may calculate a standard reference execution time from the reference execution times of the modules stored in the reference execution time storage unit 16 and set the allowable time from that value using a predetermined formula.
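One possible form of the automatic setting is sketched below. The text does not specify the predetermined formula, so both the use of the arithmetic mean as the "standard" reference time and the ratio of 0.5 are assumptions taken from the half-of-reference example. The function name and the per-module input values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def allowable_time_ms(reference_times_ms: Sequence[float], ratio: float = 0.5) -> float:
    """Derive a unit's allowable time from its modules' reference times.

    Assumption: standard reference time = arithmetic mean, scaled by `ratio`.
    """
    return mean(reference_times_ms) * ratio


# Illustrative per-module reference times (ms) for already-registered modules.
cpu_allowable = allowable_time_ms([19, 20, 21])
mem_allowable = allowable_time_ms([100, 100, 100, 100, 100, 100])
disk_allowable = allowable_time_ms([4000, 4000, 4000])
print(cpu_allowable, mem_allowable, disk_allowable)  # 10.0 50.0 2000.0
```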

The reporting unit 17 compares the sum of the reference execution time and the allowable time of the module with the count value received from the count unit 14. If the count value is larger, the reporting unit 17 determines that the module under diagnosis is abnormal and reports the abnormality to the management unit 60 together with the module ID of the module.

For example, for the module CPU0, the reference execution time is 20 milliseconds as shown in FIG. 3, and the allowable time of the CPU unit is 10 milliseconds as shown in FIG. 4. Therefore, when the count value for CPU0 received from the count unit 14 is larger than 30 milliseconds, the sum of these two values, the reporting unit 17 reports an abnormality of the CPU0 module to the management unit 60.
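The CPU0 example can be written as a short check. The allowable times are the FIG. 4 values quoted in the text. The function names, and the rule of deriving the unit from the module ID by stripping trailing digits, are assumptions made for this sketch.

```python
# Allowable times per unit in milliseconds, as quoted from the FIG. 4 example.
ALLOWABLE_MS = {"CPU": 10, "MEM": 50, "DISK": 2000}


def unit_of(module_id: str) -> str:
    # Assumption: the unit name is the module ID without its trailing digits.
    return module_id.rstrip("0123456789")


def is_abnormal(module_id: str, count_ms: float, reference_ms: float) -> bool:
    # Abnormal only when the count strictly exceeds reference + allowable time.
    return count_ms > reference_ms + ALLOWABLE_MS[unit_of(module_id)]


print(is_abnormal("CPU0", 31, 20))  # True: 31 ms > 20 ms + 10 ms
print(is_abnormal("CPU0", 30, 20))  # False: exactly at the limit
```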

Next, the operation of the present embodiment will be described in detail with reference to the flowchart of FIG. 2.

The POST execution unit 11 receives a power-on signal from the power input unit 20 (S101). The POST execution unit 11 starts diagnosing the CPU unit 30, the memory unit 40, and the disk unit 50 in order (S102).

The POST execution detection unit 13 detects the start of diagnosis by the POST execution unit 11 and notifies the count unit 14 of the start together with the module ID of the module whose diagnosis has started (S103). The count unit 14 starts counting (S104). The POST execution detection unit 13 detects the end of diagnosis of the module by the POST execution unit 11 and notifies the count unit 14 (S105). The count unit 14 stops counting and transmits the count value and the module ID of the module to the reference execution time storage unit 16 and the reporting unit 17 (S106).

The reference execution time storage unit 16 checks the newly added flag of the record whose module ID matches the module ID received from the count unit 14 (S107). If the value of the newly added flag is 1 (Yes in S108), the reference execution time storage unit 16 writes the count value received from the count unit 14 into the reference execution time item of the record corresponding to the module (S109). If the value of the newly added flag is 0 (No in S108), the process proceeds to S110.

The reporting unit 17 reads the reference execution time of the record corresponding to the module from the reference execution time storage unit 16 and adds to it the allowable time, listed in the allowable time 170, of the unit to which the module belongs. The reporting unit 17 then compares this sum with the count value received from the count unit 14 (S110). If the count value is larger than the sum (Yes in S111), the reporting unit 17 reports the abnormality of the module to the management unit 60 together with the module ID (S112). If the count value is equal to or less than the sum (No in S111), the process proceeds to S113. If the module is not the last module to be diagnosed by the POST execution unit 11 (No in S113), the process returns to S103. If the module is the last module to be diagnosed by the POST execution unit 11 (Yes in S113), the entire process ends.
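The flow from S102 to S113 can be condensed into a single loop. This is a sketch under stated assumptions rather than the implementation of the embodiment: `run_post_cycle` and its parameters are hypothetical, `measure_ms` stands in for the POST execution detection unit and count unit together, the returned list stands in for reports to the management unit 60, and clearing the newly added flag after S109 is assumed.

```python
from typing import Callable, Dict, List


def run_post_cycle(
    module_ids: List[str],
    measure_ms: Callable[[str], float],
    reference_ms: Dict[str, float],
    newly_added: Dict[str, bool],
    allowable_ms: Dict[str, float],
) -> List[str]:
    """Return the IDs of modules judged abnormal in one power-on cycle."""
    abnormal: List[str] = []
    for module_id in module_ids:                     # S102, repeated until S113
        count = measure_ms(module_id)                # S103 to S106
        if newly_added.get(module_id, False):        # S107, S108
            reference_ms[module_id] = count          # S109
            newly_added[module_id] = False           # assumption: clear the flag
        unit = module_id.rstrip("0123456789")        # assumed ID-to-unit mapping
        limit = reference_ms[module_id] + allowable_ms[unit]   # S110
        if count > limit:                            # S111
            abnormal.append(module_id)               # S112
    return abnormal


# Illustrative data: CPU3 is newly added, CPU0 has become slow.
measured = {"CPU0": 35.0, "CPU3": 22.0, "DISK0": 4100.0}
references = {"CPU0": 20.0, "DISK0": 4000.0}
flags = {"CPU3": True}
allowable = {"CPU": 10.0, "MEM": 50.0, "DISK": 2000.0}

result = run_post_cycle(list(measured), measured.__getitem__, references, flags, allowable)
print(result)              # ['CPU0']
print(references["CPU3"])  # 22.0
```

A newly added module can never be reported in its first cycle, because its own count value becomes the reference before the comparison in S110.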

The present embodiment has the effect of predicting in advance the location in the hardware of an electronic device where a fault may exist. This is because the reporting unit 17 compares the POST execution time of each module counted by the count unit 14 with the reference POST execution time of that module stored in the reference execution time storage unit 16, and reports that a module is abnormal when its counted time exceeds the reference by more than the predetermined allowable value.

Before the hardware of an electronic device fails completely, it may remain for some time in a state in which data read and write operations are retried repeatedly. During this period, retries also occur while POST is running, so the POST execution time becomes longer than normal. Comparing the POST execution time with the normal POST execution time therefore makes it possible to predict a hardware failure in advance.

In the present embodiment, the POST execution times are compared for each hardware module, so failures can be predicted with higher accuracy than with the conventional method of comparing the POST execution time of the electronic device as a whole. A module whose POST execution time exceeds the normal value by more than the allowable value can be identified in advance as a module that may be faulty. Replacing that module promptly, before it actually fails, makes it possible to avoid a system outage.

<Second Embodiment>

Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 5 is a block diagram showing the configuration of the failure prediction system according to the second embodiment of the present invention.

The failure prediction system 1 includes a failure prediction device 10. The failure prediction device 10 includes a measurement unit 12 and a reporting unit 17. On the basis of execution information from a diagnostic program for the hardware components of the electronic device, for example from a POST execution unit that executes POST, the measurement unit 12 measures the POST execution time for each hardware module in association with information identifying that module.

The reporting unit 17 reports an abnormality when, for any hardware module, the POST execution time exceeds the reference value set for that module.

Like the first embodiment, the present embodiment has the effect of predicting in advance, with high accuracy, the location in the hardware of an electronic device where a fault may exist. This is because the reporting unit 17 reports an abnormality when there is a module whose POST execution time, as measured by the measurement unit 12, is longer than the reference value.

In the present embodiment, the value on which the above reference value is based may be stored in advance in a storage unit such as the reference execution time storage unit 16 of the first embodiment, or may be entered manually by the user of the failure prediction system 1 when POST is executed.

The present invention has been described above with reference to the embodiments, but the present invention is not limited to those embodiments. Various changes that can be understood by those skilled in the art may be made to the configuration and details of the present invention within its scope.

DESCRIPTION OF SYMBOLS
1 Failure prediction system
2 Electronic device
3 Maintenance server
10 Failure prediction device
11 POST execution unit
12 Measurement unit
13 POST execution detection unit
14 Count unit
15 Configuration change detection unit
16 Reference execution time storage unit
17 Reporting unit
170 Allowable time
20 Power input unit
30 CPU unit
40 Memory unit
50 Disk unit
60 Management unit
61 Log

Claims

1. A failure prediction device comprising:
measuring means for measuring, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component; and
reporting means for reporting an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

2. The failure prediction device according to claim 1, further comprising detecting means for detecting a configuration change of the hardware component,
wherein the reporting means sets, as the reference value, a value obtained by adding a predetermined allowable value to the execution time of the hardware component measured when the detecting means detected the configuration change.

3. The failure prediction device according to claim 1, wherein the reporting means reports the identification information of the hardware component in which the abnormality was detected.

4. A failure prediction system comprising the failure prediction device according to claim 1, the electronic device, and a maintenance server connected to the electronic device.

5. A failure prediction method comprising:
measuring, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component; and
reporting an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

6. The failure prediction method according to claim 5, further comprising detecting a configuration change of the hardware component,
wherein a value obtained by adding a predetermined allowable value to the execution time of the hardware component measured when the configuration change was detected is set as the reference value.

7. The failure prediction method according to claim 5, wherein the identification information of the hardware component in which the abnormality was detected is reported.

8. A failure prediction program that causes a computer to execute:
a measurement process of measuring, for each hardware component included in an electronic device, the execution time required to diagnose the hardware component in association with identification information of the hardware component; and
a reporting process of reporting an abnormality when, for any of the hardware components, the execution time exceeds a reference value set for that hardware component.

9. The failure prediction program according to claim 8, further causing the computer to execute a detection process of detecting a configuration change of the hardware component,
wherein the reporting process sets, as the reference value, a value obtained by adding a predetermined allowable value to the execution time of the hardware component measured when the detection process detected the configuration change.

10. The failure prediction program according to claim 8, wherein the reporting process reports the identification information of the hardware component in which the abnormality was detected.