JPH07262054A

JPH07262054A - Failure information management system

Info

Publication number: JPH07262054A
Application number: JP6047075A
Authority: JP
Inventors: Haruto Watanabe; 晴人渡辺; Shinichi Isobe; 真一磯部; Yasuhiro Ishii; 保弘石井
Original assignee: Hitachi Ltd; Hitachi Information Network Ltd
Current assignee: Hitachi Ltd; Hitachi Information Systems Ltd
Priority date: 1994-03-17
Filing date: 1994-03-17
Publication date: 1995-10-13

Abstract

PURPOSE: To provide a failure information management system that can manage thresholds in varied ways according to the type and importance of errors and to how the system is operated. CONSTITUTION: When a failure occurs, an error code is produced and stored in a detailed error log file. At the same time, the error code and the number of error occurrences (e) are input to a threshold management part 5. The part 5 reads a threshold management table 10a out of a threshold management file. The table 10a contains entries for a management cycle (c), a lower limit (l), an upper limit (u), an error mask code, an error comparison code, etc. A logical operation is then performed between the error code and the error mask code, the result is compared with the error comparison code to identify the type of the error, and the error is judged by the conditional expression [l <= (e mod c) <= u]. When this expression is satisfied, the failure indicated by the error code is judged to be outside its allowable range, and the error is reported to a maintenance center.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a failure information management technique used to improve the reliability and maintainability of information processing devices.

[0002]

[Prior Art] In an information processing device, for example, the error codes created when a failure occurs are indispensable for maintaining and operating the system. These error codes have diversified as systems have grown larger and more complex. In addition, a failure can affect a wide area of the system, and finding its cause can take a great deal of time. The creation and storage of error codes, and preventive maintenance, therefore need to accommodate these conditions.

[0003] As a conventional failure information management system based on threshold management, the technique described in Japanese Patent Laid-Open No. 3-250227 is known. That technique has an accumulation table holding a plurality of accumulation records, each consisting of a failure code and a failure counter, and a reference table holding a plurality of reference records, each consisting of a failure code and a threshold. The reference records of the reference table are examined in order. If the failure code of a reference record matches that of the accumulation record, the failure counter is compared with the threshold, and if the failure counter is larger than the threshold, the failure code is sent to a monitoring host and the failure counter is reset. If no reference record matches the failure code of the accumulation record, the failure code is sent to the monitoring host and the failure counter in the accumulation record is reset.
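The conventional single-threshold scheme summarized above can be sketched roughly as follows. This is an illustrative reading of the description, not the code of the cited publication: the function name `record_failure`, the dictionary-based tables, the `send` callback, and the choice to increment the counter once per occurrence are all assumptions.

```python
from typing import Callable, Dict, List


def record_failure(
    code: int,
    counters: Dict[int, int],
    thresholds: Dict[int, int],
    send: Callable[[int], None],
) -> None:
    """Count one occurrence of `code` and apply the single-threshold rule."""
    # Assumption: the failure counter is incremented once per occurrence.
    counters[code] = counters.get(code, 0) + 1
    threshold = thresholds.get(code)
    # No reference record, or counter larger than the threshold:
    # notify the monitoring host and reset the counter.
    if threshold is None or counters[code] > threshold:
        send(code)
        counters[code] = 0


sent: List[int] = []
counters: Dict[int, int] = {}
thresholds = {0x10: 2}  # hypothetical failure code and threshold

for _ in range(6):
    record_failure(0x10, counters, thresholds, sent.append)
record_failure(0x20, counters, thresholds, sent.append)  # unregistered code
```

In this sketch the registered code is reported on every third occurrence, because the counter starts again from zero after each report. A report that fires only once is not expressible without removing the reference record.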

[0004]

[Problems to Be Solved by the Invention] In the conventional failure information management system described above, every value that a failure code subject to threshold management can take must be registered in the reference table, so the reference table grows. Furthermore, because only one threshold is registered in the reference record for a given failure code, threshold management is limited to two methods: treating the threshold as exceeded every time the failure occurs, as in FIG. 4(a) described later, or treating it as exceeded when the failure count reaches a multiple of the set threshold, as in FIG. 4(f). For example, to survey the frequency of data corruption on a line, it would suffice for the threshold to be exceeded exactly once, at a certain count, and for the monitoring host to be notified at that point, but the conventional system cannot perform this kind of threshold management. To have the threshold exceeded only once at a certain count and never afterwards, the threshold management of FIG. 4(f) must be used and, after the threshold has been exceeded and the monitoring host notified, the reference record for that failure code must be deleted from the reference table. This loses the failure history and complicates operation. If the record is not deleted, the failure counter is reset and then exceeds the threshold again.

[0005] An object of the present invention is to provide a failure information management system that, for the purpose of preventive maintenance, can manage thresholds in varied ways according to the type and importance of errors and to system operation.

[0006] Another object of the present invention is to provide a failure information management system that can simplify the management of error codes.

[0007]

[Means for Solving the Problems] FIG. 2 is a block diagram showing an example of the principle of threshold management in the failure information management system of the present invention.

[0008] When an error occurs in the information processing device, an error code is created, and the threshold management unit 5 receives the error code with the number of error occurrences (e) appended. The threshold management table 10a that the threshold management unit 5 uses for threshold management holds a plurality of reference unit records. Each reference unit record consists of a management cycle (c), which gives the interval, in number of error occurrences, over which threshold management is applied; a lower limit (l) on the number of error occurrences; an upper limit (u) on the number of error occurrences; an error mask code, which is a mask value for the error code; and an error comparison code, which is compared with the result of the logical AND or logical OR of the error code and the error mask code. The threshold management unit examines the reference unit records of the threshold management table in order and takes the logical AND or logical OR of the error code and the error mask code. If the result matches the error comparison code, the error code is judged to be subject to threshold management, and the number of error occurrences is taken modulo the management cycle of that reference unit record. If this value is at least the lower limit and at most the upper limit, [l ≤ (e mod c) ≤ u], the failure is judged to have exceeded the allowable range. If no reference unit record in the threshold management table yields a logical AND or logical OR of the error code and the error mask code that matches the error comparison code, the error code is judged not to be subject to threshold management. If the error code is judged to be subject to threshold management but the number of error occurrences modulo the management cycle of the reference unit record does not lie between the lower limit and the upper limit inclusive, the failure is judged to be within the allowable range.
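The two tests described in this paragraph, the mask-and-compare match and the condition l ≤ (e mod c) ≤ u, can be sketched as below. The record layout follows the fields named in the text, but the class and function names, the use of `None` to encode an unbounded cycle or upper limit, and the `use_or` switch between logical AND and logical OR are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceUnitRecord:
    cycle: Optional[int]   # management cycle c; None = no cycle (assumed encoding)
    lower: int             # lower limit l
    upper: Optional[int]   # upper limit u; None = unbounded (assumed encoding)
    mask: int              # error mask code
    compare: int           # error comparison code
    use_or: bool = False   # hypothetical switch: logical OR instead of AND


def is_managed(error_code: int, rec: ReferenceUnitRecord) -> bool:
    """True if the masked error code equals the error comparison code."""
    combined = error_code | rec.mask if rec.use_or else error_code & rec.mask
    return combined == rec.compare


def exceeds_allowance(count: int, rec: ReferenceUnitRecord) -> bool:
    """True if l <= (e mod c) <= u holds for the occurrence count e."""
    residue = count if rec.cycle is None else count % rec.cycle
    if residue < rec.lower:
        return False
    return rec.upper is None or residue <= rec.upper


# Hypothetical record: codes whose high nibble is 0x3, flagged every 5th occurrence.
rec = ReferenceUnitRecord(cycle=5, lower=0, upper=0, mask=0xF0, compare=0x30)
```

With this record, error code `0x3A` is subject to threshold management and is flagged on its 10th occurrence but not on its 7th, while `0x4A` is not managed at all.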

[0009]

[Operation] According to the present invention, combinations of the lower limit, the upper limit, and the management cycle allow the various threshold management methods shown in FIGS. 4(a) to 4(g), so failure thresholds can be managed in varied ways according to the type and importance of errors and to system operation. In addition, by structuring the contents of the error code hierarchically according to hardware and failure classification, and by using the error mask code to separate failures into groups, growth of the threshold management table is prevented and its size is reduced.

[0010]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0011] FIG. 1 is a block diagram showing an example of the configuration of an information processing system in which a failure information management system according to an embodiment of the present invention is implemented.

[0012] In FIG. 1, 1 is a detailed error log collection unit that, when a failure occurs, collects a detailed error log whose content differs from failure to failure and stores it in a detailed error log file 8; 2 is an error code creation unit that creates, from the detailed error log, an error code expressing the various failures in a unified format; 3 is an error code creation table that the error code creation unit 2 uses to create error codes; 4 is an error code storage unit that stores error codes in an error code file 9; 5 is a threshold management unit that performs threshold management using the error code with the number of error occurrences appended and the threshold management table 10a, which has the structure shown in FIG. 2 and is stored in a threshold management file 10, and that stores threshold excess information in a threshold excess information file 11 when it detects that a threshold has been exceeded; 6 is a failure information input/output unit that, through an input/output device 7 such as a keyboard or CRT, outputs and deletes the data stored in the detailed error log file 8, outputs the data stored in the error code file 9 and deletes error codes, inputs, outputs, and deletes the data of the threshold management table 10a in the threshold management file 10, and outputs and deletes the threshold excess information data in the threshold excess information file 11; and 12 is a communication control unit that controls a communication device 13 to send the relevant information to a maintenance center 14 when the error code storage unit 4 detects that the size of the error code file 9 has exceeded a specified value or when the threshold management unit 5 detects that an error code has exceeded its threshold.

[0013] FIG. 3 is a flowchart showing an example of the error code logging process in the failure information management system of this embodiment. The error code logging process, starting from the occurrence of a failure, is described below with reference to FIGS. 1, 2, and 3.

[0014] (1) When a failure occurs, the detailed error log collection unit 1 collects a detailed error log, whose content and data length differ from failure to failure, and stores it in the detailed error log file 8. It also passes the collected detailed error log to the error code creation unit 2 (step 100).

[0015] (2) Using the detailed error log received from the detailed error log collection unit 1 and the error code creation table 3, the error code creation unit 2 creates an error code that expresses the failure in a unified format and passes it to the error code storage unit 4 (step 101).

[0016] (3) The error code storage unit 4 stores the error code created by the error code creation unit 2 in the error code file 9 and passes the error code, with the number of error occurrences appended, to the threshold management unit 5 (step 102).

[0017] (4) The threshold management unit 5 searches the threshold management table 10a in the threshold management file 10 for a reference unit record whose valid/invalid flag is set to valid (step 200).

[0018] (5) It checks whether all valid reference unit records have been examined, and if so, ends the process (step 201).

[0019] (6) When a valid reference unit record is found, the logical AND of the error code received from the error code storage unit 4 and the error mask code of that record is computed (step 202).

[0020] (7) The result of the logical AND of the error code and the error mask code of the valid reference unit record is compared with the error comparison code of that record. If they do not match, the process returns to (4) (step 200) (step 203).

[0021] (8) If the result of the logical AND of the error code and the error mask code of the valid reference unit record matches the error comparison code, the number of error occurrences modulo the management cycle of that record is computed (step 204).

[0022] (9) If the number of error occurrences modulo the management cycle of the valid reference unit record does not lie between that record's lower limit and upper limit inclusive, the number of occurrences of the error code is judged not to have exceeded the threshold (the failure is within the allowable range), and the process ends (step 205).

[0023] (10) If the number of error occurrences modulo the management cycle of the valid reference unit record lies between that record's lower limit and upper limit inclusive, the number of occurrences of the error code is judged to have exceeded the threshold, and threshold excess information is stored in the threshold excess information file 11 (step 206).

[0024] (11) The communication control unit 12 checks whether the information processing device is connected to the maintenance center 14 by a communication line or the like. If it is not connected to the maintenance center 14, the process ends (step 207).

[0025] (12) If the information processing device is connected to the maintenance center 14 by a communication line or the like, the communication control unit controls the communication device 13 to send the threshold excess information stored in the threshold excess information file 11 to the maintenance center 14, and the process ends (step 208).
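Steps 200 to 208 above can be sketched as a single loop over the valid records. This is a minimal illustration under stated assumptions: the names `Record`, `check_threshold`, and `handle` are invented here, the verdict strings are arbitrary labels, an in-memory list stands in for the threshold excess information file 11, and passing `send=None` models a device that is not connected to the maintenance center.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Record:
    valid: bool    # valid/invalid flag
    mask: int      # error mask code
    compare: int   # error comparison code
    cycle: int     # management cycle (finite and positive in this sketch)
    lower: int     # lower limit
    upper: int     # upper limit


def check_threshold(error_code: int, count: int, table: Iterable[Record]) -> str:
    for rec in table:                                # steps 200-201: scan valid records
        if not rec.valid:
            continue
        if (error_code & rec.mask) != rec.compare:   # steps 202-203: mask and compare
            continue
        residue = count % rec.cycle                  # step 204: modulo
        if rec.lower <= residue <= rec.upper:        # steps 205-206: range test
            return "exceeded"
        return "within"                              # first matching record decides
    return "unmanaged"                               # no valid record matched


def handle(
    error_code: int,
    count: int,
    table: Iterable[Record],
    excess_log: List[Tuple[int, int]],
    send: Optional[Callable[[Tuple[int, int]], None]] = None,
) -> str:
    verdict = check_threshold(error_code, count, table)
    if verdict == "exceeded":
        info = (error_code, count)
        excess_log.append(info)   # step 206: store threshold excess information
        if send is not None:      # step 207: connected to the maintenance center?
            send(info)            # step 208: report
    return verdict


table = [
    Record(False, 0xFF, 0x32, 1, 0, 0),  # invalidated record, skipped
    Record(True, 0xF0, 0x30, 4, 0, 0),   # flag every 4th occurrence of 0x3_
]
log: List[Tuple[int, int]] = []
reported: List[Tuple[int, int]] = []
results = [
    handle(0x32, 8, table, log, reported.append),
    handle(0x32, 9, table, log, reported.append),
    handle(0x52, 8, table, log, reported.append),
    handle(0x32, 12, table, log),        # not connected: logged, not reported
]
```

As in the flowchart, the first matching valid record decides the outcome, so the order of records in the table matters when masks overlap.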

[0026] FIG. 4 shows examples of the various threshold management methods obtained by combining the lower limit (l), the upper limit (u), and the management cycle (c), which are the parameters of the threshold management table 10a in this embodiment. Each threshold management method, and an example of a failure for which it is effective, is described below with reference to FIG. 4.

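[0027] FIG. 4(a) shows the threshold management method for l = 0, u = i − 1, c = i (i > 0), in which the allowable range is judged to be exceeded every time the corresponding failure occurs. An example of a failure suited to this method is a severe failure that requires immediate countermeasures.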
This shows the threshold value management method in the case of (i> 0), and management is performed to determine that the allowable range is exceeded each time a corresponding failure occurs. An example of a failure that uses this threshold value management method is a serious failure that requires immediate countermeasures.

[0028] FIG. 4(b) shows the threshold management method for l = 0, u = i, c = ∞ (i > 0), in which the allowable range is judged to be exceeded from the first occurrence of the corresponding failure up to the i-th occurrence.

[0029] FIG. 4(c) shows the threshold management method for l = i, u = ∞, c = ∞ (i > 0), in which the allowable range is judged to be exceeded from the i-th occurrence of the corresponding failure onward. An example of a failure suited to this method is one that was not severe when the system went into operation but later became severe because of a system change or the like.

[0030] FIG. 4(d) shows the threshold management method for l = i, u = i, c = ∞ (i > 0), in which the allowable range is judged to be exceeded only when the number of occurrences of the corresponding failure is exactly i. An example of a use for this method is a survey of the frequency of data corruption failures on a line.

[0031] FIG. 4(e) shows the threshold management method for l = i, u = j, c = ∞ (0 < i < j), in which the allowable range is judged to be exceeded when the number of occurrences of the corresponding failure is from i to j.

[0032] FIG. 4(f) shows the threshold management method for lower limit = 0, upper limit = 0, management cycle = n (n > 0), in which the allowable range is judged to be exceeded when the number of occurrences of the corresponding failure is a multiple of n. An example of a failure suited to this method is one whose frequency of occurrence and countermeasures, such as part replacement, are already known, so that a report at every multiple of a certain count is sufficient.

[0033] FIG. 4(g) shows the threshold management method for lower limit = i, upper limit = j, management cycle = n (0 < i < j < n), in which the allowable range is judged to be exceeded when, within each cycle of n occurrences, the number of occurrences of the corresponding failure is at least i and at most j.
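The seven parameter combinations of FIGS. 4(a) to 4(g) can be checked against the condition l ≤ (e mod c) ≤ u with a short sketch. The concrete values i = 3, j = 5, n = 8 are illustrative only, the function names are invented here, and `None` is an assumed stand-in for the ∞ entries; with c = ∞ the modulo is taken to leave e unchanged.

```python
from typing import Dict, List, Optional

INF = None  # assumed stand-in for the "∞" entries of FIG. 4


def exceeds(e: int, l: int, u: Optional[int], c: Optional[int]) -> bool:
    """Evaluate l <= (e mod c) <= u, treating c = ∞ as 'no modulo'."""
    r = e if c is INF else e % c
    return r >= l and (u is INF or r <= u)


def triggered(l: int, u: Optional[int], c: Optional[int], last: int = 16) -> List[int]:
    """Occurrence counts 1..last at which the allowable range is exceeded."""
    return [e for e in range(1, last + 1) if exceeds(e, l, u, c)]


i, j, n = 3, 5, 8  # illustrative values satisfying 0 < i < j < n
patterns: Dict[str, List[int]] = {
    "a": triggered(0, i - 1, i),   # every occurrence
    "b": triggered(0, i, INF),     # first i occurrences
    "c": triggered(i, INF, INF),   # from the i-th occurrence onward
    "d": triggered(i, i, INF),     # exactly the i-th occurrence
    "e": triggered(i, j, INF),     # i-th through j-th occurrence
    "f": triggered(0, 0, n),       # every multiple of n
    "g": triggered(i, j, n),       # i-th through j-th within each cycle of n
}
```

For these values, pattern (d) yields only the count 3, which is the one-shot report that the single-threshold prior art could not express without deleting a record.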

[0034] FIG. 5 is a conceptual diagram showing an example of the error code system and of grouping for threshold management in this embodiment.

[0035] FIG. 5(a) shows the error codes for each unit of the device. Units are divided into CPU, bus adapter, device, and so on, and are identified by the first 4 bits of the error code.

[0036] FIG. 5(b) shows the error codes for each device. Devices are divided into cassette magnetic tape unit (CMT), digital audio tape unit (DAT), hard disk drive (HDD), and so on, and are identified by the first byte of the error code.

[0037] FIG. 5(c) shows the error codes for the major categories of DAT failures. These are divided into timeout failures, adapter failures, device failures, retry failures, and so on, and are identified by the first 2 bytes of the error code.

[0038] FIG. 5(d) shows the minor categories of DAT retry failures. Retry failures are divided into "tape loading does not complete", "the received command parameters contain an invalid field", and so on, and are identified by the first 6 bytes of the error code.

[0039] With the error code system of FIGS. 5(a) to 5(d), when, for example, a DAT retry failure occurs in which tape loading does not complete, failures can be grouped hierarchically at any of the following levels: device; DAT; DAT retry failure; or DAT retry failure in which tape loading does not complete (no grouping). The grouping is performed by setting the error mask code and error comparison code shown in FIG. 5(e) in a reference unit record of the threshold management table 10a.
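The hierarchical grouping can be illustrated by deriving an error mask code from the number of leading bits that identify each level (4 bits for the unit, 1 byte for the device, 2 bytes for the major category, 6 bytes for the minor category). The text does not reproduce the values of FIG. 5(e), so everything concrete below is hypothetical: the 64-bit code width, the two sample error codes, and the helper names.

```python
WIDTH_BITS = 64  # assumption: the total error code length is not stated in the text


def prefix_mask(prefix_bits: int, width_bits: int = WIDTH_BITS) -> int:
    """Error mask code that keeps only the leading `prefix_bits` bits."""
    return ((1 << prefix_bits) - 1) << (width_bits - prefix_bits)


def group_key(error_code: int, prefix_bits: int) -> int:
    """Masked code; usable as the error comparison code for that group."""
    return error_code & prefix_mask(prefix_bits)


# Hypothetical error codes (not taken from FIG. 5).
DAT_TAPE_LOAD = 0x32040000000100A7  # unit 0x3, device 0x32, category 0x3204
HDD_FAILURE = 0x33010000000200FF    # unit 0x3, device 0x33

# Leading bits that identify each level, as described for FIGS. 5(a)-5(d).
LEVELS = {"unit": 4, "device": 8, "major": 16, "minor": 48}


def same_group(a: int, b: int, level: str) -> bool:
    """True if both codes fall into the same group at the given level."""
    bits = LEVELS[level]
    return group_key(a, bits) == group_key(b, bits)
```

A reference unit record for one group would then carry `prefix_mask(bits)` as its error mask code and `group_key(code, bits)` as its error comparison code, so a single record covers every error code that shares the prefix.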

[0040] As described above, according to this embodiment, an error code that expresses each failure in a unified format is created from detailed error logs whose content and data length differ from failure to failure. This simplifies the failure analysis procedure and shortens the time needed to deal with a failure. Even when the error code alone is not enough for failure analysis, the detailed error log is kept separately from the error code, so a more detailed analysis remains possible. In addition, because error code threshold management uses an upper limit, a lower limit, and a management cycle as thresholds, together with the number of error occurrences, the threshold management method can be varied, allowing varied management according to the type and importance of failures and to system operation.

[0041] In addition, the size of the threshold management table 10a can be reduced in two ways: error codes subject to threshold management are grouped hierarchically by failure using the error mask code and the error comparison code, and each reference unit record of the threshold management table 10a carries a valid/invalid flag so that reference unit records can be deleted. Furthermore, the failure information, namely the detailed error log data, the error codes, the threshold management table 10a, and the threshold excess information, can be input and output through the input/output device 7 and the failure information input/output unit 6, so the system operator can check the failure information whenever needed.

[0042]

[Effects of the Invention] As described above, according to the failure information management system of the present invention, error code threshold management uses an upper limit, a lower limit, and a management cycle as thresholds, together with the number of error occurrences, so the threshold management method can be varied in many ways, allowing varied management according to the type and importance of failures and to system operation. In addition, grouping the error codes subject to threshold management hierarchically by failure, using the error mask code and the error comparison code, prevents growth of the threshold management table and reduces its size. Furthermore, connecting the information processing device to a maintenance center over a communication line or the like, reporting threshold excess information automatically, and performing remote maintenance improve the reliability and maintainability of the information processing device.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of the configuration of an information processing system in which a failure information management system according to an embodiment of the present invention is implemented.

FIG. 2 is a block diagram showing an example of the principle of threshold management in the failure information management system of the present invention.

FIG. 3 is a flowchart showing an example of the operation of the failure information management system according to an embodiment of the present invention.

FIGS. 4(a) to 4(g) are conceptual diagrams showing examples of threshold management in the failure information management system according to an embodiment of the present invention.

FIG. 5 is a conceptual diagram showing an example of the error code system and of grouping for threshold management in the failure information management system according to an embodiment of the present invention.

[Explanation of symbols]

1: detailed error log collection unit, 2: error code creation unit, 3: error code creation table, 4: error code storage unit, 5: threshold management unit, 6: failure information input/output unit, 7: input/output device, 8: detailed error log file, 9: error code file, 10: threshold management file, 10a: threshold management table, 11: threshold excess information file, 12: communication control unit, 13: communication device, 14: maintenance center

Continuation of the front page: (72) Inventor: Yasuhiro Ishii, c/o Office Systems Division, Hitachi, Ltd., 810 Shimoimaizumi, Ebina-shi, Kanagawa

Claims

[Claims]

1. A failure information management method for creating an error code corresponding to a failure when the failure occurs and managing the failure based on the error code and the number of occurrences of the failure, characterized in that an upper limit value and a lower limit value are set as threshold values in a table that manages each error code, and, when the failure occurs, the upper limit value and the lower limit value corresponding to the error code of the failure are obtained from the table, and the failure is judged to have exceeded the allowable range when the number of occurrences of the failure is at least the lower limit value and at most the upper limit value.