JPH02143333A

JPH02143333A - Fault recover device

Info

Publication number: JPH02143333A
Application number: JP63297630A
Authority: JP
Inventors: Takashi Suzuki; 孝鈴木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-11-24
Filing date: 1988-11-24
Publication date: 1990-06-01

Abstract

PURPOSE: To achieve high-speed fault recovery, even for faults that occur frequently in the same information processing device, without degrading the resolution of the fault data, by using the log area cyclically each time a fault occurs. CONSTITUTION: The system comprises an information processing device group 1 consisting of a plurality of information processing devices 11, 12, ..., 1n; a fault recovery device 2 that handles a fault when one occurs in any of the information processing devices 11 to 1n; and a service processor (SVP) 3 that receives log data in response to a request from the fault recovery device 2 and stores it in an auxiliary storage device. A log area is secured each time a fault occurs and is used cyclically. Fault recovery can thus be performed at high speed by raising the utilization of the log area.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a failure processing device that is connected to a plurality of information processing devices and to a service processor. When a failure occurs in an information processing device, the failure processing device collects data at the time of the failure from that information processing device, temporarily stores the failure information in a log area, performs failure processing based on the log data from the time of the failure, and, when failure processing is complete, hands the log data over to the service processor.

[Conventional technology]

Conventionally, when a failure processing device of this kind receives a failure report from an information processing device, it first extracts all of the internal information (log data) from the failed information processing device in sequence and temporarily stores the log data in a log area, designated in advance for each information processing device, in the memory of the failure processing device. After performing failure processing for the failed information processing device based on the log data, it issues a log data collection request to a service processor (SVP) that has a storage device such as a magnetic storage device, and hands the log data over to the SVP.

Failures are of two types, intermittent and fixed (permanent). The failure processing device is able to recover from intermittent failures: if the log data shows that the instruction can be re-executed (instruction retry), it causes an instruction retry to be performed. If the failure is fixed rather than intermittent, however, the failure processing device receives a failure report again and, even though the cause is the same, again extracts all of the internal information of the information processing device before performing processing such as disconnecting that device. Furthermore, when a failure occurs again, even after failure processing and instruction retry have completed, and the handover of log data to the SVP has not yet finished, storing log data in the log area is made to wait until the log area becomes free. Only once the log area is free does the failure processing device start failure processing and collect log data.

[Problems to be Solved by the Invention]

In the conventional failure processing device described above, a log area is fixed for each information processing device, and log data is held until it is collected by the SVP, which has a storage device such as a magnetic storage device. Consequently, if failures occur frequently in the same information processing device before the SVP collects the log data and the designated log area fills up, failure processing for that device must wait until the log area becomes free. Moreover, even when the same information processing device fails repeatedly from the same cause, all log data of the failed device is collected regardless of the nature of the failure, so most of the failure processing time is spent collecting log data and high-speed failure processing cannot be performed.

[Means for Solving the Problems]

The first failure processing device of the present invention comprises: a log area management table that holds the base address of the log data each time log data is stored in the log area; and a log area usage state management unit. The log area usage state management unit obtains from the log area management table the newest and oldest base addresses of the log areas currently in use and determines, from these and from the amount of data to be collected as log data this time, whether a log area can be secured. On determining that a log area can be secured, it stores the base address of the next log area to be secured in the log area management table and instructs the log data collection unit to collect the log data. When the service processor has finished collecting log data, it deletes the base address of the collected log data from the log area management table.

The second failure processing device of the present invention comprises: a log area management table containing the log area base address of the log data stored in the log area, the device code of the information processing device to which the log data corresponds, and a V bit indicating whether the log data is the latest log; a device status management table containing various information for each information processing device, including information indicating whether an instruction retry is in progress; a log data definition table defining the log data to be collected according to the failure occurrence status of the information processing device; a device status management unit that refers to the device status management table to identify whether the information processing device reporting a failure has already failed and is in the middle of an instruction retry; a log area usage state management unit that, when no instruction retry is in progress, refers to the device codes in the log area management table to determine whether log data of the same information processing device still remains in the log area, uses the V bit to determine whether that log data is the latest log data, and updates the log area management table; and a log data collection unit.

The log data collection unit operates as follows. When the failure is determined to have occurred during an instruction retry, it collects from the failed information processing device the data needed for failure cause determination, as defined in the log data definition table, and determines whether the cause is the same as that of the failure before the instruction retry; if the failure has recurred from a cause different from the one before the retry, it collects all remaining log data from the failed device. When no instruction retry is in progress and no log data of the same information processing device remains in the log area, it collects from the failed device all of the log data defined in the log data definition table. When no instruction retry is in progress and log data of the same information processing device remains in the log area, it collects from the failed device the failure cause determination log data defined in the log data definition table, reads out the log data remaining in the log area, compares the two, and determines whether the cause is the same as the previous time. If the failure is not due to the same cause, it collects all remaining log data; if it is due to the same cause, it collects from the failed device, as additional data, the log data needed for instruction retry as defined in the log data definition table.

[Operation]

In the first failure processing device, a log area is secured each time a failure occurs and the log area is used cyclically, so the utilization of the log area rises and failure processing can be performed at high speed.

In the second failure processing device, log data collection is finely controlled according to the state of the information processing device (whether an instruction retry is in progress), whether log data of the same information processing device remains in the log area, and whether the cause of the failure is the same. High-speed failure processing can therefore be performed even for failures that occur frequently in the same information processing device.

[Embodiments]

Next, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a configuration diagram of an information processing system including a first embodiment of the failure processing device of the present invention, FIG. 2 is a diagram showing how the log area 20 is used in this embodiment, FIG. 3 is a flowchart showing the log data collection processing, and FIG. 4 is a flowchart of the log area release processing.

This information processing system consists of an information processing device group 1 made up of a plurality of information processing devices 11, 12, ..., 1n; a failure processing device 2 that performs failure processing when a failure occurs in any of the information processing devices 11 to 1n; and an SVP 3 that collects log data in response to a request from the failure processing device 2 and stores it in an auxiliary storage device.

The failure processing device 2 consists of: a log area 20; a log area management table 21 with a FIFO structure that holds the base address each time log data is stored in the log area 20; a log data collection unit 22 that collects log data from the failed information processing device; a log data storage unit 23 that stores the collected log data in the log area 20; a log area usage state management unit 24; and a failure processing unit 25 that performs failure processing and, when that processing is complete, issues a log data collection request to the SVP 3. The log area usage state management unit 24 obtains from the log area management table 21 the newest and oldest base addresses of the log areas currently in use within the log area 20 and determines, from these and the amount of data to be collected as log data this time, whether a log area can be secured. On determining that a log area can be secured, it stores the base address of the next log area to be secured in the log area management table 21 and instructs the log data collection unit 22 to collect the log data. When the SVP 3 has finished collecting log data, it deletes the base address of the collected log data from the log area management table 21. Next, the operation of this embodiment will be described with reference to FIGS. 2 to 4.

When a failure occurs in any information processing device in the information processing device group 1, the failure processing device 2 receives the report and starts failure processing.

First, the log area usage state management unit 24 obtains from the log area management table 21 the newest and oldest base addresses of the log areas currently in use and determines, from these and the amount of data to be collected as log data this time, whether a log area can be secured (process 30). On determining that a log area can be secured, it stores the base address of the next log area to be secured in the log area management table 21, thereby updating the table (process 31), and instructs the log data collection unit 22 to collect the log data (process 32).
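As a rough illustration of processes 30 to 32, the following Python sketch models the log area management table as a FIFO of base addresses over a circular log area. The class name `LogAreaTable`, the `next_base` field, and the modular free-space arithmetic are assumptions made for this example; the specification states only that the newest and oldest base addresses and the amount of data to be collected are used for the decision.

```python
from collections import deque


class LogAreaTable:
    """Hypothetical model of log area management table 21 (FIFO of base addresses)."""

    def __init__(self, area_size):
        self.area_size = area_size  # total size of the circular log area
        self.bases = deque()        # base addresses not yet collected by the SVP, oldest first
        self.next_base = 0          # base address of the next log area to secure (assumed field)

    def free_bytes(self):
        # With nothing held, the whole area is free.
        if not self.bases:
            return self.area_size
        # Otherwise the free span runs from next_base forward (with wrap) to the oldest base.
        return (self.bases[0] - self.next_base) % self.area_size

    def try_reserve(self, size):
        """Process 30: decide whether an area of `size` fits; process 31: update the table."""
        if size <= 0 or size > self.free_bytes():
            return None  # cannot be secured; the caller has to wait for a release
        base = self.next_base
        self.bases.append(base)
        self.next_base = (base + size) % self.area_size
        return base  # the caller would now instruct log collection (process 32)
```

With a 100-byte area, reserving 60 bytes and then 30 bytes succeeds at base addresses 0 and 60, while a further 20-byte request is refused until the oldest log has been released.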

The log data collection unit 22 then collects log data from the failed device. If, for example, the usable log area begins near the upper limit of the log area and wraps around to the lower limit, then when the log data is stored, data is written up to the upper limit and the subsequent log data is written starting from the lower limit of the log area (area A in FIG. 2 corresponds to this case).
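The wrap-around storage described for area A can be sketched as follows. This is a minimal Python illustration, not the specification's implementation: the function name `store_wrapped` and the use of a `bytearray` to stand in for the log area are assumptions.

```python
def store_wrapped(area: bytearray, base: int, data: bytes) -> None:
    """Write `data` into the circular `area` starting at `base`, wrapping at the upper limit."""
    size = len(area)
    if not 0 <= base < size:
        raise ValueError("base address outside the log area")
    if len(data) > size:
        raise ValueError("log data larger than the log area")
    # Part that fits between the base address and the upper limit.
    first = min(len(data), size - base)
    area[base:base + first] = data[:first]
    # Remainder continues from the lower limit of the log area.
    rest = len(data) - first
    area[0:rest] = data[first:]
```

For an 8-byte area, storing four bytes at base address 6 places the first two bytes at offsets 6 and 7 and the last two at offsets 0 and 1.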

When failure processing is finished, the failure processing unit 25 requests the SVP 3 to collect the log data.

However, even if a failure report arrives from the information processing device group 1 before the SVP 3 has collected the log data, process 30 determines the base address of a new log area within the log area 20 and log collection is carried out through process 31 onward, so failure processing is not kept waiting (area B in FIG. 2).

In this way, even if a failure occurs in an information processing device before the SVP 3 has collected the log data, failure processing can be performed until the log area becomes full. When the SVP 3 finishes collecting log data, the log area usage state management unit 24 confirms which log was collected (process 40), deletes the corresponding log area base address from the log area management table 21, and releases the log area (process 41) in preparation for the next failure.
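The release step (processes 40 and 41) can be pictured with a short Python sketch. Here the table is reduced to a `deque` of base addresses, and the function name `release_collected` is hypothetical; how the SVP 3 identifies the collected log to the failure processing device is not stated in the text, so passing the base address is an assumption.

```python
from collections import deque


def release_collected(table: deque, collected_base: int) -> bool:
    """Delete the base address of the log the SVP collected; return False if it is unknown."""
    # Process 40: confirm which log was collected.
    if collected_base not in table:
        return False
    # Process 41: delete its base address, which frees that part of the log area.
    table.remove(collected_base)
    return True
```

Because the table is a FIFO, the collected log would normally be the oldest entry, so releasing it moves the oldest base address forward and enlarges the space available to the next failure.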

By using the log area cyclically in this way, the utilization of the log area is raised and failure processing can be performed at high speed.

FIG. 5 is a configuration diagram of an information processing system including a second embodiment of the failure processing device of the present invention, FIG. 6 is a format diagram of the log data definition table 56, FIG. 7 is a format diagram of the log area management table 51, and FIG. 8 is a format diagram of the device status management table 55.

This information processing system consists of an information processing device group 1 made up of a plurality of information processing devices 11, 12, ..., 1n; a failure processing device 5 that performs failure processing when a failure occurs in any of the information processing devices 11 to 1n; and an SVP 3 that collects log data in response to a request from the failure processing device 5 and stores it in an auxiliary storage device.

The failure processing device 5 consists of: a log area 50; a log area management table 51 (FIG. 7) containing the area base address of the log data stored in the log area 50, the device code of the information processing device to which the log data corresponds, and a V bit indicating whether the log data is the latest log; a device status management table 55 (FIG. 8) containing various information for each information processing device, including information indicating whether an instruction retry is in progress; a log data definition table 56 (FIG. 6) defining the log data to be collected according to the failure occurrence status of the information processing device; a device status management unit 57 that refers to the device status management table 55 to identify whether the information processing device reporting a failure has already failed and is in the middle of an instruction retry; a log area usage state management unit 54 that, when no instruction retry is in progress, refers to the device codes in the log area management table 51 to determine whether log data of the same information processing device still remains in the log area, uses the V bit to determine whether that log data is the latest log data, and updates the log area management table 51; a log data collection unit 52 that collects log data from the failed information processing device according to whether that device is in the middle of an instruction retry, whether log data of that device remains in the log area, and whether the cause of the failure is the same as the previous time; and a failure processing unit 58 that performs failure processing and, when that processing is complete, issues a log data collection request to the SVP 3.
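One way to picture the log area management table 51 and its V bit is the Python sketch below. The field names, the list-based table, and the rule that exactly one entry per device code carries the V bit are assumptions for illustration; the text says only that the V bit indicates whether the log data is the latest log.

```python
from dataclasses import dataclass


@dataclass
class LogAreaEntry:
    base_address: int  # area base address of the stored log data
    device_code: int   # information processing device the log belongs to
    v_bit: bool        # True if this is the latest log for that device (assumed semantics)


def register_log(table, base_address, device_code):
    """Add an entry and move the V bit of that device to the new entry."""
    for entry in table:
        if entry.device_code == device_code:
            entry.v_bit = False
    table.append(LogAreaEntry(base_address, device_code, True))


def latest_log(table, device_code):
    """Return the entry flagged as latest for the device, or None if no log remains."""
    for entry in table:
        if entry.device_code == device_code and entry.v_bit:
            return entry
    return None
```

A lookup by device code answers the question asked when no instruction retry is in progress: a `None` result means no log data of that device remains in the log area, and otherwise the returned entry is the one to compare against.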

Next, the operation of this embodiment will be described with reference to FIG. 9.

When a failure occurs in any information processing device in the information processing device group 1, the failure processing device 5 receives the report and starts failure processing. First, the device status management unit 57 refers to the device status management table 55 to identify whether the failed device has already failed and is in the middle of an instruction retry (process 61). If no instruction retry is in progress, the log area usage state management unit 54 uses the log area management table 51 to determine whether log data of the same information processing device still remains in the log area 50 of the failure processing device 5 without having been handed over to the SVP 3 (process 62).

If no such log data remains in the log area 50 of the failure processing device 5, the log data collection unit 52 collects from the failed information processing device all of the log information defined in the log data definition table 56 as failure information (process 63). At this time the log area usage state management unit 54 updates the V bit that marks the latest log data. Processing such as instruction retry is then performed according to the nature of the failure (process 71), after which a log data collection request is issued to the SVP 3 and the log data is flushed out (process 73), ending failure processing.

If, on the other hand, process 62 finds that log data of the failed device remains in the log area 50 of the failure processing device 5, the log data collection unit 52 refers to the log data definition table 56 and collects the failure determination log data from the failed information processing device (process 64). The log area usage state management unit 54 then refers to the V bits in the log area management table 51 to select the latest of the log data already collected, compares the failure determination information, and recognizes whether the failure is one that has already occurred (process 65). If the failure is not due to the same cause, the remaining log data is collected (process 66). At this time the log area usage state management unit 54 updates the V bit that marks the latest log data. Processing such as instruction retry is then performed according to the nature of the failure (process 71). Conversely, if the cause is the same, the log data collection unit 52 collects, as additional data, the data needed for the failure processing in process 71 (process 67) and passes control to process 71.

If process 61 determines that the failure occurred during an instruction retry, the log data collection unit 52 collects from the failed information processing device the data needed for failure determination (process 68) and determines whether the cause is the same as that of the failure before the instruction retry (process 69).

If the failure has recurred from a cause different from the one before the instruction retry, the remaining log data is collected (process 70). At this time the log area usage state management unit 54 updates the V bit that marks the latest log data. Failure processing is then performed using the collected log data (process 72), the SVP 3 is instructed to flush out the log data (process 73), and processing is complete. If the cause is the same, control passes directly to process 72.
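The branching of processes 61 to 70 can be summarized in a small Python sketch. The enum `Step`, the function `plan_collection`, and the idea of passing `same_cause` in as an argument are simplifications assumed for this example; in the described flow the same-cause judgment is only available after the cause determination data has been collected and compared.

```python
from enum import Enum


class Step(Enum):
    CAUSE_DATA = "failure cause determination data"
    ALL_DATA = "all log data defined in the definition table"
    REMAINING_DATA = "remaining log data"
    RETRY_DATA = "additional data needed for instruction retry"


def plan_collection(retrying, residual_log, same_cause=False):
    """Return the collection steps for one failure report (hypothetical helper)."""
    if retrying:
        # Processes 68 and 69: collect cause data and compare with the pre-retry failure.
        steps = [Step.CAUSE_DATA]
        if not same_cause:
            steps.append(Step.REMAINING_DATA)  # process 70
        return steps
    if not residual_log:
        # Process 63: nothing remains for this device, so collect everything.
        return [Step.ALL_DATA]
    # Processes 64 and 65: collect cause data and compare with the latest residual log.
    steps = [Step.CAUSE_DATA]
    # Process 67 for the same cause, process 66 otherwise.
    steps.append(Step.RETRY_DATA if same_cause else Step.REMAINING_DATA)
    return steps
```

The sketch makes the saving visible: a full collection happens only when the device has no residual log and is not retrying, while a repeat failure from the same cause needs only the cause determination data plus, outside a retry, the data for the retry itself.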

[Effect of the Invention]

As described above, by using the log area cyclically each time a failure occurs, the present invention raises the utilization of the log area and allows failure processing to be performed at high speed. In addition, by finely controlling log data collection according to the device state, the presence or absence of residual log data in the log area, and the cause of the failure, it allows high-speed failure processing even for failures that occur frequently in the same information processing device, without reducing the resolution of the failure data.

[Brief explanation of the drawing]

FIG. 1 is a configuration diagram of an information processing system including a first embodiment of the failure processing device of the present invention, FIG. 2 is a diagram showing how the log area 20 is used in this embodiment, FIG. 3 is a flowchart showing the log data collection processing, FIG. 4 is a flowchart of the log area release processing, FIG. 5 is a configuration diagram of an information processing system including a second embodiment of the failure processing device of the present invention, FIG. 6 is a format diagram of the log data definition table 56, FIG. 7 is a format diagram of the log area management table 51, FIG. 8 is a format diagram of the device status management table 55, and FIG. 9 is a flowchart showing the operation of the second embodiment.

1: information processing device group; 11 to 1n: information processing devices; 2, 5: failure processing devices; 20, 50: log areas; 21, 51: log area management tables; 22, 52: log data collection units; 23, 53: log data storage units; 24, 54: log area usage state management units; 25, 58: failure processing units; 55: device status management table; 56: log data definition table; 57: device status management unit; 3: SVP.

Claims

[Claims]

1. A failure processing device that is connected to a plurality of information processing devices and a service processor and that, when a failure occurs in an information processing device, collects data at the time of the failure from that information processing device, temporarily stores the failure information in a log area, performs failure processing based on the log data from the time of the failure, and, when failure processing is complete, issues a log data collection request to the service processor and hands the log data over to the service processor, the failure processing device being characterized by comprising: a log area management table that holds the base address each time log data is stored in the log area; and a log area usage state management unit that obtains from the log area management table the newest and oldest base addresses of the log areas currently in use, determines from these and the amount of data to be collected as log data this time whether a log area can be secured, on determining that a log area can be secured stores the base address of the next log area to be secured in the log area management table and instructs the log data collection unit to collect the log data, and, when the service processor has finished collecting log data, deletes the base address of the collected log data from the log area management table.

2. A failure processing device that is connected to a plurality of information processing devices and a service processor and that, when a failure occurs in an information processing device, collects data at the time of the failure from that information processing device, temporarily stores the failure information in a log area, performs failure processing based on the log data from the time of the failure, and, when failure processing is complete, issues a log data collection request to the service processor and hands the log data over to the service processor, the failure processing device being characterized by comprising: a log area management table containing the log area base address of the log data stored in the log area, the device code of the information processing device to which the log data corresponds, and a V bit indicating whether the log data is the latest log; a device status management table containing various information for each information processing device, including information indicating whether an instruction retry is in progress; a log data definition table defining the log data to be collected according to the failure occurrence status of the information processing device; a device status management unit that refers to the device status management table to identify whether the information processing device reporting a failure has already failed and is in the middle of an instruction retry; a log area usage state management unit that, when no instruction retry is in progress, refers to the device codes in the log area management table to determine whether log data of the same information processing device still remains in the log area, uses the V bit to determine whether that log data is the latest log data, and updates the log area management table; and a log data collection unit that, when the failure is determined to have occurred during an instruction retry, collects from the failed information processing device the data needed for failure cause determination defined in the log data definition table, determines whether the cause is the same as that of the failure before the instruction retry, and, if the failure has recurred from a cause different from the one before the instruction retry, collects all remaining log data from the failed information processing device; that, when no instruction retry is in progress and no log data of the same information processing device remains in the log area, collects from the failed information processing device all of the log data defined in the log data definition table; and that, when no instruction retry is in progress and log data of the same information processing device remains in the log area, collects from the failed information processing device the failure cause determination log data defined in the log data definition table, reads out the log data remaining in the log area, compares the two, determines whether the cause is the same as the previous time, collects all remaining log data if the failure is not due to the same cause, and, if the failure is due to the same cause, collects from the failed information processing device, as additional data, the log data needed for instruction retry defined in the log data definition table.