JPH03150643A

JPH03150643A - Fault monitoring device and control method for information processing system

Info

Publication number: JPH03150643A
Application number: JP1288917A
Authority: JP
Inventors: Toshio Hirozawa; 廣澤　敏夫; Junichi Kurihara; 潤一栗原; Ikuo Kimura; 木村　伊九夫; Hideki Nanba; 難波　秀企
Original assignee: Hitachi Ltd; Hitachi Electronics Services Co Ltd
Current assignee: Hitachi Ltd; Hitachi Electronics Services Co Ltd
Priority date: 1989-11-08
Filing date: 1989-11-08
Publication date: 1991-06-27
Anticipated expiration: 2013-09-24
Also published as: JP2804125B2; US5237677A

Abstract

PURPOSE: To monitor a computer system for faults and recover it from a remote location by attaching a monitor controller to the computer system to be controlled. CONSTITUTION: The monitor controller relays the data line over which the monitored information processing system sends message data to a conventional master console 102, and monitors that message data. To detect hardware faults, a dedicated interface line is provided between the central processing unit 200 and the monitor controller, which uses it to detect the occurrence of a fault and to gather fault information. Consequently, the conventional operating system does not need to be changed, and malfunction is avoided. Fault monitoring and recovery of the information processing system, that is, the computer system, can thus be performed from a remote location.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a fault monitoring device for an information processing system and a control method for it, and in particular to a control scheme suited to performing initial diagnosis of the information processing system from a remote location when a fault occurs.

[Conventional technology]

As the range of applications of information processing systems, that is, electronic computer systems, expands, system configurations keep growing larger and more complex. Accordingly, improving the reliability and fault tolerance of information processing systems, and recovering quickly after a fault occurs, are becoming increasingly important.

In recent information processing systems, it has become common to attach a control device for maintenance control to the main body of the information processing system and have it handle maintenance and diagnosis of the system. This type of control device is called a service processor (SVP) and is disclosed in U.S. Patent No. 4,204,249. JP-A-58-56158 discloses a control scheme for maintaining and diagnosing a plurality of user computer systems from a computer system at a remote maintenance center. Furthermore, JP-A-61-148542 discloses a control scheme that allows the SVP screen to be operated from a remote location as well.

In the technology disclosed in U.S. Patent No. 4,204,249, the SVP takes charge of power on/off control and microprogram loading control for a group of processing devices, which enables centralized management. In particular, because a dedicated signal line is wired directly from the control device to each processing device, rewiring is not required, unlike the conventional parallel wiring scheme.

In the technology disclosed in JP-A-58-56158, the computer system at the maintenance center continuously diagnoses a plurality of user computer systems in turn, in order to detect the occurrence of a fault in advance. JP-A-61-148542 discloses a scheme for operating the SVP screen from a remote location as well: so that the screen control program on the SVP side and the screen control program on the remote maintenance side can share the same structure and processing procedure, a data buffer is provided on the SVP side and the contents of this data buffer are transferred.

[Problem to be solved by the invention]

With the spread of 24-hour operation services and the expansion of application areas, control means for early recovery after a fault occurs are becoming important, alongside the various technologies being developed to improve the reliability and fault tolerance of information processing systems. For early recovery it would be enough to station maintenance personnel permanently at the site of the user computer system. However, as 24-hour operation services spread, unattended operation is becoming common, and maintenance personnel also need to be able to stand by at a maintenance center and carry out fault monitoring and maintenance for a plurality of user computer systems. The challenge in that case is to strengthen the functions for detecting, from a remote location, that a fault has occurred in a user computer system, and to provide control means for early recovery.

From this viewpoint, among the prior art, the technology described in U.S. Patent No. 4,204,249 makes it easy to change power supply devices and wiring as computer system configurations become more complex; specifically, it allows the SVP to turn power on and off and to adjust voltages. However, it does not disclose methods for fault monitoring or maintenance from a remote location. It is also subject to the restriction that the processing devices of the information processing system must be operated from the local SVP.

In the technology described in JP-A-58-56158, the computer system at the maintenance center communicates with the SVP of each local user computer system (the processing device to be diagnosed) and monitors the status of the user computer systems by polling them in turn. By monitoring the user computer systems in this way, it aims to improve the operating status of the maintenance center's computer system, and by cataloging the monitoring procedures, it aims to automate diagnosis. However, it does not specifically disclose the means of fault detection, the specific items of logging information, or the criteria for judging a fault.

The technology described in JP-A-61-148542, on the other hand, provides a data buffer in the SVP of the local computer system so that the same screen as the SVP screen of the local computer system can be shown on a remote display device. The contents of this data buffer are displayed both on the SVP screen of the local computer system and on the remote display device, which simplifies the logical structure of the processing program. This allows the SVP of the local computer system to be operated from a display device at a remote maintenance center as well. However, the SVP supports maintenance operations; it can detect hardware faults, but it generally cannot detect malfunctions of software, that is, of the operating system (OS).

Moreover, while the OS is running, monitoring of the console messages for the OS is normally the main means of fault detection. From this viewpoint, the technology described in JP-A-61-148542 above does not mention what triggers switching between the remote and local SVP screens, how console messages are detected, how a fault is reported when it occurs, or which items of fault information are collected from the SVP screen and how.

To detect faults in user computer systems and achieve early recovery after a fault from a remote maintenance center or the like, a mechanism is needed that can immediately collect, at the remote location, the course of the OS's behavior; how to realize such a mechanism remains an open problem. Normally, the course of the OS's behavior can be followed by tracing the messages output on the OS console. However, the hard copy device that outputs the console messages is either located near the local user computer system or, even when it is nearby, is often powered off because the system runs unattended. The power is switched off to avoid running out of paper or paper jams.

In addition, when a fault occurs, it must be possible to refer to specific areas of the main storage from the remote location.

These are the areas that hold OS management information and the areas used by the hardware.

Naturally, during unattended operation there are no operators or maintenance personnel at the user computer system. Control means for detecting the occurrence of a fault are therefore desired. Furthermore, once the occurrence of a fault has been recognized at the remote location, an initial analysis of its cause must be completed before maintenance personnel arrive at the site. Providing these control means contributes to early recovery of the computer system after a fault.

Accordingly, an object of the present invention is to provide a control device and control means that make it possible to monitor an information processing system, that is, a computer system, for faults and to recover it from a remote location.

Another object of the present invention is to provide a control device and control means that, once a fault in the information processing system has been reported to a remote location, can collect fault information about the system from that remote location: specifically, the state of the system at the time of the fault and the operation history leading up to it.

A further object of the present invention is to provide control means that analyze the collected fault information and can present an appropriate recovery procedure to maintenance personnel when they arrive at the site.

[Means to solve the problem]

To achieve the above objects, in the fault monitoring device and control method for an information processing system according to the present invention, the fault monitoring device is interposed between the information processing system to be monitored and controlled (also called a computer system) and its master console device, and monitors the data stream on the signal line that carries data to and from the information processing system. In addition, a dedicated interface line connects the device to the central processing unit, giving it a mechanism for receiving hardware fault reports from the central processing unit and for reading specific hardware information inside the central processing unit. The effect is the same if a keyboard and display device are attached to the control device of the present invention in place of the conventional master console. The embodiment of the present invention is described using this configuration, in which a keyboard and display device are attached to the control device in place of the conventional master console.

At a remote location there is a second information processing system that monitors and controls the information processing systems described above.

When this second information processing system is notified of a fault by the control device, it searches and collates precedent information (records of past fault cases) on the basis of the fault information, automatically generates a recovery procedure for the fault, and then transfers that recovery procedure to the site of the information processing system in which the fault occurred.

The supervisory control device that implements the fault monitoring device and control method of the present invention receives the message data sent from the information processing system being monitored and controlled, stores it in a data buffer, and then displays it on a display device. At the same time, it compares the message data with fault messages registered in advance and thereby recognizes that a fault has occurred. A hardware fault in the information processing system (hereinafter sometimes called the computer system) is reported over the dedicated interface line. Hardware faults include machine checks, memory errors, and processing unit faults. In either case, the software fault mentioned first or a hardware fault, the supervisory control device reads information from specific processing units in the central processing unit over the dedicated interface line and stores it temporarily in a storage area.

On detecting a fault condition as described above, the supervisory control device of the present invention reports the fault to the second information processing system at the remote location. At this point it transfers the message data stored in the data buffer, that is, the messages going back in time from the moment the fault occurred, together with the hardware fault information at the time of the fault. In this way, the second information processing system for remote monitoring and maintenance (hereinafter called the monitoring/maintenance computer system) recognizes that a fault has occurred in the computer system being monitored and controlled.

On receiving this notification and the fault information, the monitoring/maintenance computer system collates the fault information against the precedent information it has stored in advance and generates the most suitable recovery procedure. The result is transferred to the supervisory control device of the computer system in which the fault occurred. Therefore, when maintenance personnel arrive at the site of that computer system and enter a "recovery instruction" command from the console device, the sequence of recovery steps is output to the display screen of the console device or to the hard copy device.

This shortens the time from the occurrence of a fault in the computer system to its recovery, and so improves the operational service of the computer system.

[Operation]

The fault monitoring device and control method for an information processing system according to the present invention relay the data line over which the monitored information processing system sends message data to the conventional master console, and monitor that message data. For hardware faults as well, a dedicated interface line to the central processing unit is provided and used to detect the occurrence of a fault and to collect fault information.

Consequently, the existing operating system does not need to be modified, and no malfunction results.

[Example]

An embodiment of the present invention is described below with reference to FIGS. 1 to 22. FIG. 1 is a schematic diagram of the configuration of the fault monitoring device and control method for an information processing system according to the present invention. Reference numeral 200 denotes the computer system to be monitored, on which an operating system (OS) 208a and a user program (UP) 208b run. The main hardware units of the computer system 200 are an instruction unit (IU) 201, an execution unit (EU) 202, an input/output processing unit (IOP) 203, a memory control unit (MCU) 204, a main storage (MS) 206, and a service processor (SVP) 207. Input/output devices such as a file device 209 are also connected to the computer system 200.

Reference numeral 100 denotes the supervisory control device, which is one component of the present invention. A signal line L1a for the system console and a display signal line L1b of the SVP 207 run from the computer system 200 to the supervisory control device 100. The system console signal line L1a and the SVP 207 display signal line L1b may be one and the same signal line; in that case, the signal line L1a is output from the IOP 203 via the SVP 207.

The supervisory control device 100 and the computer system 200 are further connected by a request signal line L2 from each processing unit and by a signal line L3 comprising an address bus and a data bus. The address bus and the data bus are drawn as the single signal line L3 only for convenience of illustration; they are, of course, connected by separate signal lines.

The supervisory control device 100 consists of an arithmetic unit with processing capability comparable to a microprocessor, a main memory, and a group of processing programs. Inside the supervisory control device 100, reference numeral 1 denotes an encoder, 2 an address register AD, 3 a data register DT, 4 a CPU interface processing unit, 5 a comparison processing unit CMP, 6 a data buffer, 7 a comparison table, 8 a diagnosis processing unit, 9 a screen buffer, 10 a fault determination processing unit, 11 a distributor, 12 a transfer processing unit, 13 a command interpretation processing unit, and 14 a transmission/reception processing unit. A display device 102 with a keyboard and a temporary storage file 104 are connected to the supervisory control device 100.

The supervisory control device 100 and the remote monitoring/maintenance computer system 250 are connected by a line L4.

A public network may lie between the supervisory control device 100 and the remote monitoring/maintenance computer system 250. Connected to the monitoring/maintenance computer system 250 are a console device 252 for controlling the computer system 250, a file device 254 that stores precedent information, and a fault information storage file 256. Although not shown in FIG. 1, other input/output devices that can be attached to a computer system, such as line printers, can of course also be connected.

A group of programs for analyzing fault causes runs in the monitoring/maintenance computer system 250: a communication processing unit 21, an interpretation/command processing unit 22, a collection/analysis processing unit 23, a collation processing unit 25, and a precedent search/registration processing unit 26. Reference numeral 24 denotes a work buffer (BUF).

The monitoring/maintenance computer system 250 can manage a plurality of computer systems to be monitored and controlled. Reference numerals 100a and 100b denote other supervisory control devices similar to 100, and 200a and 200b denote computer systems to be monitored.

An outline of the operation of the fault monitoring device and control method of the present invention is first given with reference to FIG. 1; the details of each processing unit are then described with reference to FIG. 2 and subsequent figures.

Referring to FIG. 1, the message data of the OS 208a is sent out sequentially over the line L1a from the computer system 200 being monitored and controlled.

This message data is stored in order in the data buffer 6. When the data buffer 6 becomes full, storage starts again from the beginning. Before each item of message data is stored in the data buffer 6, the comparison processing unit CMP 5 checks whether it is equal to any of the fault-determination message data registered in advance in the comparison table 7. If the comparison finds a match, control passes to the fault determination processing unit 10.
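
The buffering and comparison steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `MessageMonitor`, the buffer capacity, and the sample message strings are all hypothetical, and `deque(maxlen=...)` merely stands in for the wrap-around storage of the data buffer 6 (it retains the same most recent messages but does not model the physical write position).

```python
from collections import deque


class MessageMonitor:
    """Hypothetical model of comparison unit CMP 5, data buffer 6 and comparison table 7."""

    def __init__(self, capacity, fault_messages):
        # A bounded deque drops its oldest entry once it is full.
        self.data_buffer = deque(maxlen=capacity)
        # Fault-determination messages registered in advance.
        self.comparison_table = set(fault_messages)

    def receive(self, message):
        """Compare one console message, then store it. Returns True on a match."""
        matched = message in self.comparison_table
        self.data_buffer.append(message)
        return matched

    def history(self):
        """Messages currently held, oldest first."""
        return list(self.data_buffer)


# Illustrative message texts only; real console messages are not given in the source.
monitor = MessageMonitor(capacity=3, fault_messages={"SYSTEM WAIT STATE"})
flags = [
    monitor.receive(m)
    for m in ["JOB A STARTED", "JOB A ENDED", "JOB B STARTED", "SYSTEM WAIT STATE"]
]
```

A `True` result from `receive` corresponds to the point where control would pass to the fault determination processing unit 10, and `history()` is the message trail that remains available for later transfer.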

In the case of a software fault, the fault determination processing unit 10 starts the diagnosis processing unit 8 to obtain additional hardware information, and collects the status information of each processing unit in the central processing unit 200. To collect the status information, the address of the processing unit is set in the address register AD 2 and a collection item number in the data register DT 3, and these are sent to each processing unit in the central processing unit 200. Each processing unit returns the information corresponding to the collection item number on the data bus L3. The collected status information is stored for the time being in the temporary storage file 104. The hardware status information includes the contents of the hardware-used area 206b in the main storage 206 and the state information held in each processing unit, such as interrupt holding registers and program status words.
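
The collection loop described above might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the source does not give unit addresses, item numbers, or data formats, so the sketch reuses the reference numerals 201 and 202 as addresses and invents the item numbers and values.

```python
class ProcessingUnit:
    """Hypothetical stand-in for a processing unit such as IU 201 or EU 202."""

    def __init__(self, state):
        self.state = dict(state)  # collection item number -> held value

    def respond(self, item_number):
        # Return the information for the requested item (None if the unit has none).
        return self.state.get(item_number)


def collect_status(units, item_numbers):
    """Poll every unit for every item and return the collected records."""
    temporary_file = []  # stands in for temporary storage file 104
    for unit_address in sorted(units):
        for item_number in item_numbers:
            address_register = unit_address  # value placed in address register AD 2
            data_register = item_number      # value placed in data register DT 3
            reply = units[address_register].respond(data_register)  # reply on bus L3
            temporary_file.append((address_register, data_register, reply))
    return temporary_file


# Assumed addresses, item numbers and values; none of these appear in the source.
units = {
    201: ProcessingUnit({1: "interrupt-pending=0x04", 2: "psw=0x070C0000"}),
    202: ProcessingUnit({1: "interrupt-pending=0x00"}),
}
status = collect_status(units, item_numbers=[1, 2])
```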

A hardware fault, on the other hand, is generally detected by the service processor SVP 207 and reported to the encoder 1 over the line L2. Faults that the service processor SVP 207 cannot detect, such as parity errors in latch circuits, are reported to the encoder 1 directly from each processing unit over the line L2. On receiving a hardware fault report, the CPU interface processing unit 4 starts the diagnosis processing unit 8, collects the status information of each processing unit in the central processing unit 200, and then passes control to the fault determination processing unit 10. In the case of a hardware fault, the history of message data issued by the OS is already held in the data buffer 6, so no further processing is needed for it.

When the above processing is complete, the fault determination processing unit 10 notifies the monitoring/maintenance computer system 250 that a fault has occurred, via the transmission/reception processing unit 14 and the line L4. This notification may also be issued before the fault information described above is collected.

On receiving the fault notification, the monitoring/maintenance computer system 250 first examines the outline of the fault and then requests detailed fault information from the supervisory control device 100. It does so by sending a command data string to the supervisory control device 100 over the line L4. The command is issued by the interpretation/command processing unit 22.

In the supervisory control device 100, the command interpretation processing unit 13 interprets this command and starts the transfer processing unit 12. The transfer processing unit 12 transfers the contents of the data buffer 6 and the fault information stored in the temporary storage file 104. The transfer processing unit 12 transfers only the information that was requested; for the data of the hardware-used area 206b in particular, it applies an editing step to reduce the amount of data transferred. One example of such editing is to replace a run of identical data with a marker indicating the repetition.
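
The editing example above, replacing a run of identical data with a marker, is essentially run-length encoding. The sketch below shows one possible form in Python. The record layout (`"REPEAT"` and `"DATA"` tuples), the function names, and the `min_run` threshold are assumptions chosen for illustration; the source does not specify the marker format.

```python
def compress_runs(words, min_run=2):
    """Replace each run of identical words with one ("REPEAT", word, count) record."""
    records = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and words[j] == words[i]:
            j += 1
        run_length = j - i
        if run_length >= min_run:
            records.append(("REPEAT", words[i], run_length))
        else:
            # Runs shorter than the threshold are kept as literal words.
            records.extend(("DATA", words[i]) for _ in range(run_length))
        i = j
    return records


def expand_runs(records):
    """Inverse of compress_runs, as the receiving side would apply it."""
    words = []
    for record in records:
        if record[0] == "REPEAT":
            words.extend([record[1]] * record[2])
        else:
            words.append(record[1])
    return words


area_dump = [0, 0, 0, 0, 5, 7, 7]  # illustrative contents only
packed = compress_runs(area_dump)
```

Memory areas that are largely unused tend to contain long runs of the same value, which is where a scheme of this kind saves the most on a slow line.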

The interpretation/command processing unit 22 issues the above command, that is, the fault information transfer command, automatically when it receives a fault notification from the supervisory control device 100, but the corresponding command can also be entered manually from the console device 252. In that case, the entered command is interpreted by the interpretation/command processing unit 22 and passed via the communication processing unit 21 to the command interpretation processing unit 13 in the supervisory control device 100. On the monitoring/maintenance computer system 250 side, the detailed fault information that arrives is stored for the time being in the fault information storage file 256 by way of the BUF 24. This series of steps is performed by the collection/analysis processing unit 23.
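
The automatic and manual command paths described above can be pictured with the following Python sketch. It is only a rough model: the command names `SEND_MESSAGES` and `SEND_STATUS`, both class names, and the choice to request both items automatically are hypothetical, since the source does not define the command set or its encoding.

```python
class SupervisoryDevice:
    """Hypothetical stand-in for supervisory control device 100."""

    def __init__(self, data_buffer, temporary_file):
        self.sources = {
            "SEND_MESSAGES": data_buffer,   # contents of data buffer 6
            "SEND_STATUS": temporary_file,  # contents of temporary storage file 104
        }

    def interpret(self, command):
        # Role of command interpretation unit 13: return only what was requested.
        if command not in self.sources:
            raise ValueError(f"unknown command: {command}")
        return list(self.sources[command])


class MaintenanceSystem:
    """Hypothetical stand-in for monitoring/maintenance computer system 250."""

    def __init__(self):
        self.fault_information_file = {}  # stands in for storage file 256

    def issue(self, device, command):
        # Shared path for automatic and manually entered commands.
        self.fault_information_file[command] = device.interpret(command)

    def on_fault_notification(self, device):
        # Automatic case: request the detailed information when a fault is reported.
        for command in ("SEND_MESSAGES", "SEND_STATUS"):
            self.issue(device, command)


device = SupervisoryDevice(["JOB B STARTED", "SYSTEM WAIT STATE"], ["psw=0x070C0000"])
automatic = MaintenanceSystem()
automatic.on_fault_notification(device)
manual = MaintenanceSystem()
manual.issue(device, "SEND_STATUS")  # as if typed at console device 252
```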

The collection/analysis processing unit 23 then starts the precedent search/registration processing unit 26, which searches the precedent storage file 254 for past fault cases similar to the fault. The collation processing unit 25 then checks the retrieved candidates for a precedent that matches the fault.

If collation finds a precedent that matches the fault, the recovery instruction generation processing unit 27 obtains the recovery procedure stored in the matching precedent. The recovery procedure is transferred via the communication processing unit 21 to the supervisory control device 100 and stored in the temporary storage file 104.

When the maintenance engineer arrives at the site of the failed computer system 200 and enters a "recovery instruction" command from the console device 102, the recovery procedure is output to the display screen of the console device 102 or to a hard copy device (not shown).

Because the analysis of the fault cause and the generation of the recovery procedure proceed in parallel during the time between the occurrence of the fault and the arrival of the maintenance engineer on site, the recovery time is shortened.

If collation finds no precedent that matches the fault, information to that effect is transferred to the supervisory control device 100, so when the maintenance engineer enters the "recovery instruction" command it becomes clear that the analysis needed for recovery must be carried out by the engineer. In that case, the contents of the data buffer 6 and the temporary storage file 104 can be output to the console device 102 or to the hard copy device. The console device 102 also provides the console function of the service processor SVP 207. Once the computer system has subsequently been recovered, the procedure used is transferred to the monitoring/maintenance computer system 250.

On the monitoring/maintenance computer system 250 side, the earlier fault details and this recovery procedure are stored as a pair in the precedent storage file 254. The stored pair will be useful later if a similar fault occurs again at another site or at the same site.
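The search, collation and registration flow described above can be sketched as follows. This is a toy model in Python, not the structure of the precedent storage file 254 (which is shown in FIG. 10): the class name, the tuple layout, and both matching criteria (same fault type for the search, identical summary text for the collation) are assumptions made purely for illustration.

```python
class PrecedentFile:
    """Toy stand-in for the precedent storage file 254."""

    def __init__(self):
        self._cases = []  # list of (fault_type, overview, recovery_steps)

    def register(self, fault_type, overview, recovery_steps):
        # Store the fault details and the recovery procedure as a pair.
        self._cases.append((fault_type, overview, list(recovery_steps)))

    def search(self, fault_type):
        # Assumed similarity criterion: same fault type.
        return [case for case in self._cases if case[0] == fault_type]


def recovery_for(precedents, fault_type, overview):
    """Search for candidates, collate them, and return a recovery procedure or None."""
    for _, past_overview, steps in precedents.search(fault_type):
        if past_overview == overview:  # assumed collation criterion: exact match
            return steps
    return None
```

A return value of `None` corresponds to the case in which no matching precedent exists and the engineer must analyze the fault on site.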

The fault monitoring device for an information processing system and its control method according to the present invention are now described in detail with reference to FIG. 2 and the subsequent figures.

FIG. 2 shows the configuration of the data buffer 6 shown in FIG. 1. FIG. 3 shows the configuration of the comparison table 7. FIG. 4 shows the correspondence between the diagnosis numbers used by the diagnostic processing unit 8 and the values set in the address register AD2 and the data register DT3. FIG. 5 shows the data format used when fault information is stored in the temporary storage file 104 and when fault information is transferred to the monitoring/maintenance computer system 250. FIG. 6 shows the data format used when summary information on a fault is transferred from the supervisory control device 100 to the monitoring/maintenance computer system 250.

FIG. 7 shows the data format used when a recovery procedure is transferred from the monitoring/maintenance computer system 250 to the supervisory control device 100, or when, after the maintenance engineer has completed recovery work on site, the recovery procedure is transferred from the supervisory control device 100 to the monitoring/maintenance computer system 250. FIG. 8 shows the format of commands from the console devices 102 and 252 and of commands from the monitoring/maintenance computer system 250 to the supervisory control device 100.

FIG. 9 shows one example of the processing that reduces the amount of data to be transferred when fault information is transferred from the supervisory control device 100 to the monitoring/maintenance computer system 250. FIG. 10 shows the configuration of the precedent storage file 254.

FIGS. 11 to 20 show the processing flows of the individual processing units, and FIGS. 21 and 22 illustrate the automatic recovery operation used when no intervention by maintenance personnel is required. FIG. 11 is the processing flow of the comparison processing unit CMP 5 of FIG. 1. FIG. 12 is the processing flow of the fault determination processing unit 10. FIG. 13 is the processing flow of the CPU interface processing unit CPUI 4. FIGS. 14(a) and 14(b) are the processing flow of the diagnostic processing unit 8. FIGS. 15(a) and 15(b) are the processing flow of the command interpretation processing unit 13 in the supervisory control device 100. FIGS. 16(a) and 16(b) are the processing flow of the collection/analysis processing unit 23 in the monitoring/maintenance computer system 250. FIG. 17 is the processing flow of the search processing of the precedent search/registration processing unit 26, and FIG. 18 is the processing flow of its registration processing. FIGS. 19(a) and 19(b) are the processing flow of the collation processing unit 25. FIG. 20 is the processing flow of the recovery instruction generation processing unit 27. FIG. 21 shows an example in which the recovery procedure 19c of the fault recovery information 19 shown in FIG. 7 contains a sequence of commands that needs no intervention by maintenance personnel, and FIG. 22 shows the processing flow in that case.

Referring to FIG. 2, the data buffer 6 consists of a message/command storage area TRACE 6a and a management table 6b. The message/command storage area TRACE 6a consists of a field 6c that stores the time at which the message data or command data occurred, an identifier field 6d for the message data, a number field 6f for the message data, and a detailed information field 6g. The management table 6b consists of a first area pointer (FIRST.E) 6h, a last area pointer (LAST.E) 6i, a current storage area pointer (CUR.E) 6j, and a next storage area pointer (NEXT.E) 6k for the message/command storage area TRACE 6a.

FIG. 3 shows the configuration of the comparison table 7. The comparison table consists of a registration count (N) 7a and, for each message to be detected, a message identifier field 7b, a message number field 7c, and a disposition flag field 7d. The processing flow of the comparison processing unit (CMP) 5 is shown in FIG. 11. Message data output from the computer system 200 to be monitored and controlled (hereinafter sometimes simply called messages) is processed by this comparison processing unit (CMP) 5.

Command data entered from the console device 102 is first passed to the computer system 200 and is then sent out again from the computer system 200 as message data.

Referring to FIGS. 1, 2, 3 and 11, when messages from the computer system 200, that is, messages of the OS 208a, arrive one after another via the line L1a, processing step 31a of FIG. 11 stores the message data in the data buffer 6. The data is stored in the entry indicated by the next storage area pointer (NEXT.E) 6k shown in FIG. 2. On storing, the time at which the message occurred is first set in the time field 6c, and the message data is then stored in the message identifier field 6d, the message number field 6f, and the detailed information field 6g. The stored message data is also stored in the screen buffer 9 via the distributor 11 and is consequently displayed on the display screen of the console device 102.

In processing step 31b, the values of the next storage area pointer (NEXT.E) 6k and the current storage area pointer (CUR.E) 6j are each incremented by 1.

Processing steps 31c to 31f check whether these pointer values have exceeded the value of the last area pointer (LAST.E) 6i. A pointer that has exceeded it is replaced with the value of the first area pointer (FIRST.E) 6h. The message/command storage area TRACE 6a of the data buffer 6 can therefore hold n items of message data, and the entries from the one immediately before the current storage area pointer (CUR.E) 6j, going backward, to the entry indicated by the next storage area pointer (NEXT.E) 6k form the history of past message data. In the case of FIG. 2, entries (2), (1), (n), (n-1), ..., (5), (4) form the history of past message data.
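The pointer handling of steps 31a to 31f amounts to a circular buffer. The following Python sketch models it under stated assumptions: entries are indexed from 0, unwritten entries are represented by `None`, and CUR.E initially equals LAST.E so that the first increment wraps it to FIRST.E. None of these details is given in the description, and the class and method names are hypothetical.

```python
class TraceBuffer:
    """Circular message trace modeled on the TRACE area and its management table."""

    def __init__(self, n):
        self.entries = [None] * n  # TRACE area with n entries
        self.first = 0             # FIRST.E: index of the first entry
        self.last = n - 1          # LAST.E: index of the last entry
        self.next = self.first     # NEXT.E: where the next message is stored
        self.cur = self.last       # CUR.E: assumed start value (wraps to FIRST.E)

    def store(self, message):
        # Step 31a: store the message in the entry indicated by NEXT.E.
        self.entries[self.next] = message
        # Step 31b: advance both pointers by one.
        self.next += 1
        self.cur += 1
        # Steps 31c-31f: a pointer that has passed LAST.E wraps to FIRST.E.
        if self.next > self.last:
            self.next = self.first
        if self.cur > self.last:
            self.cur = self.first

    def current(self):
        """The message most recently stored (the entry indicated by CUR.E)."""
        return self.entries[self.cur]

    def history(self):
        """Past messages, newest first: from the entry before CUR.E back to NEXT.E."""
        result = []
        index = self.cur
        while index != self.next:
            index = self.last if index == self.first else index - 1
            if self.entries[index] is not None:  # skip entries never written
                result.append(self.entries[index])
        return result
```

With five entries and seven stored messages, `current()` yields the seventh message and `history()` yields the sixth back to the third, which mirrors the ordering (2), (1), (n), ..., (4) given for FIG. 2.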

In processing step 31g, the message identifier 6d and message number 6f of the message data just stored in the data buffer 6, that is, the message data of the entry indicated by the current storage area pointer (CUR.E) 6j, are compared with the message identifier 7b and number 7c in the comparison table 7 shown in FIG. 3. The comparison is repeated as many times as the registration count (N) 7a. If a message equal to the message data of the entry indicated by the current storage area pointer (CUR.E) 6j is registered in the comparison table 7, determination step 31h detects this, and the message data and the value of the disposition flag field 7d are passed to the fault determination processing unit 10 (processing steps 31i and 31k). If the comparison table 7 contains no matching message data, control returns to processing step 31a to receive the next message data.
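The table lookup of steps 31g and 31h can be sketched as a linear scan. The field types below (a string identifier, an integer number, integer flags) and the names `TableEntry` and `find_disposition` are assumptions; the description specifies only which fields are compared.

```python
from typing import NamedTuple, Optional


class TableEntry(NamedTuple):
    identifier: str  # message identifier (field 7b)
    number: int      # message number (field 7c)
    flags: int       # disposition flags (field 7d)


def find_disposition(table, identifier, number) -> Optional[int]:
    """Scan all registered entries for one that matches the stored message."""
    for entry in table:
        if entry.identifier == identifier and entry.number == number:
            return entry.flags  # handed on to the fault determination unit
    return None                 # no match: wait for the next message
```

Returning `None` rather than `0` for a miss keeps "no entry registered" distinct from "entry registered with no flag bits set".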

FIG. 12 is the processing flow of the fault determination processing unit 10 of FIG. 1 when a software fault occurs, and FIG. 13 is the processing flow of the CPU interface processing unit (CPUI) 4 when a hardware fault occurs. Referring to FIG. 12, when control is passed from the comparison processing unit (CMP) 5, determination step 32a examines bit 7 of the disposition flag 7e passed as a parameter to decide whether to stop the computer system: a value of 1 in bit 7 means that the computer system is to be stopped. To stop the computer system, processing step 32b sends a signal on the line L1b instructing the service processor SVP 207 to stop the computer 200.

Next, processing step 32c creates the fault summary information 16 shown in FIG. 6. The fault summary information 16 consists of a site identifier field 16a, a fault occurrence date and time field 16b, a fault type field 16c, and a summary information field 16d.

For a software fault, the value of the fault type field 16c is 'S', and the fault message data is stored in the summary information field 16d. Processing step 32d prepares the disposition flag 7d shown in FIG. 3 as the parameter for the diagnostic processing unit 8 and then passes control to the diagnostic processing unit 8 (written DIAG in the flowchart of FIG. 12). At this point, the fact that the fault is a software fault is indicated by 'SOFT'.

When control returns from the diagnostic processing unit 8, processing step 32f reports the occurrence of the fault to the monitoring/maintenance computer system 250 via the transmission/reception processing unit 14. At this time, the fault summary information 16 of FIG. 6 is transferred to the monitoring/maintenance computer system 250.
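The stop decision of step 32a and the summary built in step 32c can be sketched as follows. This is a hedged illustration: the description does not say whether bit 7 is counted from the most or the least significant end, so counting from the least significant bit is an assumption here, as are the field types and all Python names.

```python
from dataclasses import dataclass
from datetime import datetime

STOP_BIT = 7  # bit 7 of the disposition flags; LSB-first numbering is an assumption


@dataclass
class FaultSummary:
    site_id: str           # field 16a
    occurred_at: datetime  # field 16b
    fault_type: str        # field 16c: 'S' for software, 'H' for hardware
    overview: str          # field 16d


def must_stop_system(flags: int) -> bool:
    """Step 32a: a 1 in the stop bit means the computer system is to be stopped."""
    return bool((flags >> STOP_BIT) & 1)


def make_software_summary(site_id, message, occurred_at):
    """Step 32c for a software fault: the type is 'S', the overview is the message."""
    return FaultSummary(site_id, occurred_at, "S", message)
```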

Next, the operation from the occurrence of a hardware fault up to the collection of fault information is described. FIG. 13 is the processing flow of the CPU interface processing unit (CPUI) 4. When a hardware fault occurs, it is reported to the CPU interface processing unit (CPUI) 4 via the signal line L2 and the encoder 1. At this time, the data register DT3 holds the reason code for the fault. Processing step 33a creates the fault summary information of FIG. 6 in the same way as for a software fault. Because the fault is a hardware fault, the value of the fault type field 16c is 'H'. In the summary information field 16d, in addition to the message data at the time of the fault, the reason code held in DT3 is stored in the reason code field 16g.

In processing step 33b, the reason code 16e is set as the parameter for the diagnostic processing unit 8, and control is passed to the diagnostic processing unit 8 (written DIAG in the flowchart of FIG. 13). At this point, the fact that the fault is a hardware fault is indicated by 'H'. When control returns from the diagnostic processing unit 8, processing step 33c transfers control to processing step 32f of FIG. 12. As described for software faults, processing step 32f reports the occurrence of the fault to the monitoring/maintenance computer system 250 via the transmission/reception processing unit 14, and the fault summary information 16 of FIG. 6 is likewise transferred to the monitoring/maintenance computer system 250.

The operation of the diagnostic processing unit 8 of FIG. 1 is described next. Control passes to the diagnostic processing unit 8 from the fault determination processing unit 10 for a software fault and from the CPU interface processing unit (CPUI) 4 for a hardware fault. The unit also operates in response to operation commands from the console device 102 or from the monitoring/maintenance computer system 250. FIGS. 14(a) and 14(b) show the processing flow when the supervisory control device 100 itself detects the occurrence of a fault.

When control is passed from the fault determination processing unit 10 or the CPU interface processing unit (CPUI) 4 because a software or hardware fault has occurred, processing step 34a prepares to obtain the logout information held by each processing unit in the monitored computer system 200.

When the diagnostic processing unit 8 obtains hardware information such as the logout information of the processing units (IU 201, EU 202, and so on), it sets the corresponding values in the address register AD2 and the data register DT3 of FIG. 1. FIG. 4 shows the values of the address register AD2 and the data register DT3 that correspond to each bit position number (bit positions 0 to n) of the disposition flag 7e of FIG. 3. The value of the address register AD2 corresponds to the number of a processing unit, and the units are addressed as follows:

1) instruction control unit (IU) 201: 1
2) execution unit (EU) 202: 2
3) input/output processing unit (IOP): 3
4) memory control unit (MCU): 4
5) main storage (MS): 5
6) service processor (SVP): 6

Because the logout information resides in the hardware use area 206b, AD2 is set to '5', which corresponds to the MS 206, and DT3 is set to the start address.
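The unit addressing above maps naturally onto a lookup table. In this Python sketch the unit numbers come from the list in the text, while the dictionary name, the function name and the idea of returning the register pair as a tuple are assumptions for illustration.

```python
UNIT_ADDRESS = {
    "IU": 1,   # instruction control unit
    "EU": 2,   # execution unit
    "IOP": 3,  # input/output processing unit
    "MCU": 4,  # memory control unit
    "MS": 5,   # main storage
    "SVP": 6,  # service processor
}


def registers_for_logout(start_address: int):
    """Return the (AD2, DT3) pair for reading logout information from main storage."""
    return UNIT_ADDRESS["MS"], start_address
```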

In processing step 34b, the collected logout information is stored in the fault information area 18c of FIG. 5, and an identification indicating logout information is set in the dump identifier 18a. The length of the collected data, in bytes, is set in the record length 18b. The next determination step 34c decides whether the fault is a software fault, based on the parameter passed to the diagnostic processing unit 8. If it is not a software fault, that is, if it is a hardware fault, processing steps 34k to 34m of FIG. 14(b) are executed.

For a software fault, that is, a fault in the operating system or the like, processing steps 34d to 34i are executed. First, processing step 34d sets the number of bits in the disposition flag 7e as the repetition count (loop count) LOOP and initializes the counter i to 0. LOOP and i are work variables and may be held either in hardware or in a work area within the processing program.

Processing steps 34e to 34i are repeated until the value of the counter i reaches the loop count LOOP. This processing examines each bit of the disposition flag bits 7e and, if the value of a bit is '1', collects the hardware information that corresponds to that bit position number. Determination step 34e examines the bit of the disposition flag bits 7e at the position given by the counter i. If nothing is specified, that is, if the value is '0', control proceeds to processing step 34i.

If the bit is specified, processing step 34f sets values in the address register AD2 and the data register DT3 and starts the logout operation of the corresponding processing unit. As FIG. 4 shows, the value of the counter i corresponds to a diagnosis number, which uniquely determines the values set in the address register AD2 and the data register DT3. Next, processing step 34g attaches the dump identifier 18a shown in FIG. 5 to the hardware information that was read and stores it in the temporary storage file 104. The dump identifier 18a is set to the identification 'OS control table' for the OS control table of diagnosis number 1, and to 'EU' for the hardware information of the execution unit (EU) 202.

Next, processing step 34h increments the counter i by 1, and processing step 34i checks whether the counter i has reached the loop count LOOP. If it has not, control returns to processing step 34e. When the counter i reaches the loop count LOOP, the operation of the diagnostic processing unit 8 ends.
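The bit-scanning loop of the software-fault path can be sketched as below. The assumptions are that bit position i is counted from the least significant bit, that the counter is advanced on every pass whether or not the bit is set, and that reading a unit is delegated to a hypothetical callback `read_unit`; the description fixes none of these details.

```python
def collect_by_flags(flags: int, bit_count: int, read_unit):
    """Gather hardware information for every disposition flag bit that is set."""
    collected = []
    i = 0                                # counter i starts at 0; bit_count is LOOP
    while i < bit_count:                 # repeat until i reaches LOOP
        if (flags >> i) & 1:             # is bit position i specified?
            data = read_unit(i)          # stands in for setting AD2/DT3 and reading
            collected.append((i, data))  # tag the data with its diagnosis number
        i += 1                           # advance the counter
    return collected
```

For example, flags `0b0101` with four bits cause the callback to run for positions 0 and 2 only.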

FIG. 14(b) is the processing flow of the diagnostic processing unit 8 when a hardware fault occurs. Control reaches this series of steps from determination step 34c of FIG. 14(a). First, processing step 34k reads hardware information from each processing unit in the computer system (CPU) 200. As described above, this is done by setting values in the address register AD2 and the data register DT3 and starting each processing unit; here, however, the information for all the diagnosis numbers 17a shown in FIG. 4 is collected.

In the next processing step 34m, the dump identifier 18a of FIG. 5 is attached to each item of hardware information that was read, and the items are stored in the temporary storage file 104.

This completes the operation performed when a fault occurs in the monitored and controlled computer system 200, namely the report to the monitoring/maintenance computer system 250 and the collection of fault information. The monitoring/maintenance computer system 250 then begins operations such as instructing that maintenance personnel be dispatched to the site, analyzing the cause of the fault, and automatically generating recovery measures.

Before these operations are described, the operation of the command interpretation processing unit 13 and the transfer processing unit 12 in the supervisory control device 100 is explained.

FIG. 8 shows the format of commands entered from the console devices 102 and 252 and the format of commands issued by the monitoring/maintenance computer system 250. The command interpretation processing unit 13 in the supervisory control device 100 can be driven by commands entered manually from the console device 102 or from the console device 252 of the monitoring/maintenance computer system 250, and also by a command data stream generated automatically and sent by the collection/analysis processing unit 23 in the monitoring/maintenance computer system 250. The list of commands shown in FIG. 8 is one embodiment of the present invention, and further commands can be added.

FIGS. 15(a) and 15(b) show the processing flow for the commands of FIG. 8. After the command data stream is obtained in processing step 35a, processing step 35b branches to the processing for each command.

(1) GETMSG: This command stores the contents of the data buffer 6 (reference 6a in FIG. 2), from the area indicated by the pointer NEXT.E 6k up to the area immediately before the one indicated by the pointer CUR.E 6j, in the temporary storage file 104 (processing step 35c).

(2) GETHARD: This command obtains the hardware information of each processing unit in the computer system 200. It starts the diagnostic processing unit 8 described above to carry out the processing, and processing steps 35d to 35f are executed.

(3) GETLOG: This command also obtains hardware information of the processing units in the computer system 200, but it collects specifically the information held in the hardware of each processing unit (diagnosis numbers 3 to 6 in FIG. 4). Processing step 35g handles this and, as in (2), starts the diagnostic processing unit 8 to carry out the processing.

(4) ACTION: This command outputs the fault recovery procedure data string sent from the monitoring/maintenance computer system 250 to an output device such as the display of the console device 102 or a hard copy device (processing step 35h). FIG. 7 shows the format of the fault recovery procedure data string 19; the series of recovery steps is stored in the recovery procedure field 19c.

(5) RECOVER: This command is used when the monitoring/maintenance computer system 250 could not generate a recovery procedure and the maintenance engineer recovered the failed computer system on site by trial and error; it transfers the resulting recovery procedure to the monitoring/maintenance computer system 250. In processing step 35i, the recovery steps entered from the console device 102 are stored one after another in the recovery procedure field 19c of FIG. 7, the other fields 19a to 19d are completed, and the data is then transferred to the monitoring/maintenance computer system 250 via the transfer processing unit 12.

(6) SUMMARY: This command displays the fault summary information of FIG. 6. When it is issued from the console device 102 of the supervisory control device attached to the failed computer system, processing step 35j displays the fault summary information 16 of FIG. 6 on the display screen of the console device 102. When it is issued from the console device 252 of the monitoring/maintenance computer system 250 or from the collection/analysis processing unit 23, processing step 35k transfers the fault summary information 16 of FIG. 6 to the computer system 250 via the transfer processing unit 12.

(7) TRANSFER: This command transfers the fault information stored in the temporary storage file 104 to the computer system 250 in processing step 35m. In doing so, the transfer processing unit 12 performs the transfer data reduction processing 12a shown in FIG. 9: where the same data string is repeated in the fault information 28a, a symbol 28c is inserted in its place and the resulting new data string 28b is transferred. This reduces the amount of data passing over the line L4 of FIG. 1.

(8) DISPLAY: This command is used when the on-site maintenance engineer wants to display the fault information stored in the temporary storage file 104 on the console device 102, or when that information is to be displayed on the console device 252 of the monitoring/maintenance computer system 250. In processing step 35n, the contents of the temporary storage file 104 and the data buffer 6 are displayed on the console device concerned.

This completes the description of the command interpretation processing unit 13. In the processing of the ACTION command (4) described above, an embodiment was disclosed in which processing step 35h displays the recovery procedure to maintenance personnel. For recoveries that do not involve replacing hardware parts, however, recovery is sometimes possible without any operation by maintenance personnel. That embodiment is described later with reference to FIG. 21 and the subsequent figures.

Next, the operation of the monitoring/maintenance computer system 250 when it receives a fault report from the supervisory control device is described. The report arrives as the fault summary information 16 shown in FIG. 6: the communication processing unit 21 receives it over line L4 and passes control to the collection/analysis processing unit 23.

FIG. 16(a) shows the processing flow of the collection/analysis processing unit 23 when a fault report is received. First, in processing step 36a, the received fault summary information 16 is stored in BUF 24 and in the fault information file 256.

Maintenance personnel may be dispatched to the site at this point, or later, once the recovery procedure has been determined in processing step 36d. Next, in processing step 36b, the TRANSFER command described with FIG. 8 is issued to the supervisory control device 100 to collect the detailed fault information. In processing step 36c, the collected fault information is stored temporarily in the fault information file 256.

Next, the fault summary information 16 and control are passed to the precedent search/registration processing unit 26, labeled STG in FIG. 16(a). The request at this point is a 'search' request. The precedent search/registration processing unit 26 retrieves past fault cases similar to the fault summary information 16 from the precedent storage file 254 and stores them in BUF 24. This processing is described later with reference to FIG. 17. Once the past fault cases have been stored in BUF 24, the collation processing unit (EXTR) 25 is started.

The collation processing unit (EXTR) 25 compares each candidate down to the detailed fault information. If a candidate matches, control is returned together with that precedent; if no matching fault case exists, control is returned with a 'no match' indication.

If a past fault case matched, the recovery instruction generation processing unit 27, labeled GEN in FIG. 16(a), is started to generate the recovery procedure. It creates the recovery information 19 shown in FIG. 7.

The collection/analysis processing unit 23 then transfers the recovery information 19 to the supervisory control device 100 in processing step 36d.

If, on the other hand, no past fault case matched, the precedent search/registration processing unit (STG) 26 is started to register the current fault as a new entry in the precedent storage file 254. In this case the request is marked as a 'register' request.

At registration time the entry is stored without a recovery procedure. The entry is completed later, when the recovery procedure has been determined or after the fault has been recovered on site, by maintenance personnel entering the RECOVER command of FIG. 8.

When the precedent search/registration processing unit (STG) 26 has finished registering the fault case, control returns to the collection/analysis processing unit 23.

In processing step 36f, the collection/analysis processing unit 23 stores a mark in the recovery procedure field 19c of FIG. 7 indicating that no recovery procedure could be generated, and transfers the data string 19 to the supervisory control device 100. Then, in processing step 36g, the same indication is displayed on console device 252 of the monitoring/maintenance computer system 250 so that experienced staff can be consulted. In processing step 36h, an expert analyzes the fault by entering the commands shown in FIG. 8 from console device 252 and looks for a recovery procedure. The on-site maintenance personnel are searching for a recovery procedure at the same time, so the two investigations proceed in parallel. If the monitoring/maintenance computer system 250 side finds the recovery procedure first, the recovery information 19 of FIG. 7 is transferred to the supervisory control device 100.
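The flow of FIG. 16(a) can be summarized as a short sketch. It is not the patented implementation: the processing units are passed in as plain callables, and every name and dictionary key here is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, Optional


def handle_fault_report(
    summary: dict,
    fetch_detail: Callable[[], dict],                 # step 36b: issue TRANSFER
    search: Callable[[dict], list],                   # STG, 'search' request
    collate: Callable[[list, dict], Optional[str]],   # EXTR
    register: Callable[[dict, dict], None],           # STG, 'register' request
    send_recovery: Callable[[dict], None],            # steps 36d and 36f
    notify_console: Callable[[str], None],            # step 36g
) -> Optional[str]:
    """Return the recovery procedure, or None if no precedent matched."""
    fault_file = [summary]                  # step 36a: keep the summary
    detail = fetch_detail()                 # step 36b: collect detailed fault data
    fault_file.append(detail)               # step 36c: store it temporarily

    candidates = search(summary)            # similar past cases go to BUF
    procedure = collate(candidates, detail)

    if procedure is not None:
        send_recovery({"procedure": procedure})   # GEN, then step 36d
        return procedure

    register(summary, detail)               # new precedent, no procedure yet
    send_recovery({"procedure": None})      # step 36f: mark 'not generated'
    notify_console("no recovery procedure could be generated")  # step 36g
    return None
```

Passing the units in as callables keeps the sketch independent of how BUF 24 and the files 254 and 256 are actually stored.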

FIG. 16(b) shows the recovery-procedure registration processing of the collection/analysis processing unit 23. The recovery procedure may be entered with the RECOVER command of FIG. 8 either from console device 102 on the supervisory control device 100 side or from console device 252 of the monitoring/maintenance computer system 250. The operation of the interpretation command processing unit 22 of the monitoring/maintenance computer system 250 can be regarded as basically the same as that of the command interpretation processing unit 13 on the supervisory control device 100 side.

In processing step 36i, the recovery information 19 of FIG. 7 is obtained. Next, the precedent search/registration processing unit (STG) 26 is started to complete the fault precedent that was registered earlier without a recovery procedure.

Next, the operation of the precedent search/registration processing unit 26, the collation processing unit 25, and the recovery instruction generation processing unit 27 is described with reference to FIG. 10 and FIGS. 17 to 20. FIG. 10 shows the structure of the precedent storage file 254. Within the file, hardware faults and software faults are stored separately. This is done to speed up searching; it does not mean that they must be stored in separate files. As FIG. 10 shows, hardware faults and software faults are each managed through the management table 30. For hardware fault precedents, the table holds the number of registered hardware faults (K) 30a and the storage area pointer (H) 30b, and the precedents themselves are located through the storage area pointer (H) 30b. One precedent record 29 consists of a reason code 29a, a fault message 29b, a related message 29c, a cause field 29d, fault information 29f, a recovery procedure 29g, and statistical information 29h. The reason code 29a holds the reason code 16e from the fault summary information 16 of FIG. 6, and the fault message 29b likewise holds the fault message field 16d. The related message 29c holds a message related to the fault message 29b, or the message that triggered it: it is found by searching backward from the message indicated by CUR,E (6j) in the data buffer 6 of FIG. 6, and the corresponding message is stored in area 29c. The fault information 29f holds the information collected when the fault occurred, as described earlier.

The recovery procedure 29g holds the recovery procedure that was applied to the fault, for example the contents of the recovery procedure field 19c of FIG. 7. The statistical information 29h holds statistics such as the number of times the fault has occurred.

Software fault information is managed in the same way: the management table 30 holds the number of registered software faults (L) 30c and the storage area pointer (S) 30d. One precedent record 41 in that area is identical to a hardware fault record except that it has no reason code.
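The layout of the precedent storage file 254 maps naturally onto two record lists behind one management table. The sketch below is an assumption-laden illustration, not the file format of FIG. 10: the class and attribute names are invented, Python lists stand in for the storage area pointers (H) and (S), and only the fields used by the search and collation steps are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Precedent:
    fault_message: str                         # 29b / 41b
    related_message: str = ""                  # 29c / 41c
    fault_info: bytes = b""                    # 29f / 41f
    recovery_procedure: Optional[str] = None   # 29g / 41g, empty until RECOVER
    occurrences: int = 0                       # part of statistics 29h / 41h
    reason_code: Optional[int] = None          # 29a, hardware records only


@dataclass
class PrecedentFile:
    # One file, two separately searched areas (management table 30).
    hardware: List[Precedent] = field(default_factory=list)   # reached via (H) 30b
    software: List[Precedent] = field(default_factory=list)   # reached via (S) 30d

    @property
    def hardware_count(self) -> int:   # (K) 30a
        return len(self.hardware)

    @property
    def software_count(self) -> int:   # (L) 30c
        return len(self.software)

    def add(self, record: Precedent) -> None:
        """File the record by kind: a reason code marks a hardware fault."""
        area = self.hardware if record.reason_code is not None else self.software
        area.append(record)
```

Keeping the two areas apart means a hardware search never has to scan software entries, which is the speed benefit the description mentions.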

FIG. 17 shows the search processing of the precedent search/registration processing unit (STG) 26. Referring to FIG. 17, processing steps 37a and 37b first compare the summary field 16d of the fault summary information 16 with each entry 29, 41 in the precedent storage file 254 of FIG. 10. For a hardware fault the comparison is against the reason code 29a; for a software fault it is against the fault message 41b. If an entry equal to the fault summary information 16 is found, processing step 37c stores it in BUF 24 as one of the candidates. Decision step 37d repeats processing steps 37a to 37c until all entries have been examined. As a result, the candidates corresponding to the fault summary information 16 are stored in BUF 24.
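The candidate search of FIG. 17 amounts to a single pass with one comparison key per fault type. The following is a minimal sketch under stated assumptions: records are plain dictionaries, and the key names `reason_code` and `fault_message` are illustrative labels for fields 29a and 41b.

```python
from typing import Dict, List


def find_candidates(summary: Dict, entries: List[Dict], hardware: bool) -> List[Dict]:
    """Collect every stored entry whose key field equals that of the summary."""
    # Hardware faults are keyed by reason code, software faults by message text.
    key = "reason_code" if hardware else "fault_message"
    buf: List[Dict] = []                  # plays the role of BUF 24
    for entry in entries:                 # step 37d: loop over all entries
        if entry[key] == summary[key]:    # steps 37a, 37b: compare
            buf.append(entry)             # step 37c: keep as a candidate
    return buf
```

The coarse filter deliberately over-selects; narrowing the candidates down to one precedent is left to the collation unit.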

After this, the collation processing unit 25 extracts the precedent that matches the fault that occurred.

FIG. 18 shows the registration processing of the precedent search/registration processing unit (STG) 26. First, decision step 38a determines whether fault information or recovery information is being registered. For fault information, processing step 38b is executed; for recovery information, processing steps 38c and 38d are executed. When fault information is registered, processing step 38b stores the fault summary information 16 of FIG. 6 and the detailed information in information area 29 or information area 41, according to whether the fault is a hardware fault or a software fault. When recovery information is registered, the precedent record 29 or 41 corresponding to the recovery information 19 of FIG. 7 is first located. The recovery procedure 19c of FIG. 7 is then stored in the recovery procedure 29g or 41g of the record that was found.

FIGS. 19(a) and 19(b) are processing flow diagrams of the collation processing unit 25 of FIG. 1. Processing step 39a determines whether a hardware fault or a software fault is being collated. For a hardware fault, processing steps 39i to 39p of FIG. 19(b) are executed. For a software fault precedent, processing steps 39b to 39h are executed.

First, in processing step 39b, the work variable Count is set to the number of candidate precedents stored in BUF 24, and the counter i is initialized to 0. As in FIG. 14(a), the work variable Count and the counter may be implemented in hardware or allocated in a work area.

Processing step 39c increments the counter i by 1, and processing step 39d then checks whether i exceeds Count.

If i exceeds Count, no precedent matched the fault and processing ends with a 'no match' result. Otherwise, decision step 39e checks whether the fault message 41b is equal; if not, processing returns to step 39c. Decision step 39f then checks whether the related message 41c is equal; if not, processing again returns to step 39c. If both the fault message 41b and the related message 41c match, decision step 39g checks whether the fault information 41f matches. If it does not, processing returns to step 39c. If it does, processing step 39h passes the recovery procedure 41g of the entry indicated by counter i to the unit that called the collation processing unit 25, namely the collection/analysis processing unit 23, and processing ends with a 'match' result. At this point the fault occurrence count in the statistical information 41h is incremented by 1.

FIG. 19(b) shows the collation of hardware fault precedents. First, in processing step 39j, the work variable Count is set to the number of candidate precedents stored in BUF 24, and the counter i is initialized to 0. Processing step 39k increments the counter i by 1, and processing step 39m checks whether i exceeds Count. If it does, no precedent matched the fault and processing ends with a 'no match' result. Otherwise, decision step 39n checks whether the reason code 29a is equal; if not, processing returns to step 39k. Decision step 39o then checks whether the fault information 29f matches; if not, processing returns to step 39k. If it matches, processing step 39p passes the recovery procedure 29g of the entry indicated by counter i to the unit that called the collation processing unit 25, namely the collection/analysis processing unit 23, and processing ends with a 'match' result. At this point the fault occurrence count in the statistical information 29h is incremented by 1.
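Both collation loops of FIG. 19 share one shape and differ only in the fields compared, so a single sketch can cover them. The dictionary keys and the two key tuples below are assumptions for illustration; the 1-based counter mirrors the Count and i variables in the flow description.

```python
from typing import Dict, List, Optional, Sequence

# Fields compared in order: software (steps 39e, 39f, 39g), hardware (39n, 39o).
SOFTWARE_KEYS = ("fault_message", "related_message", "fault_info")
HARDWARE_KEYS = ("reason_code", "fault_info")


def collate(candidates: List[Dict], fault: Dict, keys: Sequence[str]) -> Optional[str]:
    """Return the recovery procedure of the first fully matching candidate."""
    count = len(candidates)      # work variable Count
    i = 0                        # counter i starts at 0
    while True:
        i += 1                   # step 39c / 39k
        if i > count:            # step 39d / 39m: candidates exhausted
            return None          # 'no match'
        entry = candidates[i - 1]
        # all() stops at the first differing field, like the chained decisions.
        if all(entry[k] == fault[k] for k in keys):
            entry["occurrences"] += 1            # update the statistics
            return entry["recovery_procedure"]   # step 39h / 39p: 'match'
```

Calling `collate(candidates, fault, HARDWARE_KEYS)` gives the hardware variant; a `None` result is what triggers the 'register' path described earlier.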

This completes the description of the collation processing unit 25 of FIG. 1.

Next, the operation of the recovery instruction generation processing unit 27 is described.

FIG. 20 is a processing flow diagram of the recovery instruction generation processing unit 27. First, processing step 40a obtains the recovery procedure 29g or 41g of the precedent record 29 or 41 that was collated and selected in BUF 24. Processing step 40b then creates the recovery information 19 of FIG. 7: the recovery procedure 29g or 41g obtained in step 40a is stored in field 19c. Fields 19a and 19b of FIG. 7 are obtained by copying fields 16a and 16b of FIG. 6. The recovery date/time field 19d is set to the date and time at which the recovery procedure was generated in this step.
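Assembling the recovery information 19 is a matter of copying two fields and adding two. In the sketch below, the key names `field_a` and `field_b` are placeholders for fields 16a/19a and 16b/19b, whose contents are not described in this passage, and the optional `now` parameter is an addition made only so the example is easy to test.

```python
from datetime import datetime
from typing import Dict, Optional


def make_recovery_info(summary: Dict, procedure: str,
                       now: Optional[datetime] = None) -> Dict:
    """Build the recovery information from a fault summary and a procedure."""
    return {
        "field_a": summary["field_a"],    # 19a, copied from 16a
        "field_b": summary["field_b"],    # 19b, copied from 16b
        "procedure": procedure,           # 19c, taken from 29g or 41g
        # 19d: the moment the procedure was generated
        "recovered_at": now if now is not None else datetime.now(),
    }
```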

Next, the automatic recovery operation is described for cases in which no intervention by maintenance personnel, such as replacing a hardware package, is needed on the supervisory control device 100 side. FIG. 21 shows an example in which the recovery procedure field 19c of the recovery information 19 of FIG. 7 holds a data string 45b of the console operations performed for a standard restart of the computer system. FIG. 22 shows the processing flow of the command interpretation processing unit 13 in this case. In addition to the commands listed in FIG. 8, the command interpretation processing unit 13 only needs to examine the contents of the recovery information 19 sent from the monitoring/maintenance computer system 250 and act on them. First, in processing step 43a, in response to the command string 44 of FIG. 21, the controlled computer system 200 is stopped and a system reset command is issued. This is done by storing the command data for these operations in data buffer 6, from which they are sent over line L1b to the service processor SVP 207. Decision step 43b then examines data 45a to determine whether a standard restart is specified. If not, the subsequent data string 45b is output to console device 102, a hard copy device, or the like, as described for the ACTION command.

If a standard restart is specified, processing step 43c stores the recovery-procedure data string 45b in data buffer 6 and screen buffer 9. The data string 45b is thereby displayed on console device 102 as well, and can be passed over lines L1a and L1b to the service processor SVP 207 and the operating system OS 208a of computer system 200.
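The decision made in FIG. 22 can be sketched as follows. Everything concrete here is assumed for illustration: the command strings `"STOP"` and `"SYSTEM RESET"`, the dictionary keys, and the two callables that stand in for the path to the service processor and for the console display.

```python
from typing import Callable, Dict


def apply_recovery(info: Dict,
                   svp_send: Callable[[str], None],
                   console_show: Callable[[str], None]) -> str:
    """Act on received recovery information; return 'auto' or 'manual'."""
    # Step 43a: stop the controlled system and reset it (hypothetical names).
    svp_send("STOP")
    svp_send("SYSTEM RESET")

    if info["standard_restart"]:          # step 43b: examine data 45a
        for op in info["operations"]:     # step 43c: data string 45b
            console_show(op)              # screen buffer: visible to operators
            svp_send(op)                  # data buffer: replayed to the system
        return "auto"

    # Not a standard restart: only show the procedure to maintenance personnel.
    for op in info["operations"]:
        console_show(op)
    return "manual"
```

Echoing each replayed operation to the console keeps on-site staff able to follow an unattended restart as it happens.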

[Effect of the invention]

According to the present invention, a supervisory control device is placed at the controlled computer system to monitor faults and support recovery. It monitors the behavior of the computer system continuously and, when a fault occurs, automatically collects the fault information of that computer system and transfers it to the remote monitoring/maintenance computer system. On receiving the fault report, the monitoring/maintenance computer system dispatches maintenance personnel immediately, generates an appropriate recovery procedure by collating the fault information against past fault precedents, and automatically presents instructions to the maintenance personnel when they arrive on site. The time from the occurrence of a fault to the recovery of the computer system can therefore be shortened considerably.

Furthermore, for faults that do not require intervention by maintenance personnel, the computer system is restarted automatically, which brings unattended operation a step closer.

Furthermore, because the remote monitoring/maintenance computer system monitors faults in multiple computer systems, a small staff can operate many computer systems.

Furthermore, because the remote monitoring/maintenance computer system stores the fault information and recovery procedures of multiple computer systems in one place, fault management information can be accumulated.

[Brief explanation of the drawing]

FIG. 1 shows the characteristic configuration of the fault monitoring device and control method for an information processing system according to the present invention.
FIG. 2 shows the structure of buffer 6 in FIG. 1.
FIG. 3 shows the structure of comparison table 7.
FIG. 4 shows the correspondence between the diagnosis numbers used by the diagnostic processing unit 8 and the values set in the address register (AD) 2 and the data register (DT) 3.
FIG. 5 shows the data format used when fault information is stored in temporary storage file 104 and when fault information is transferred to the monitoring/maintenance computer system 250.
FIG. 6 shows the data format used when fault summary information is transferred from the supervisory control device 100 to the monitoring/maintenance computer system 250 at the time of a fault.
FIG. 7 shows the data format used when a recovery procedure is transferred from the monitoring/maintenance computer system 250 to the supervisory control device 100, or, after on-site recovery work by maintenance personnel, from the supervisory control device 100 to the monitoring/maintenance computer system 250.
FIG. 8 shows the format of commands from console devices 102 and 252 and of commands from the monitoring/maintenance computer system 250 to the supervisory control device 100.
FIG. 9 shows one example of the data-reduction processing applied when fault information is transferred from the supervisory control device 100 to the monitoring/maintenance computer system 250.
FIG. 10 shows the structure of the precedent storage file 254.
FIG. 11 is a processing flow diagram of the comparison processing unit (CMP) 5 of FIG. 1.
FIG. 12 is a processing flow diagram of the fault determination processing unit 10.
FIG. 13 is a processing flow diagram of the CPU interface processing unit (CPUI) 4.
FIGS. 14(a) and 14(b) are processing flow diagrams of the diagnostic processing unit 8.
FIGS. 15(a) and 15(b) are processing flow diagrams of the command interpretation processing unit 13 in the supervisory control device 100.
FIGS. 16(a) and 16(b) are processing flow diagrams of the collection/analysis processing unit 23 in the monitoring/maintenance computer system 250.
FIG. 17 is a processing flow diagram of the search processing of the precedent search/registration processing unit 26.
FIG. 18 is a processing flow diagram of the registration processing of the precedent search/registration processing unit 26.
FIGS. 19(a) and 19(b) are processing flow diagrams of the collation processing unit 25.
FIG. 20 is a processing flow diagram of the recovery instruction generation processing unit 27.
FIG. 21 shows an example in which the recovery procedure field 19c of the recovery information 19 of FIG. 7 holds a data string of the console operations performed for a standard restart of the computer system.
FIG. 22 shows the processing flow of the command interpretation processing unit 13 in that case.

1: encoder, 2: address register (AD), 3: data register (DT), 4: CPU interface processing unit, 5: comparison processing unit (CMP), 6: data buffer, 7: comparison table, 8: diagnostic processing unit, 9: screen buffer, 10: fault determination processing unit, 11: distributor, 12: transfer processing unit, 13: command interpretation processing unit, 14: transmission/reception processing unit, 21: communication processing unit, 22: interpretation command processing unit, 23: collection/analysis processing unit, 24: work buffer BUF, 25: collation processing unit, 26: precedent search/registration processing unit, 27: recovery instruction generation processing unit, 100: supervisory control device, 102: console device, 104: temporary storage file, 200: information processing system (computer system) under supervisory control, 250: monitoring/maintenance computer system, 252: console device, 254: precedent storage file, 256: fault information storage file.

Claims

[Scope of Claims]

1. A fault monitoring device and control method for an information processing system, in a configuration comprising a first information processing system, which has at least one central processing unit with a main storage device, a group of input/output devices, a console function for maintenance operations on the central processing unit, and a console function for displaying messages of, and entering commands to, an operating system running on the central processing unit, and a second information processing system, which monitors and controls the operation of the first information processing system from a remote location, characterized by comprising: first storage means for storing a fixed number of message data items of the operating system of the first information processing system or command data items entered from a console display device; second storage means for storing operating-system message data used to detect an abnormal state; first comparison means for comparing message data of the operating system with the contents of the second storage means; control means for notifying the remote second information processing system of the occurrence of a fault when the comparison finds a match; and control means for sending the contents of the first storage means to the remote second information processing system on a command from the second information processing system.

2. A fault monitoring device and control method for an information processing system, in a configuration comprising a first information processing system, which has at least one central processing unit with a main storage device, a group of input/output devices, a console function for maintenance operations on the central processing unit, and a console function for displaying messages of, and entering commands to, an operating system running on the central processing unit, and a second information processing system, which monitors and controls the operation of the first information processing system from a remote location, characterized by comprising: control means for receiving a signal indicating that a hardware fault has occurred in the central processing unit of the first information processing system; control means for notifying the remote second information processing system of the occurrence of the fault; control means for reading out the hardware state of the central processing unit in the first information processing system on a command from the second information processing system; and control means for sending the read-out information to the second information processing system.

3. A fault monitoring device and control method for an information processing system, in a configuration comprising a first information processing system, which has at least one central processing unit with a main storage device, a group of input/output devices, a console function for maintenance operations on the central processing unit, and a console function for displaying messages of, and entering commands to, an operating system running on the central processing unit, and a second information processing system, which monitors and controls the operation of the first information processing system from a remote location, characterized in that the second information processing system is provided with control means for obtaining, when the second information processing system receives a report of the occurrence of a fault in the first information processing system, the contents of the first storage means or the hardware state of the central processing unit in the first information processing system.

4. A fault monitoring device and control method for an information processing system, in a configuration comprising a first information processing system, which has at least one central processing unit with a main storage device, a group of input/output devices, a console function for maintenance operations on the central processing unit, and a console function for displaying messages of, and entering commands to, an operating system running on the central processing unit, and a second information processing system, which monitors and controls the operation of the first information processing system from a remote location, characterized by comprising, in the second information processing system: third storage means for storing fault precedents; control means for comparing, when a fault occurs in the first information processing system and the second information processing system has obtained the fault information of the first information processing system, that fault information with the precedents in the third storage means; and control means for generating, when the comparison finds a matching fault precedent, a processing procedure for recovering from the fault in the first information processing system.

5. The fault monitoring device and control method for an information processing system according to claim 4, further comprising: control means by which the recovery processing procedure obtained by the control means of claim 4 that generates the processing procedure for recovering from a fault in the first information processing system is transmitted from the second information processing system when maintenance personnel enter, from the console display device of the first information processing system, a command to refer to the recovery processing procedure; and control means for displaying the received recovery processing procedure on the console display device of the first information processing system.

6. The contents of the second storage means recited in claim 1 are stored on a console connected to the first information processing system.
Claim 1, characterized by comprising a control means that can be changed according to an instruction from a display device or a control means that can be changed according to an instruction from a second information processing system.
Fault monitoring device and control method for the information processing system described in Section 1. 7. The signal indicating the occurrence of a failure as set forth in claim 2 is generated by a specific monitoring program running under the first information processing system constantly inspecting the state of the hardware;
As a result, a fault monitoring device for an information processing system according to claim 2, further comprising a control means for executing a command to generate a signal indicating the occurrence of the fault when an abnormality is detected. Control method. 8. The signal indicating the occurrence of a failure as set forth in claim 2 indicates that a hardware monitoring mechanism operating under the first information processing system detects an abnormality in the operating state of the hardware. 3. A fault monitoring device and control method for an information processing system according to claim 2, further comprising a control means for executing an instruction to generate a signal indicating that the fault has occurred. 9. The control means for reading the state of the hardware according to claim 2 is held by the control means for addressing each processing unit of the central processing unit and each addressed processing unit. 3. A fault monitoring device and control method for an information processing system according to claim 2, characterized in that the device comprises a control means that receives the state of the hardware via a dedicated data line. 10. In the control means for reading out the state of the hardware according to claim 2, for the main storage device,
3. A fault monitoring device and control method for an information processing system according to claim 2, further comprising means for reading out data in a special area that cannot be accessed by an operating system. 11. In the control means for reading the state of the hardware according to claim 2 or 10, the storage area address in the main storage device is sent to the main storage device, and the corresponding data is sent to the main storage device. A fault monitoring device and control method for an information processing system according to claim 2 or claim 10, characterized by comprising a control means for obtaining the following. 12. The hardware state information read by the control means for reading out the hardware state according to claim 2, 10, or 11 is stored in the fourth storage means. Claim 2, or 10, or 11, characterized in that the claim comprises a control means for
Fault monitoring device and control method for the information processing system described in Section 1. 13. The system is characterized by comprising a control means for editing and sending the fault information as set forth in claim 1 or 2 when the fault information is sent to the second information processing system. A fault monitoring device and control method for an information processing system according to claim 1 or 2. 14. The editing means described in the preceding paragraph is characterized in that when editing the contents of the first storage means, the processing step consists of extracting only message data from the operating system and command data from the console display device. A fault monitoring device and control method for an information processing system according to claim 13. 15. In the editing means according to claim 13, when editing the hardware information, the control means edits the contents of the fourth storage means, and when editing the contents of the main storage device, the same data is contiguous. 14. The fault monitoring device and control method for an information processing system according to claim 13, further comprising a processing step of replacing the fault with data indicating that fact. 16. The information processing system as set forth in claim 4, wherein the third storage means as set forth in claim 4 stores phenomena and causes of failure occurrence, and recovery history. fault monitoring device and control method. 17. As a result of checking with the third storage means described in claim 4, if no matching precedent is found, the phenomenon of the occurrence of the failure is added to the third storage means as a new failure precedent. 5. A fault monitoring device and control method for an information processing system according to claim 4, further comprising a control means for storing the information. 18. 
If no matching precedent is found as a result of checking with the third storage means described in claim 4, the maintenance personnel will later perform the recovery processing procedure from the console display device of the first information processing system. The method includes a control means for displaying that no case law is found when a reference command is input, and a processing means for displaying the contents of the first storage means and the fourth storage means on a console display device. A fault monitoring device and control method for an information processing system according to claim 5. 19. As a result of checking with the third storage means recited in claim 17, if no matching precedent is found and the cause of the failure and the recovery procedure are later found, the cause of the failure and the recovery procedure 18. The fault monitoring device and control method for an information processing system according to claim 17, further comprising a control means for correcting and storing the corrected data in the third storage means in response to a fault phenomenon. 20. As a result of checking with the third storage means described in claim 4, if no matching precedent is found, the second storage means
Information according to claim 5, characterized in that the display output device of the information processing system is equipped with processing means for displaying and outputting the failure information obtained from the first storage means and the fourth storage means. Processing system failure monitoring device and control method. 21. At least one central processing unit having a main memory, a group of input/output devices, the functions of a console for maintenance operations of the central processing unit, and the messages of an operating system operating under the central processing unit. In the configuration of a first information processing system equipped with a console function that allows display and input of commands, and a second information processing system that monitors and controls the operation of the first information processing system from a remote location,
When a failure occurs in the first information processing system and the second information processing system obtains failure information of the first information processing system, a second information processing system that stores failure precedents in the second information processing system; If there is a matching fault case as a result of the comparison, the first information processing system is
a control means for generating a processing procedure for recovering from a fault in the information processing system;
The control means for transferring the information to the control device monitoring the information processing system and the control device monitoring the first information processing system restart the first information processing system based on the recovery processing procedure. A fault monitoring device and a control method for an information processing system, characterized by comprising a control procedure for controlling the same. 22. At least one central processing unit having a main memory, a group of input/output devices, the functions of a console for maintenance operations of the central processing unit, and the messages of an operating system operating under the central processing unit. In the configuration of a first information processing system equipped with a console function that allows display and input of commands, and a second information processing system that monitors and controls the operation of the first information processing system from a remote location,
Claims 1, 2, or 3, characterized in that the second information processing system is equipped with a control means that can monitor and control the operations of the plurality of first information processing systems, or A fault monitoring device and control method for an information processing system according to item 4 or 21.
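
For illustration only, the message-monitoring arrangement of claim 1 (with the pattern update of claim 6) might be modelled in software roughly as follows. This is a minimal sketch, not the disclosed implementation: the class name `ConsoleMonitor`, its method names, the substring matching rule, and the sample message texts are all assumptions introduced here, and the `notify` callback merely stands in for the line to the remote second information processing system.

```python
from collections import deque


class ConsoleMonitor:
    """Hypothetical software model of the monitoring logic in claim 1."""

    def __init__(self, abnormal_patterns, history_size=100, notify=print):
        # "First storage means": keeps only the most recent history_size
        # console entries; older entries are discarded automatically.
        self.history = deque(maxlen=history_size)
        # "Second storage means": message texts that indicate an abnormal state.
        self.abnormal_patterns = set(abnormal_patterns)
        # Stand-in for the notification path to the remote (second) system.
        self.notify = notify

    def record(self, kind, text):
        """Store one console entry; kind is 'message' or 'command'.

        Returns True when the entry triggered a fault notification.
        """
        self.history.append((kind, text))
        # "First comparison means": only operating system messages are
        # compared with the abnormal patterns; commands are just logged.
        if kind == "message" and self._matches(text):
            self.notify(text)
            return True
        return False

    def _matches(self, text):
        # Assumption: a pattern matches when it occurs anywhere in the message.
        return any(pattern in text for pattern in self.abnormal_patterns)

    def set_patterns(self, patterns):
        """Replace the pattern table (cf. claim 6: local or remote update)."""
        self.abnormal_patterns = set(patterns)

    def dump_history(self):
        """Return the stored entries, oldest first, on request from the remote side."""
        return list(self.history)
```

In such a sketch, the remote side would call `dump_history()` after receiving a notification in order to see the messages and commands that preceded the fault. A bounded `deque` is used so that the "fixed number" of stored entries in claim 1 is enforced without explicit bookkeeping.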