JPS58144263A

JPS58144263A - Fault processing system for dispersion processing system

Info

Publication number: JPS58144263A
Application number: JP57026043A
Authority: JP
Inventors: Masahiro Sakata; 正博坂田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1982-02-22
Filing date: 1982-02-22
Publication date: 1983-08-27

Abstract

PURPOSE: To reduce fault repair time by executing automatic diagnosis, automatic memory dumping, and automatic IPL under the control of a console service processor in a distributed processing system having a host and plural subhosts. CONSTITUTION: In addition to a console service processor 22 and a service file, a program file 23, a diagnosis program file 24, and a memory dumping file 25 are connected to a subhost 2. When a fault occurs during normal operation, the console service processor 22 detects the fault, loads from the diagnosis program file 24, the memory dumping file 25, and the program file 23 as required, performs automatic diagnosis for hardware faults and automatic memory dumping followed by automatic initial program loading for software faults, and reports the results to a host 1. The fault can thus be classified on the host 1 side, and repair and recovery time is reduced, improving maintenance efficiency and system reliability.

Description

Detailed Description of the Invention

The present invention relates to a distributed processing system having a host and a plurality of sub-hosts, and in particular to a failure handling method suited to shortening sub-host failure repair time, improving maintenance efficiency, and improving system reliability.

Prior Art

In distributed processing systems having a host and a plurality of sub-hosts, various unmanned operation schemes have been tried that assume no operator or maintenance personnel is present at the sub-host. However, because the sub-host lacks functions such as automatic hardware diagnosis, automatic software memory dump, and automatic initial program load (IPL), the current practice is that maintenance personnel are dispatched to the sub-host once the host or another unit detects a failure. This causes the following problems.

(1) Because it is not known at dispatch time whether the failure is in hardware or in software, both hardware and software maintenance personnel must be dispatched. Moreover, fault isolation begins only after the maintenance personnel arrive at the sub-host site, so isolating the fault takes time.

(2) Even for a hardware failure, the faulty part has not been identified when the maintenance personnel are dispatched. They identify the faulty part only after arriving at the sub-host site and procure maintenance parts after that, so failure recovery takes longer.

(3) After a software failure, the system is likely to run again once it is re-IPLed. However, because there is no automatic re-IPL function and no automatic memory dump function for later software failure analysis, there are problems both for system reliability and for improving software quality.

Object of the Invention

The object of the present invention is to solve the problems of the prior art described above and to shorten failure repair time, improve maintenance efficiency, and improve system reliability in a distributed processing system.

The feature of the present invention is that the sub-host has a file memory, such as a magnetic disk unit, that stores a diagnostic program (MD), a memory dump program, and the normal program, and that automatic diagnosis, automatic memory dump, and automatic IPL are executed under the control of a console service processor. This makes it possible to shorten failure repair time, improve maintenance efficiency, and improve system reliability.

Description of the Embodiments

FIG. 1 is a block diagram of a distributed processing system having a host and a plurality of sub-hosts. In the figure, 1 is a host processing device (hereinafter, host). Sub-host processing devices 2 (hereinafter, sub-hosts) are each connected to the host 1 via a communication line 11, and a group of terminal devices 3 is connected to each sub-host 2 via a terminal line 12.

FIG. 2 is a block diagram of one embodiment of the present invention. For convenience, it shows only the connection between the host and one sub-host. In FIG. 2, in addition to a console service processor 22 and business files, a program file 23, a diagnostic program (MD) file 24, and a memory dump file 25 are connected to the sub-host 2. Here, a memory dump means writing the contents of main memory (programs, data, and so on) to an external storage device, and it is very useful for analyzing software failures.

The memory dump program is loaded from file 25 into main memory and then executes the memory dump. Reference numeral 21 denotes the hardcore section within the sub-host 2. FIG. 3 is a flowchart explaining the operation of this embodiment.

Operation of the sub-host 2 starts when, on activation from the console service processor 22 or from the host 1, the program stored in the program file 23 is loaded into the sub-host 2. In normal operation, a message is entered from a terminal device 3. After the message arrives at the sub-host 2, the sub-host performs certain processing, and the message is then transferred from the sub-host 2 to the host 1 for processing there. When processing is complete, the host 1 sends a response message to the terminal device 3 via the sub-host 2.

Next, operation when a failure occurs during normal operation is described. When a machine check, which is a failure of the sub-host 2 itself, occurs during normal operation, the console service processor 22 detects the machine check and then loads the diagnostic program from the MD file 24. The loaded diagnostic program diagnoses the sub-host 2 through the hardcore section 21 and then transfers the diagnosis result to the host 1.

On receiving the diagnosis result, the host 1 outputs the name of the failed sub-host and the diagnosis result to the maintenance personnel.

The maintenance personnel so notified repair the sub-host 2 on the basis of that result.
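The machine-check path described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class names `Host` and `ConsoleServiceProcessor`, the sub-host name `subhost-A`, and the result string are all assumptions introduced here, and the diagnostic program in MD file 24 is modeled as a plain Python callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Host:
    """Hypothetical stand-in for host 1; collects reports for maintenance staff."""
    reports: List[str] = field(default_factory=list)

    def receive_diagnosis(self, subhost_name: str, result: str) -> None:
        # The host outputs the failed sub-host's name together with the result.
        self.reports.append(f"{subhost_name}: {result}")


@dataclass
class ConsoleServiceProcessor:
    """Hypothetical stand-in for console service processor 22."""
    subhost_name: str
    md_file: Callable[[], str]  # models the diagnostic program held in MD file 24
    host: Host

    def on_machine_check(self) -> str:
        diagnostic_program = self.md_file   # "load" the program from MD file 24
        result = diagnostic_program()       # run the diagnosis of the sub-host
        self.host.receive_diagnosis(self.subhost_name, result)
        return result


def sample_diagnostic() -> str:
    # Assumed result text; the source does not specify a result format.
    return "fault isolated to memory board"


host = Host()
csp = ConsoleServiceProcessor("subhost-A", sample_diagnostic, host)
csp.on_machine_check()
print(host.reports)
```

The point of the design is that the diagnosis runs at the sub-host with nobody on site, so the host already holds the sub-host name and the diagnosis result before anyone is dispatched.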

Next, operation when a program failure such as a program loop occurs during normal operation is described. In that case, the console service processor 22 detects the program loop or similar failure and then loads the memory dump program from the memory dump file 25. The loaded memory dump program writes the memory dump to the memory dump file 25. After detecting that the memory dump is complete, the console service processor 22 reloads the program from the program file 23 and attempts to restart automatic operation of the sub-host. If the system restart of the sub-host 2 succeeds, the memory dump contents are transferred from the memory dump file 25 to the host 1 over the communication line 11 between the host 1 and the sub-host 2, and the cause of the program failure is analyzed.

On the other hand, if the system restart of the sub-host 2 fails, the failure is detected by a health check from the host 1, and failure repair processing is started.
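The program-failure sequence (dump, re-IPL, then either dump transfer or fallback to the health check) can be sketched as below. The `SubHost` class, the `ipl_succeeds` flag, and the returned status strings are hypothetical names used only for illustration; main memory and the dump file are modeled as dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SubHost:
    """Hypothetical model of sub-host 2 and its attached files."""
    main_memory: Dict[str, str]
    ipl_succeeds: bool = True                    # assumed knob to simulate a failed restart
    dump_file: Optional[Dict[str, str]] = None   # models memory dump file 25
    running: bool = False

    def dump_memory(self) -> None:
        # Copy the main memory contents to the dump file before they are overwritten.
        self.dump_file = dict(self.main_memory)

    def ipl(self) -> bool:
        # Reload the program (program file 23 in the text) and report whether the restart worked.
        self.main_memory = {"program": "fresh copy"}
        self.running = self.ipl_succeeds
        return self.running


def handle_program_failure(subhost: SubHost,
                           send_to_host: Callable[[Dict[str, str]], None]) -> str:
    """Order of steps taken from the text: dump first, then re-IPL, then transfer."""
    subhost.running = False
    subhost.dump_memory()
    if subhost.ipl():
        # Restart succeeded: ship the dump to the host for failure analysis.
        send_to_host(subhost.dump_file)
        return "restarted"
    # Restart failed: nothing is sent; the host's health check will notice.
    return "awaiting health check"


received = []
looping = SubHost({"program": "looping", "data": "x"})
print(handle_program_failure(looping, received.append))
```

Taking the dump before the reload matters because the IPL overwrites main memory, which would destroy the evidence needed for later analysis.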

In a distributed processing system, the host sends diagnostic data to a sub-host and checks the sub-host's reply to that data, which lets the host detect a sub-host failure. This is the health check.
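A health check of this kind can be sketched as a simple request and reply comparison. The source does not say what the diagnostic data or the expected reply look like, so the `b"PING"` payload and the reversed-echo convention in `expected_reply` are assumptions made purely for the example, as are the function and sub-host names.

```python
from typing import Callable, Dict, List, Optional

# A sub-host link is modeled as a function taking diagnostic data and
# returning a reply, or None when the sub-host does not answer.
SubHostLink = Callable[[bytes], Optional[bytes]]


def expected_reply(diagnostic_data: bytes) -> bytes:
    # Assumed convention: a healthy sub-host returns the data reversed.
    return diagnostic_data[::-1]


def health_check(links: Dict[str, SubHostLink],
                 diagnostic_data: bytes = b"PING") -> List[str]:
    """Return the names of sub-hosts whose reply is missing or wrong."""
    failed = []
    for name, link in links.items():
        reply = link(diagnostic_data)
        if reply != expected_reply(diagnostic_data):
            failed.append(name)
    return failed


def healthy(data: bytes) -> Optional[bytes]:
    return data[::-1]


def stopped(data: bytes) -> Optional[bytes]:
    return None  # a halted sub-host never answers


print(health_check({"subhost-A": healthy, "subhost-B": stopped}))
```

Because the check is driven from the host, it still works when the sub-host is too damaged to report its own failure, which is the case the failed-restart path relies on.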

Next, operation when an I/O error, which is a failure of a peripheral device, occurs during normal operation is described.

When an I/O failure occurs during normal operation, the operating system in the sub-host 2 detects the I/O failure and then determines whether the sub-host can continue to operate. If the sub-host can operate, it sends the host 1 a message stating that an I/O failure has occurred and then continues normal operation. If the I/O failure makes operation of the sub-host 2 impossible, the failure is detected by a health check from the host 1, and failure repair processing is started.
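The I/O failure decision can be sketched as a two-way branch. The source does not state how the operating system decides whether the sub-host can keep running, so the `essential_devices` set used here is an assumed criterion, and the device names and `IoOutcome` enum are hypothetical.

```python
from enum import Enum
from typing import Callable, List, Set


class IoOutcome(Enum):
    CONTINUE = "continue normal operation"
    STOPPED = "stopped; left to the host health check"


def handle_io_failure(device: str,
                      essential_devices: Set[str],
                      notify_host: Callable[[str], None]) -> IoOutcome:
    """Decide what the sub-host OS does after detecting an I/O failure."""
    if device in essential_devices:
        # Assumed rule: losing an essential device makes operation impossible.
        # No message is sent; the host finds out through its health check.
        return IoOutcome.STOPPED
    # The sub-host can still run: tell the host, then carry on.
    notify_host(f"IO failure on {device}")
    return IoOutcome.CONTINUE


messages: List[str] = []
essential = {"system-disk"}
print(handle_io_failure("printer-1", essential, messages.append))
print(handle_io_failure("system-disk", essential, messages.append))
print(messages)
```

In both branches the host ends up informed, either by the explicit message or by the failed health check.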

Effects of the Invention

As is clear from the above description, the present invention provides the following effects.

(1) Even when a failure occurs in a sub-host at a remote site, the host can distinguish among processing device failures, program failures, and peripheral device failures, so the initial response of the dispatched maintenance personnel is well targeted.

(2) In the case of a processing device failure, the maintenance personnel bring the right maintenance parts, so maintenance efficiency and economy improve.

(3) In the case of a processing device failure, the repair time is shortened.

(4) In the case of a program failure, a memory dump is acquired automatically, so recovery time is shortened.

(5) In the case of a program failure, automatic IPL is executed, so for timing-related program failures and the like, operation can continue without a system stop, and system reliability improves.

(6) Even in case (5) above, the memory dump contents are sent to the host, so the cause of the program failure can be analyzed, which helps improve program quality.

(7) Even in the case of a peripheral device failure, if the sub-host can still operate, the failure is reported to the host, so appropriate maintenance is possible.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the distributed processing system to which the present invention applies, FIG. 2 is a block diagram of one embodiment of the present invention, and FIG. 3 is a flowchart explaining the operation of FIG. 2.

1: host; 2: sub-host; 3: terminal device; 11: communication line; 12: terminal line; 21: hardcore section; 22: console service processor; 23: program file; 24: diagnostic program file; 25: memory dump file.

Claims

[Claims]

(1) A failure handling method for a distributed processing system having a host processing device, a plurality of sub-host processing devices connected to the host processing device, and terminal devices connected to the sub-host processing devices, characterized in that each sub-host processing device is provided with a console service processor and a file memory storing a diagnostic program and a memory dump program, and in that, under the control of the console service processor: when a hardware failure occurs in the sub-host processing device, the diagnostic program is automatically executed and the result is reported to the host processing device; when a program failure occurs, a memory dump is automatically acquired and an initial program load is executed to recover the system; and when a terminal device failure occurs, the terminal device failure is reported to the host processing device.