JPH07121392A

JPH07121392A - Automatic fault information sampling system for restarting computer system

Info

Publication number: JPH07121392A
Application number: JP5270255A
Authority: JP
Inventors: Shinji Kitagawa; 伸司北河; Sadao Oki; 定雄大木; Mika Tamatoshi; 美香玉利
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-10-28
Filing date: 1993-10-28
Publication date: 1995-05-12

Abstract

PURPOSE: To make hardware fault information easy to sample when a computer system stops, and thereby to facilitate maintenance and operation of the computer system. CONSTITUTION: When a system stop occurs in a computer system 1, an information sampling system start processing part 6 starts the computer system 1 with a system configuration 10 dedicated to fault information sampling. A fault information sample processing part 7 then samples the information of a LAN controller 2, after which a regular system start processing part 8 restarts the computer system 1 with an on-line system configuration 11. This sequence is performed automatically.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to the collection of failure information when a system is restarted in an unmanned system such as a distributed computer system. More specifically, it relates to a method for automatically collecting failure information at computer system restart, which aims to shorten computer system downtime and minimize the resulting damage by automatically collecting failure information when the computer system stops due to an abnormality.

[0002]

[Prior Art] In an unmanned computer system such as a distributed computer system, when an abnormality occurs and the system stops, the contents of main memory are automatically collected and a re-IPL is performed to restart the system. A conventional startup method that performs this re-IPL automatically is, for example, the "computer system restart method" described in Japanese Patent Application Laid-Open No. 4-137023.

[0003]

[Problems to Be Solved by the Invention] In the prior art described above, when the computer system stops due to a failure and is automatically re-IPLed, only the contents of main memory are automatically collected into a failure information collection file provided on a magnetic disk device. This information is sometimes insufficient for investigating and analyzing the cause of the computer system stop. Specifically, when the cause of the stop lies in hardware such as a peripheral device (for example, a magnetic disk device) or a communication control device, hardware-specific failure information is not collected: self-diagnosis result information in the case of a magnetic disk device, or the contents of the device's control memory in the case of a communication control device. As a result, cause investigation and analysis take time and countermeasures are delayed.

[0004] This problem arises for the following reason. The function that collects failure information such as main memory contents at system stop is normally implemented as a stand-alone program independent of the OS (operating system), both to preserve the state at the time of the stop and because running under the OS is dangerous when an abnormality severe enough to stop the system has occurred. Collecting failure information from hardware such as peripheral devices, however, is complex processing that is difficult to implement in stand-alone form.

[0005] An object of the present invention is to make investigation and analysis of the cause of a system stop quick and easy by collecting, when the computer system stops due to a failure and is automatically re-IPLed, not only the contents of main memory but also failure information from the peripheral device or other hardware that caused the failure. A further object is to make failure information from peripheral devices and the like easy to collect.

[0006]

[Means for Solving the Problems] To achieve the above objects, a single computer system is provided with, in addition to the system configuration used for its normal operation such as online processing, a dedicated failure-information-collection system configuration. Unlike the system used for normal operation, the dedicated configuration does not use the peripheral devices and therefore preserves their state at the time of the failure.

[0007] In addition, the re-IPL that follows a system stop caused by a failure starts the system in the dedicated failure-information-collection system configuration, and the failure information collection processing is executed in that configuration.

[0008] After the failure information has been collected, the original system configuration is selected and the system is automatically restarted once more.

[0009] Furthermore, the peripheral device that caused the stop is detected at the time of the system stop and passed to the failure information collection processing.

[0010]

[Operation] The computer system is provided with a dedicated failure information collection system for collecting failure information; information collection system startup processing, which automatically starts this dedicated system when the system stops; and primary system startup processing, which automatically starts the system used for normal operation after the failure information has been collected. It is further provided with failure cause detection processing, which detects the hardware, such as a peripheral device, that caused the system stop and reports it to the dedicated failure information collection system.
When a system stop occurs, the main memory contents are first output to a dump file. The failure cause detection processing then detects the hardware that caused the failure and saves it as identification information. Next, the information collection system startup processing starts the dedicated failure information collection system and launches an information collection job. Following the hardware identification information saved by the failure cause detection processing, this job collects only the minimum necessary hardware information. Once collection is complete, the primary system startup processing automatically starts the system used for normal operation.
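The hand-off described in this paragraph, in which the failure cause detection processing saves an identifier before the re-IPL and the information collection job reads it afterwards, can be illustrated with a minimal Python sketch. The JSON file format, the function names, the device identifiers, and the collector table are assumptions made for illustration only; the publication does not specify how the identification information is stored or how each device is read.

```python
import json
import os
import tempfile


def record_failed_hardware(path, hardware_id):
    """Before the re-IPL: persist the identifier of the hardware that caused the stop."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"failed_hardware": hardware_id}, f)


def collect_recorded_failure_info(path, collectors):
    """After the IPL in the dedicated configuration: collect only for the recorded hardware."""
    with open(path, encoding="utf-8") as f:
        hardware_id = json.load(f)["failed_hardware"]
    collector = collectors.get(hardware_id)
    if collector is None:
        return {}  # nothing is known about this device, so nothing is collected
    return {hardware_id: collector()}


# Hypothetical per-device collectors; the returned strings are placeholders
# standing in for the hardware-specific information named in the description.
COLLECTORS = {
    "communication_control_device": lambda: "control memory contents",
    "magnetic_disk_device": lambda: "self-diagnosis result information",
}


def demo():
    """Simulate the save-then-read hand-off across a restart using a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        id_file = os.path.join(tmp, "failed_hardware.json")
        record_failed_hardware(id_file, "magnetic_disk_device")
        return collect_recorded_failure_info(id_file, COLLECTORS)
```

Because the identifier is written to persistent storage rather than kept in memory, it survives the re-IPL, and the collection job touches only the one device named in the file.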

[0011] As a result, the information needed for cause investigation and analysis is collected automatically when the system stops, which simplifies operation during system failures. Moreover, because the failure information is collected under the OS, the failure information collection function itself is also easy to implement.

[0012]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0013] FIG. 1 shows the basic configuration of a computer system for carrying out the present invention. In FIG. 1, computer system 1 includes a LAN (local area network) control device 2 and a CCP (communication control device) 3, which are used for online processing and the like. To carry out the invention, it is further provided with a main memory collection processing unit 4, a failure cause detection processing unit 5, an information collection system startup processing unit 6, a failure information collection processing unit 7, a primary system startup processing unit 8, a failed-hardware identification information file 9, a dedicated failure-information-collection system configuration 10, and an online system configuration 11. The failed-hardware identification information file 9 records information identifying the hardware that caused the computer system to stop.

[0014] An example in which computer system 1 of FIG. 1 stops because of a failure in LAN control device 2, the necessary information is collected, and online processing is then resumed is described with reference to the processing flow of FIG. 2.

[0015] In FIG. 2, the IPL that starts online operation (step 20) brings up computer system 1 in online system configuration 11, and the system runs online (step 21). If a system stop 22 then occurs because of a failure in LAN control device 2, main memory collection processing unit 4 collects the main memory contents (step 23). Failure cause detection processing unit 5 then detects that the hardware responsible is LAN control device 2 and records this in failed-hardware identification information file 9 (step 24), which is then in the state shown at 30. Next, information collection system startup processing unit 6 initiates a re-IPL (step 25), and computer system 1 is IPLed in dedicated failure-information-collection system configuration 10 (step 26). Failure information collection processing unit 7 refers to failed-hardware identification information file 9, selects the hardware whose failure information is to be collected (step 27), and collects the failure information held by LAN control device 2 (step 28). Primary system startup processing unit 8 then initiates a re-IPL (step 29); computer system 1 is IPLed again in online system configuration 11 (step 20) and returns to online operation (step 21). In this way, when the system stops, information on hardware such as peripheral devices is collected automatically in addition to main memory, and online operation is then also restarted automatically.
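The order of steps in FIG. 2 can be traced with a short Python sketch. It only models the sequencing of steps 23 through 29 and the return to steps 20 and 21; the function name, the dictionary standing in for identification file 9, and the action strings are illustrative assumptions, not part of the publication.

```python
ONLINE_CONFIG = "online system configuration 11"
COLLECTION_CONFIG = "dedicated failure-information-collection system configuration 10"


def run_failure_cycle(failed_device):
    """Return the ordered (step number, action) trace for one stop-and-recover cycle."""
    trace = []
    identification_file = {}  # stands in for failed-hardware identification information file 9

    # Handling of the system stop, before the first re-IPL.
    trace.append((23, "collect main memory contents"))
    identification_file["failed_hardware"] = failed_device
    trace.append((24, f"record {failed_device} in identification file"))
    trace.append((25, "initiate re-IPL"))

    # First restart: dedicated collection configuration.
    trace.append((26, f"IPL in {COLLECTION_CONFIG}"))
    target = identification_file["failed_hardware"]
    trace.append((27, f"select {target} for collection"))
    trace.append((28, f"collect failure information held by {target}"))
    trace.append((29, "initiate re-IPL"))

    # Second restart: back to normal online operation.
    trace.append((20, f"IPL in {ONLINE_CONFIG}"))
    trace.append((21, "online operation"))
    return trace


if __name__ == "__main__":
    for step, action in run_failure_cycle("LAN control device 2"):
        print(f"step {step}: {action}")
```

The trace makes the two consecutive IPLs explicit: one in the collection configuration, so that the device state is read under an OS, and one in the online configuration to resume service.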

[0016] In the above embodiment, if the cause of the system stop is unrelated to hardware, the online system may instead be started directly without starting the dedicated failure information collection system, which shortens the online downtime.
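The choice described in this paragraph amounts to a small decision rule, sketched here in Python. The enum, the function name, and the use of `None` to mean "cause unrelated to hardware" are assumptions for illustration; the enum values simply reuse reference numerals 10 and 11 from FIG. 1.

```python
from enum import Enum


class BootConfiguration(Enum):
    COLLECTION_ONLY = 10  # dedicated failure-information-collection system configuration
    ONLINE = 11           # online system configuration


def next_boot_configurations(failed_hardware):
    """Return the configurations to IPL, in order, after a system stop.

    failed_hardware is None when the stop is unrelated to hardware (an assumed convention).
    """
    if failed_hardware is None:
        # No device information to collect: go straight back online.
        return [BootConfiguration.ONLINE]
    # Collect from the failed device first, then resume online operation.
    return [BootConfiguration.COLLECTION_ONLY, BootConfiguration.ONLINE]
```

Skipping the collection configuration saves one full IPL, which is where the reduction in online downtime comes from.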

[0017]

[Effects of the Invention] According to the present invention, automatically switching the configuration of the computer system makes it possible to collect automatically, even when the computer system has stopped due to an abnormality, failure information from hardware such as peripheral devices that can normally be collected only under the OS. This makes it easier to investigate and analyze the cause of the computer system stop and, as a result, easier to maintain and operate the computer system.

[Brief description of drawings]

[FIG. 1] Basic configuration of a computer system that implements an embodiment of the present invention.

[FIG. 2] Flowchart showing, as an embodiment of the present invention, an example of the processing performed when the system stops due to a hardware failure during online system operation, the necessary hardware information is collected, and the online system is brought back into operation.

[Explanation of symbols]

1 ... computer system, 2 ... LAN (local area network) control device, 3 ... CCP (communication control device), 4 ... main memory collection processing unit, 5 ... failure cause detection processing unit, 6 ... information collection system startup processing unit, 7 ... failure information collection processing unit, 8 ... primary system startup processing unit, 9 ... failed-hardware identification information file, 10 ... dedicated failure-information-collection system configuration, 11 ... online system configuration.

Claims

[Claims]

1. A method for automatically collecting failure information at computer system restart, in which, when an abnormality occurs in an operating computer system and the computer system stops, a re-IPL is automatically performed to restart the computer system, the method being characterized by providing: means for detecting, when the computer system is restarted, the part that caused the stop (the central processing unit, or a peripheral device such as a disk or a communication control device); and a dedicated information collection system that automatically collects failure information only for the part that caused the stop.