JP3334174B2

JP3334174B2 - Fault handling verification device

Info

Publication number: JP3334174B2
Application number: JP22766192A
Authority: JP
Inventors: 由美高橋; 真次宮原
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-08-27
Filing date: 1992-08-27
Publication date: 2002-10-15
Anticipated expiration: 2017-10-15
Also published as: JPH0675807A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a device for verifying the fault handling function of a multiprocessing system, and more particularly to a device for verifying the fault handling function of the entire system, including the operating system.

[0002] In recent years, computer systems have come to be used in every field of society, and ever greater processing power and reliability are demanded of them. To increase processing power, multiprocessing systems in which central processing units, memory, and other components are multiplexed have become widespread. Multiprocessing systems that emphasize fault tolerance are widely used: when part of such a system fails, the failed device is disconnected so that the system can keep operating, even at reduced performance. Each device constituting a computer system has a function for detecting faults such as errors and failures, and the operating system (OS) has a function for logging fault information so that it can identify the location or device at fault and take action such as disconnecting that device from the system. Accordingly, there is a need for a method that can verify, accurately, with little manpower, and in a short time, the fault handling function of the entire multiprocessing system including the OS, covering both fault detection and the collection of fault information.

[0003]

[Prior Art] Conventionally, the fault detection function of a system was inspected using an external maintenance support device provided for each processing unit (CPU) of the computer system. An external maintenance support device, for example a service processor (SVP), contains an independent processor separate from the CPU and can set hardware such as registers in each device of the system to a prescribed state and read that state back. The device deliberately set a fault state in each device in the computer system, including the CPUs, and then collected and dumped the state information that the device presented as a result, independently of the operation of the OS. A technician analyzed the collected information to check whether the fault detection function of the system as a whole worked correctly.

[0004]

[Problems to Be Solved by the Invention] As described above, the conventional method used an external maintenance support device to generate a pseudo fault in a device in the system and to collect fault information, and the fault detection function was then confirmed manually. The method therefore had the following problems: (1) an external maintenance support device is required to generate the pseudo fault and collect the information, which raises the price of the system; (2) the fault detection function of the system as a whole, including the OS, cannot be verified, so the inspection is not realistic; and (3) the collected fault information is inspected manually, which takes labor and time.

[0005] An object of the present invention is to provide a fault handling verification device that can inspect the fault handling function of a multiprocessing system under realistic conditions, in a short time, and at low cost.

[0006]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the present invention. In the figure, 21 is an operating system that logs device fault information; 3 is disconnecting means that separates the group of devices constituting the system into one specific processing device 1 and the remaining other devices 2; 4 is pseudo fault generating means that generates a specific pseudo fault 41 in the other devices 2; 5 is expected value creating means that creates an expected value of the fault information that the operating system 21 should log when the specific pseudo fault 41 occurs in the other devices 2; and 6 is collating means that collates the fault information logged by the operating system 21 when the pseudo fault generating means 4 generated the specific pseudo fault 41 against the expected value created by the expected value creating means 5.

[0007]

[Operation] The present invention concerns fault handling verification for a multiprocessing system composed of a plurality of processing devices, each having a function of detecting a device fault and logging the fault information through the operating system 21, and a shared device shared by the plurality of processing devices. The disconnecting means 3 separates the group of devices constituting the system into one specific processing device 1 and the remaining other devices 2. The pseudo fault generating means 4 generates a specific pseudo fault 41 in the other devices 2. The expected value creating means 5 creates an expected value of the fault information that the operating system 21 should log when the specific pseudo fault 41 occurs in the other devices 2. The collating means 6 collates the fault information logged by the operating system 21 when the pseudo fault generating means 4 generated the specific pseudo fault 41 against the expected value created by the expected value creating means 5. The fault handling of the multiprocessing system can therefore be verified on the basis of the collation result produced by the collating means 6.

[0008]

[Embodiment] FIG. 2 is a system configuration diagram showing an embodiment of the present invention, FIG. 3 shows a system configuration definition table, FIG. 4 shows an access condition table, FIG. 5 shows a fault setting table, and FIG. 6 shows an OS logging table.

[0009] Throughout the drawings, the same reference numerals denote the same or similar components. In FIG. 2, CPUs 1 to n are n processing devices, each provided with a main memory, an input/output control device, and the like, and they communicate with one another by sharing the shared memories SSM01 and SSM02. CPUs 1 to n also share a plurality of input/output devices, represented here by the adapter device AD. CPUs 1 to n are connected to the shared memories SSM01 and SSM02 via duplicated buses #0 and #1. The bus handler devices BH10 and BH11 control the connection so that, when one bus is in use or has failed, the other bus is used to connect CPUs 1 to n with SSM01 and SSM02 for communication. Similarly, CPUs 1 to n are connected to the adapter device AD via duplicated buses #0 and #1, and the bus handler devices BH20 and BH21 control the connection so that, when one bus is in use or has failed, the other bus is used to connect CPUs 1 to n with the adapter device AD for communication.
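For illustration only, the bus selection rule described for the bus handler devices can be sketched as follows. This is a minimal sketch under stated assumptions: the names `BusState` and `select_bus`, and the choice to return `None` when neither bus is available, are hypothetical and do not appear in the specification.

```python
from enum import Enum


class BusState(Enum):
    """Hypothetical states of one of the duplicated buses #0 and #1."""
    IDLE = "idle"
    BUSY = "busy"      # bus is in use
    FAILED = "failed"  # bus has a fault


def select_bus(states):
    """Return the index of the first usable bus, or None if none is usable.

    A bus that is in use or has failed is skipped, so the other bus
    is chosen instead, as the description states.
    """
    for index, state in enumerate(states):
        if state is BusState.IDLE:
            return index
    return None
```

With bus #0 failed and bus #1 idle, `select_bus([BusState.FAILED, BusState.IDLE])` selects bus #1. How an actual bus handler behaves when both buses are unavailable is not stated in the specification.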

[0010] Each of CPUs 1 to n has a system configuration definition table (see FIG. 3). By setting I (installed), NA (access prohibited), or NI (not installed) for each of CPUs 1 to n, and thereby specifying whether that CPU may be used as a system resource, a connection can be logically severed even though it exists physically, and the system configuration can be set arbitrarily for each of CPUs 1 to n. Accordingly, by setting the system configuration definition tables as shown in FIG. 3, CPU 1 can be logically disconnected from the other CPUs 2 to n.
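For illustration only, the logical disconnection achieved through the system configuration definition tables can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function names `build_config_tables` and `can_access`, the dictionary layout, and the entry each CPU holds for itself are assumptions, because FIG. 3 is not reproduced here. The NA and NI settings follow the description: the table of CPU 1 marks CPUs 2 to n as NA, and the tables of CPUs 2 to n mark CPU 1 as NI.

```python
# Table entry values named in the description.
I = "I"    # installed
NA = "NA"  # access prohibited
NI = "NI"  # not installed


def build_config_tables(n, verifier=1):
    """Build one table per CPU; tables[owner][target] is the entry
    that CPU `owner` holds for CPU `target`."""
    cpus = range(1, n + 1)
    tables = {}
    for owner in cpus:
        row = {}
        for target in cpus:
            if owner == verifier:
                # Verification CPU: every other CPU is access-prohibited.
                row[target] = I if target == verifier else NA
            else:
                # CPUs under verification: the verification CPU looks absent.
                row[target] = NI if target == verifier else I
        tables[owner] = row
    return tables


def can_access(tables, owner, target):
    """A CPU may use another CPU as a resource only if it is marked I."""
    return tables[owner][target] == I
```

For a four-CPU system, `build_config_tables(4)` yields a table for CPU 1 in which CPUs 2 to 4 are NA, while CPUs 2 to 4 each see CPU 1 as NI and one another as I, so the two groups are separated in both directions.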

[0011] In addition to ordinary instructions, each of CPUs 1 to n has diagnostic instructions that execute built-in diagnostic functions. For example, by issuing a diagnostic instruction to another CPU, to the shared memory SSM01 or SSM02, to the adapter device AD, or to a similar device, a CPU can stop all or part of the functions of that device.

[0012] In the system configured in this way, CPUs 2 to n perform normal operation under the OS, and CPU 1 runs a test program. The test program sequentially issues diagnostic instructions that generate pseudo faults to the other devices, such as CPUs 2 to n, the shared memories SSM01 and SSM02, and the adapter device AD. For each pseudo fault caused by a diagnostic instruction, the fault information actually logged by the OS is compared with an expected value obtained in advance, thereby verifying whether the hardware detects the fault correctly and whether the OS logs the fault information correctly.

[0013] FIG. 7 is a flowchart of the embodiment of the present invention. The operation of the embodiment is described below with reference to FIG. 7 and FIG. 2, for the case where the device in which the pseudo fault is generated is the shared memory SSM01.

(1) One CPU in the system, for example CPU 1, is disconnected from the other devices (CPUs 2 to n) and started up as a verification system that runs a test program for generating pseudo faults. That is, at initial program load, NA (access prohibited) is set for CPU 2 to CPU n of the system under verification (which operate under the OS) in the system configuration definition table held by CPU 1, as shown in FIG. 3, so that access from CPU 1 to CPU 2 through CPU n is prohibited. Thus, although CPU 1 and CPU 2 to CPU n are physically connected, they are logically disconnected, and the verification system CPU 1 is separated from the system under verification, CPU 2 to CPU n, which operates under the OS.

[0014] Next, to disconnect CPU 2 to CPU n, which operate under the OS, from the verification system CPU 1, NI (not installed) is set for CPU 1 in the system configuration definition tables of those CPUs, as shown in FIG. 3, so that access from CPU 2 through CPU n to CPU 1 is prohibited. Thus, although physically connected, CPU 1 of the verification system logically appears not to be installed when seen from CPU 2 to CPU n of the system under verification.

(2) The test program is started on CPU 1 of the verification system.

(3) The OS is started on CPU 2 to CPU n of the system under verification.

(4) To ensure that the pseudo fault is actually triggered, the operator prepares an access condition table (see FIG. 4) in which information including the name, time, and address of each device accessed by the OS is entered in chronological order from the time the OS is started.

(5) Based on the access condition table created in step (4), the test program issues a diagnostic instruction to the shared memory device SSM01 under test and sets a pseudo fault (HALT) that, for example, instructs operation to stop at address 1000 for 200 ms.

(6) SSM01, the name of the device in which the pseudo fault was set in step (5); HALT, the content of the fault that was set; and 23:59:59.100, the time at which it was set, are stored on the disk device DK in the format of the fault setting table (see FIG. 5).

(7) The OS accesses SSM01, the shared memory device under test, and a HALT (fault) state occurs.

(8) The OS stores the fault information (SSM01 entered the HALT state at 23:59:59.120) on the disk device DK in the format of the OS logging table (see FIG. 6).

(9) The fault content that the test program set and stored on the disk device DK in step (6) (SSM01 halts between 23:59:59.100 and 23:59:59.300) is collated with the fault information that the OS logged to the disk device DK in step (8) (SSM01 entered the HALT state at 23:59:59.120). If the collation result is valid, both the fault detection function of the hardware and the logging function of the OS are verified.
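For illustration only, the collation of step (9) can be sketched as follows. This is a minimal sketch under stated assumptions: the class names `FaultSetting` and `LogEntry`, the function name `collate`, the field names, and the calendar date are hypothetical, and the actual layouts of the fault setting table (FIG. 5) and the OS logging table (FIG. 6) are not reproduced here. The sketch treats the expected value as the window from the set time to the set time plus the HALT duration, and accepts an OS log entry whose device, fault type, and time fall within that window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FaultSetting:
    """One row of a hypothetical fault setting table (step 6)."""
    device: str
    fault: str
    set_at: datetime
    duration: timedelta


@dataclass(frozen=True)
class LogEntry:
    """One row of a hypothetical OS logging table (step 8)."""
    device: str
    fault: str
    logged_at: datetime


def collate(setting, entry):
    """Return True if the logged fault matches the expected value.

    The device and fault type must agree, and the logged time must
    fall within the period during which the pseudo fault was active.
    """
    window_end = setting.set_at + setting.duration
    return (
        entry.device == setting.device
        and entry.fault == setting.fault
        and setting.set_at <= entry.logged_at <= window_end
    )


# Values from steps (5) to (9); the date itself is arbitrary.
setting = FaultSetting("SSM01", "HALT",
                       datetime(1992, 8, 27, 23, 59, 59, 100000),
                       timedelta(milliseconds=200))
entry = LogEntry("SSM01", "HALT",
                 datetime(1992, 8, 27, 23, 59, 59, 120000))
result = collate(setting, entry)
```

With the values given in the description, the logged time 23:59:59.120 lies inside the window 23:59:59.100 to 23:59:59.300, so `result` is `True`. Whether the boundary instants count as matches is an assumption of this sketch.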

[0015]

[Effects of the Invention] As described above, according to the present invention, the group of devices constituting a multiprocessing system is separated into one specific processing device (the verification system) and the remaining other devices (the system under verification); an expected value is created for the fault information that the operating system should log when a pseudo fault occurs in the system under verification; the verification system generates the pseudo fault in the system under verification; and the fault handling of the multiprocessing system is verified by collating the fault information actually logged by the operating system with the expected value. Because a single CPU verifies the fault handling function of the other devices in place of the external maintenance support device used conventionally, the fault handling of the system can be verified at low cost. In addition, because the fault information inspected is that processed jointly by the hardware and the OS, the fault handling function of the system can be verified under conditions close to actual operation.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the principle of the present invention.

FIG. 2 is a system configuration diagram showing an embodiment of the present invention.

FIG. 3 shows a system configuration definition table.

FIG. 4 shows an access condition table.

FIG. 5 shows a fault setting table.

FIG. 6 shows an OS logging table.

FIG. 7 is a flowchart of the embodiment of the present invention.

[Explanation of symbols]

CPU 1 to n: processing devices
SSM01, SSM02: shared memory devices
BH10, BH11, BH20, BH21: bus handler devices
AD: adapter device
DK: disk device

Continuation of the front page

(58) Fields surveyed (Int. Cl.⁷, DB name): G06F 11/22 - 11/26; JICST file (JOIS)

Claims

(57) [Claims]

1. A fault handling verification device for a multiprocessing system comprising a plurality of processing devices, each having a function of detecting a device fault and logging the fault information by an operating system, and a shared device shared by the plurality of processing devices, the fault handling verification device comprising disconnecting means for separating the devices of the multiprocessing system into one specific processing device and the remaining other devices, wherein the specific processing device is provided with: pseudo fault generating means for generating a pseudo fault in the other devices; expected value creating means for creating an expected value of the fault information to be logged by the operating system when the pseudo fault occurs in the other devices; and collating means for collating the fault information logged by the operating system when the pseudo fault is generated by the pseudo fault generating means with the expected value created by the expected value creating means, and wherein the fault handling of the multiprocessing system is verified based on the result of the collation by the collating means.

2. The fault handling verification device according to claim 1, wherein the operating system logs fault information including the time of occurrence of the pseudo fault, and the expected value creating means creates an expected value of fault information including that time.