JPH0675807A

JPH0675807A - Fault processing verifying system

Info

Publication number: JPH0675807A
Application number: JP4227661A
Authority: JP
Inventors: Yumi Takahashi; 由美高橋; Shinji Miyahara; 真次宮原
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-08-27
Filing date: 1992-08-27
Publication date: 1994-03-18
Anticipated expiration: 2017-10-15
Also published as: JP3334174B2

Abstract

PURPOSE: To perform inspection in a short time and at low cost by collating the fault information actually logged against generated expected fault information, thereby verifying fault processing. CONSTITUTION: Disconnecting means 3 separates the group of devices constituting the system into one specific processing device 1 and the remaining other devices 2. Pseudo-fault generating means 4 generates a specific pseudo fault 41 in the other devices 2, and expected value creating means 5 creates the expected value of the fault information that the operating system 21 should log when the specific pseudo fault 41 occurs in the other devices 2. When the pseudo-fault generating means 4 generates the specific pseudo fault 41, collating means 6 collates the fault information logged by the operating system 21 against the expected value created by the expected value creating means 5, so that fault processing can be verified from the collation result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for verifying the fault processing function of a multiprocessing system, and more particularly to a test method for verifying the fault processing function of the entire system, including the operating system.

[0002] In recent years, computer systems have come to be used in every field of society, and ever greater processing capacity and reliability are demanded of them. To increase processing capacity, multiprocessing systems in which central processing units, memories, and the like are multiplexed have become widespread. Fault-tolerant multiprocessing systems are widely used so that, when part of the system fails, the failed device can be disconnected and the system can keep running, even at reduced performance. Each device constituting a computer system has a function for detecting faults such as errors and failures, and the operating system (OS) has a function for logging fault information in order to identify the location or device at fault and to perform processing such as disconnecting that device from the system. Accordingly, there is a demand for a method that can verify, accurately, with little manpower, and in a short time, the fault processing function of the entire system including the OS, covering both fault detection and fault information collection.

[0003]

[Prior Art] Conventionally, the fault detection function of a system was inspected using an external maintenance support device provided for each processing device (CPU) of the computer system. An external maintenance support device, for example a service processor (SVP), contains a processor that is separate from and independent of the CPU, and has functions for setting hardware such as registers in each device constituting the system to a predetermined state and for reading that state. A fault state was deliberately set in each device of the computer system, including the CPUs; the state information that the device then presented was collected and dumped independently of the operation of the OS; and an engineer analyzed the collected information to check whether the fault detection function worked correctly for the system as a whole.

[0004]

[Problems to Be Solved by the Invention] As described above, the conventional method uses an external maintenance support device to generate pseudo faults in devices in the system, collects the fault information, and confirms the fault detection function manually. It therefore has the following problems: (1) an external maintenance support device is required to generate the pseudo faults and collect the information, which makes the system expensive; (2) the fault detection function of the system as a whole, including the OS, cannot be verified, so the test is not realistic; and (3) the collected fault information is inspected manually, which takes labor and time.

[0005] An object of the present invention is to provide a fault processing verification method that can inspect the fault processing function of a multiprocessing system under realistic conditions, in a short time, and at low cost.

[0006]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the present invention. In the figure, 21 is an operating system that logs device fault information; 3 is disconnecting means that separates the group of devices constituting the system into one specific processing device 1 and the remaining other devices 2; 4 is pseudo-fault generating means that generates a specific pseudo fault 41 in the other devices 2; 5 is expected value creating means that creates the expected value of the fault information that the operating system 21 should log when the specific pseudo fault 41 occurs in the other devices 2; and 6 is collating means that collates the fault information logged by the operating system 21 when the pseudo-fault generating means 4 generates the specific pseudo fault 41 against the expected value created by the expected value creating means 5.

[0007]

[Operation] The present invention applies to a fault processing verification method for a multiprocessing system made up of a plurality of processing devices, each having a function of detecting device faults and logging the fault information through the operating system 21, and a shared device shared by those processing devices. The disconnecting means 3 separates the group of devices constituting the system into one specific processing device 1 and the remaining other devices 2. The pseudo-fault generating means 4 generates a specific pseudo fault 41 in the other devices 2. The expected value creating means 5 creates the expected value of the fault information that the operating system 21 should log when the specific pseudo fault 41 occurs in the other devices 2. The collating means 6 collates the fault information logged by the operating system 21 when the pseudo-fault generating means 4 generates the specific pseudo fault 41 against the expected value created by the expected value creating means 5. The fault processing of the multiprocessing system can therefore be verified from the collation result of the collating means 6.

[0008]

[Embodiment] FIG. 2 is a system configuration diagram showing an embodiment of the present invention, FIG. 3 shows the system configuration definition table, FIG. 4 the access condition table, FIG. 5 the fault setting table, and FIG. 6 the OS logging table.

[0009] Throughout the drawings, the same reference numerals denote the same or similar components. In FIG. 2, CPU1 to CPUn are n processing devices, each provided with a main memory, an input/output control device, and the like, and they communicate with one another by sharing the shared memories SSM01 and SSM02. CPU1 to CPUn also share a plurality of input/output devices, represented by the adapter device AD. CPU1 to CPUn are connected to the shared memories SSM01 and SSM02 via duplicated buses #0 and #1, and the bus handler devices BH10 and BH11 perform control so that, when one bus is in use or has failed, the other bus is used to connect CPU1 to CPUn with SSM01 and SSM02 for communication. Similarly, CPU1 to CPUn are connected to the adapter device AD via duplicated buses #0 and #1, and the bus handler devices BH20 and BH21 perform control so that, when one bus is in use or has failed, the other bus is used to connect CPU1 to CPUn with the adapter device AD for communication.
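The bus handler behavior described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `select_bus`, the state strings, and the preference for the lowest-numbered idle bus are hypothetical and do not come from the specification, which says only that the other bus is used when one is in use or has failed.

```python
def select_bus(bus_states):
    """Pick a usable bus from a mapping of bus number to state.

    bus_states: dict such as {0: "idle", 1: "busy"}; the state names
    "idle", "busy", and "failed" are assumptions made for this sketch.
    Returns the lowest-numbered idle bus, or None if no bus is usable.
    """
    for bus in sorted(bus_states):
        if bus_states[bus] == "idle":
            return bus
    return None


# Bus #0 is occupied, so traffic is routed over bus #1.
print(select_bus({0: "busy", 1: "idle"}))
```

A real bus handler would arbitrate in hardware; the sketch only captures the fallback rule.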

[0010] Each of CPU1 to CPUn has a system configuration definition table (see FIG. 3). For each of CPU1 to CPUn, the table holds one of I (implemented), NA (access prohibited), or NI (not implemented), which specifies whether that device may be used as a system resource. A device that is physically connected can thereby be logically disconnected, and the system configuration can be set arbitrarily for each of CPU1 to CPUn. Accordingly, by setting the system configuration definition tables as shown in FIG. 3, CPU1 can be logically separated from the other CPUs, CPU2 to CPUn.
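As a rough illustration of how such tables could separate CPU1 from the rest, the sketch below builds one table per CPU. The settings follow the description of FIG. 3 given later in the text (CPU1 marks CPU2 to CPUn as NA, and CPU2 to CPUn mark CPU1 as NI); the dictionary layout and the names `Status`, `build_tables`, and `can_use` are assumptions made for this example, not the format actually used by the system.

```python
from enum import Enum


class Status(Enum):
    I = "implemented"         # usable as a system resource
    NA = "access prohibited"  # physically present, but must not be accessed
    NI = "not implemented"    # treated as if it were not installed


def build_tables(n, verifier=1):
    """Return one configuration table per CPU, keyed by CPU number."""
    tables = {}
    for owner in range(1, n + 1):
        table = {}
        for target in range(1, n + 1):
            if owner == verifier and target != verifier:
                table[target] = Status.NA   # verifier may not access the others
            elif owner != verifier and target == verifier:
                table[target] = Status.NI   # verifier looks absent to the others
            else:
                table[target] = Status.I
        tables[owner] = table
    return tables


def can_use(tables, owner, target):
    """True if `owner` may use `target` as a system resource."""
    return tables[owner][target] is Status.I


tables = build_tables(4)
print(can_use(tables, 1, 2), can_use(tables, 2, 3), tables[2][1].name)
```

The point of the sketch is that the separation is purely a matter of table contents: no physical connection changes.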

[0011] In addition to ordinary instructions, each of CPU1 to CPUn has diagnostic instructions that execute built-in diagnostic functions. For example, issuing a diagnostic instruction to another CPU, the shared memory SSM01 or SSM02, the adapter device AD, or the like stops all or part of that device's functions.

[0012] In the system configured in this way, CPU2 to CPUn perform normal operation under the OS, while CPU1 runs a test program. The test program sequentially issues diagnostic instructions that generate pseudo faults in the other devices, such as CPU2 to CPUn, the shared memories SSM01 and SSM02, and the adapter device AD. For each pseudo fault caused by a diagnostic instruction, the fault information actually logged by the OS is compared with an expected value obtained in advance, which verifies whether the hardware detects the fault correctly and whether the OS logs the fault information correctly.
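The overall loop of the test program can be sketched as follows. This is a minimal simulation, not the test program of the embodiment: `run_verification`, `inject_fault`, and the in-memory `simulated_log` are hypothetical stand-ins for the diagnostic instruction and the OS log, and the deliberately missing log entry for AD is invented only to show a failing case.

```python
def run_verification(targets, inject_fault, read_os_log, expected):
    """Inject one pseudo fault per target and compare the OS log with the expectation.

    Returns a dict mapping each device name to True (log matches) or False.
    """
    results = {}
    for device in targets:
        inject_fault(device)              # stands in for the diagnostic instruction
        logged = read_os_log(device)      # what the OS actually logged
        results[device] = (logged == expected[device])
    return results


# Simulated system under verification (hypothetical): the OS logs HALT for
# every device except AD, which models a logging defect the test should catch.
simulated_log = {}


def inject_fault(device):
    simulated_log[device] = None if device == "AD" else "HALT"


targets = ["CPU2", "SSM01", "SSM02", "AD"]
expected = {device: "HALT" for device in targets}
results = run_verification(targets, inject_fault, simulated_log.get, expected)
print(results)
```

Passing the injection and log-reading steps in as functions keeps the comparison logic independent of how faults are actually raised.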

[0013] FIG. 7 is a flowchart of the embodiment of the present invention. The operation of the embodiment is described below with reference to FIG. 7 and FIG. 2, for the case where the target device in which the pseudo fault is generated is the shared memory SSM01.

(1) One CPU in the system, for example CPU1, is separated from the other devices (CPU2 to CPUn) and started up as a verification system that runs the test program for generating pseudo faults. Specifically, at initial program load, as shown in FIG. 3, NA (access prohibited) is set for CPU2 to CPUn of the system under verification (which operate under the OS) in the system configuration definition table held by CPU1, which prohibits CPU1 from accessing CPU2 to CPUn. Thus, although CPU1 and CPU2 to CPUn are physically connected, they are logically separated, and the verification system CPU1 is detached from the system under verification, CPU2 to CPUn, which operates under the OS.

[0014] Next, to separate CPU2 to CPUn, which operate under the OS, from the verification system CPU1, NI (not implemented) is set for CPU1 in the system configuration definition tables of those CPUs, as shown in FIG. 3, which prohibits access from CPU2 to CPUn to CPU1. Thus, although physically connected, CPU1 of the verification system logically appears not to be installed when seen from CPU2 to CPUn of the system under verification.

(2) The test program is started on CPU1 of the verification system.

(3) The OS is started on CPU2 to CPUn of the system under verification.

(4) To ensure that the pseudo fault is reliably triggered, the operator prepares an access condition table (see FIG. 4) containing, in chronological order from the time the OS is started, information that includes the name of each device the OS accesses, the time of the access, and the address.

(5) Based on the access condition table prepared in (4), the test program issues a diagnostic instruction to the shared memory device SSM01 under test and sets a pseudo fault (HALT) that, for example, instructs it to stop operating at address 1000 for 200 ms.

(6) The test program stores on the disk device DK, in the format of the fault setting table (see FIG. 5), the device name SSM01 for which the pseudo fault was set in (5), the fault content HALT, and the time it was set, 23:59:59.100.

(7) The OS accesses SSM01, the shared memory device under test, and the HALT (fault) state occurs.

(8) The OS stores the fault information (SSM01 entered the HALT state at 23:59:59.120) on the disk device DK in the format of the OS logging table (see FIG. 6).

(9) The fault content set by the test program and stored on the disk device DK in (6) (SSM01 halts between 23:59:59.100 and 23:59:59.300) is collated with the fault information logged to the disk device DK by the OS in (8) (SSM01 entered the HALT state at 23:59:59.120). If the collation result is consistent, both the hardware fault detection function and the OS logging function are verified.
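The collation in step (9) amounts to checking that the logged device and fault match what was set and that the logged time falls inside the window during which the pseudo fault was active. The sketch below uses the values from the example (SSM01, HALT, set at 23:59:59.100 for 200 ms, logged at 23:59:59.120). The record layouts `FaultSetting` and `LogEntry`, the function `collate`, and the calendar date are assumptions; the actual formats of FIG. 5 and FIG. 6 are not reproduced in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FaultSetting:          # one row of the fault setting table (assumed layout)
    device: str
    fault: str
    set_at: datetime
    duration: timedelta


@dataclass
class LogEntry:              # one row of the OS logging table (assumed layout)
    device: str
    fault: str
    logged_at: datetime


def collate(setting, entry):
    """True if the OS log entry is consistent with the pseudo fault that was set."""
    same_fault = (setting.device, setting.fault) == (entry.device, entry.fault)
    window_end = setting.set_at + setting.duration
    in_window = setting.set_at <= entry.logged_at <= window_end
    return same_fault and in_window


# The date is an arbitrary placeholder; only the time of day comes from the text.
setting = FaultSetting("SSM01", "HALT",
                       datetime(2000, 1, 1, 23, 59, 59, 100_000),
                       timedelta(milliseconds=200))
entry = LogEntry("SSM01", "HALT", datetime(2000, 1, 1, 23, 59, 59, 120_000))
late_entry = LogEntry("SSM01", "HALT", datetime(2000, 1, 1, 23, 59, 59, 350_000))

print(collate(setting, entry), collate(setting, late_entry))
```

Treating the expected value as a time window rather than a single instant allows for the delay between setting the fault and the OS access that triggers it.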

[0015]

[Effects of the Invention] As described above, according to the present invention, the group of devices constituting the multiprocessing system is separated into one specific processing device (the verification system) and the remaining other devices (the system under verification); the expected value of the fault information that the operating system should log when a pseudo fault occurs in the system under verification is created; the verification system generates the pseudo fault in the system under verification; and the fault information actually logged by the operating system is collated with the expected fault information, thereby verifying the fault processing of the multiprocessing system. Because a single CPU verifies the fault processing function of the other devices in place of the external maintenance support device used conventionally, fault processing can be verified at low cost. In addition, because the fault information that is inspected has been processed jointly by the hardware and the OS, the fault processing function of the system can be verified under conditions close to actual operation.

[Brief description of drawings]

FIG. 1 is a block diagram showing the principle of the present invention.

FIG. 2 is a system configuration diagram showing an embodiment of the present invention.

FIG. 3 shows the system configuration definition table.

FIG. 4 shows the access condition table.

FIG. 5 shows the fault setting table.

FIG. 6 shows the OS logging table.

FIG. 7 is a flowchart of the embodiment of the present invention.

[Explanation of symbols]

CPU1 to CPUn: processing devices
SSM01, SSM02: shared memory devices
BH10, BH11, BH20, BH21: bus handler devices
AD: adapter device
DK: disk device

Claims

1. A fault processing verification method for a multiprocessing system comprising a plurality of processing devices, each having a function of detecting a device fault and logging the fault information by an operating system (21), and a shared device shared by the plurality of processing devices, the method being characterized in that: disconnecting means (3) is provided for separating the group of devices constituting the system into one specific processing device (1) and the remaining other devices (2); the specific processing device (1) is provided with pseudo-fault generating means (4) for generating a specific pseudo fault (41) in the other devices (2), expected value creating means (5) for creating the expected value of the fault information that should be logged by the operating system (21) when the specific pseudo fault (41) occurs in the other devices (2), and collating means (6) for collating the fault information logged by the operating system (21) when the specific pseudo fault (41) is generated by the pseudo-fault generating means (4) against the expected value created by the expected value creating means (5); and the fault processing of the multiprocessing system is verified based on the collation result of the collating means (6).

2. The fault processing verification method according to claim 1, wherein the operating system (21) logs fault information that includes the time of occurrence, and the expected value creating means (5) creates an expected value that includes the time of occurrence.

3. The fault processing verification method wherein the pseudo-fault generating means (4) generates the pseudo fault by issuing a diagnostic instruction in synchronization with an access operation of the operating system (21).