JPH08235009A

JPH08235009A - Automatic system degeneration detection method

Info

Publication number: JPH08235009A
Application number: JP7057901A
Authority: JP
Inventors: Kazutoshi Kobayashi; 千稔小林
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-02-22
Filing date: 1995-02-22
Publication date: 1996-09-13

Abstract

PURPOSE: To manage device faults across the entire system and, when the degree of the faults exceeds a fixed level, to switch from the primary system to the standby system even if the primary system has not gone down. CONSTITUTION: In an information processing system comprising a primary system, a standby system, and a centralized monitoring device connected to both, the centralized monitoring device is provided with a fault management table. The table describes the predesignated fault parts of the devices of each system and the weight information 11 given to those fault parts, a system current performance degradation value 12, and a system performance degradation allowable value 13. The centralized monitoring device receives fault information from the primary system and analyzes it to identify the fault part. It then reads the weight information for that fault part from the fault management table, updates the value 12 using the weight information, compares it with the value 13, and, when the value 12 exceeds the value 13, issues an interrupt to switch the primary system to the standby system.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an automatic system degeneration detection method. When applied to information processing systems, in particular computer systems that must always process large volumes of data instantaneously and systems of a highly public nature, the method makes it possible to maintain a stable processing speed of the computer system at all times.

[0002]

[Prior Art] In conventional computer systems, a method in which processors mutually monitor the other system has been used, as in JP-A-6-52131 and elsewhere. Such a scheme, however, is a technique for detecting a system down. It is limited in its ability to detect the effect on system down of performance degradation caused by hardware faults that do not lead directly to a system down, such as a partially disabled state (degradation) caused by a fault in a TLB, BS, or the like, or path blockage in an I/O device having a multipath function. Because fault detection and blocking are performed by the fault processing of each individual device, the detailed display of the fault part and the state of the device had to be examined at each device. Because it cannot be determined how far the performance required of the system is affected by faults in the related devices, switching to the standby system (duplexed system) is not possible unless the system goes down. The overflow of processing load (performance) that arises because resources are not being used effectively is transferred to the standby system only after the system goes down and switching occurs, so there was the further problem that processing after the system down also becomes heavy because of peak processing.

[0003]

[Problems to Be Solved by the Invention] With a fault monitoring scheme such as that of JP-A-6-52131, performance degradation caused by hardware faults cannot be detected. It was therefore not possible to detect performance degradation early and always guarantee stable system processing capacity without impairing system reliability. An object of the present invention is to manage, across the entire system, what kind of fault is occurring in which device and, when the degree of the faults reaches or exceeds a certain level, to enable switching to the standby system even though the system has not gone down, thereby always guaranteeing stable system processing capacity.

[0004]

[Means for Solving the Problems] To achieve the above object, the present invention is an automatic system degeneration detection method in an information processing system comprising a main processing system, a standby processing system, and a centralized monitoring device connected to these systems. The centralized monitoring device is provided with a fault management table describing the predesignated fault parts of each device of each processing system and the weight information given to those fault parts, a system current performance degradation value, and a system performance degradation allowable value. The centralized monitoring device receives fault information from the main processing system and analyzes it to identify the device in which the fault occurred and its fault part. It reads the weight information given to that fault part from the fault management table, updates the system current performance degradation value using the weight information, and compares the updated system current performance degradation value with the system performance degradation allowable value. When the updated system current performance degradation value exceeds the system performance degradation allowable value, the centralized monitoring device issues to the main processing system a performance degradation interrupt instructing switching from the main processing system to the standby processing system.

[0005]

[Operation] With the above means, performance degradation caused by faults in the system can be learned of promptly, and switching to the standby system becomes possible once the performance degradation reaches or exceeds a certain level. System performance can thus be kept stable at all times, and system reliability can be improved.

[0006]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is an overall block diagram of a system showing one embodiment of the present invention. In this system, CPU#1 to CPU#n and IOC#1 to IOC#n are each connected to a centralized monitoring device 6 via an Ethernet LAN 5. The centralized monitoring device 6 comprises a communication adapter 7 that performs communication; a fault management table 8 in which information on the predesignated fault parts of all connected devices is registered; a system configuration management table 10 that is used to display the state of the connected devices and represents their connection state with the CPUs; and a fault processing program 9 that analyzes fault information from the connected devices and determines performance degradation.

[0007] FIG. 2 shows the fault management table 8 populated with one example of setting information. The table consists of three kinds of information: fault part information, used to calculate the importance of each individual device fault, made up of the predesignated fault parts of all devices connected to the centralized monitoring device 6 and the weight information 11 given to those fault parts; a system current performance degradation value 12, which expresses the overall fault state of the managed system as a number; and a system performance degradation allowable value 13, which specifies in advance how much performance degradation the system tolerates. A performance degradation interrupt is raised when the system current performance degradation value 12 becomes equal to or greater than the system performance degradation allowable value 13.
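
The table layout described in this paragraph can be pictured as a small data structure. The following Python sketch is illustrative only: the class name `FaultManagementTable`, the field names, the device and part labels, and the numeric weights are all assumptions, since the publication states only that the table holds fault-part weights (11), a current degradation value (12), and an allowable value (13).

```python
from dataclasses import dataclass, field


@dataclass
class FaultManagementTable:
    """Hypothetical model of fault management table 8."""

    # Weight information 11, keyed by (device name, fault part).
    weights: dict[tuple[str, str], int] = field(default_factory=dict)
    # System current performance degradation value 12.
    current_degradation: int = 0
    # System performance degradation allowable value 13.
    allowable_degradation: int = 0

    def interrupt_due(self) -> bool:
        # Paragraph [0007] words the trigger as "equal to or greater than";
        # the claim words it as "exceeds". This sketch follows [0007].
        return self.current_degradation >= self.allowable_degradation


table = FaultManagementTable(
    weights={("CPU#1", "TLB"): 3, ("IOC#1", "path"): 5},  # made-up values
    allowable_degradation=8,
)
table.current_degradation = 5
print(table.interrupt_due())  # False
table.current_degradation = 8
print(table.interrupt_due())  # True
```

Keeping the two system-wide values next to the per-part weights mirrors the description of a single table; a real implementation could equally well store them separately.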

[0008] FIG. 3 is a flowchart outlining the process of creating the setting information of the fault management table 8. The fault part information for every device is read from a file in which it is registered in advance for every device name. The weight information 11 given to each fault part is set for each target device. Basically, because the connected devices are known automatically from the system configuration management table 10, preregistered standard weight values are set automatically. At this time, for example, if an IOC for which a four-path connection to the CPU is standard has only two paths connected, the path-fault weight information 11 of that IOC is set to twice the standard value. When the standard values are not to be used, each item can be changed individually. In addition, the performance degradation allowable value 13 of the managed system is set as management table information. This value may be the same every day, but for flexibility it can be changed.
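
The weight adjustment in this paragraph can be illustrated with a short calculation. The publication gives only one data point (four paths standard, two connected, weight doubled); the proportional rule used below, the function name `path_fault_weight`, and the sample weight of 5 are assumptions made for this sketch.

```python
def path_fault_weight(standard_weight: float,
                      standard_paths: int,
                      connected_paths: int) -> float:
    """Scale the standard path-fault weight by how few paths are connected.

    Assumed rule: weight = standard_weight * standard_paths / connected_paths.
    It reproduces the publication's example (4 standard, 2 connected -> x2),
    but the publication does not state a general formula.
    """
    if standard_paths <= 0 or connected_paths <= 0:
        raise ValueError("path counts must be positive")
    return standard_weight * standard_paths / connected_paths


print(path_fault_weight(5, 4, 4))  # 5.0  (standard configuration)
print(path_fault_weight(5, 4, 2))  # 10.0 (half the paths, double the weight)
```

The intuition is that losing one path matters more when fewer paths remain to absorb the load.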

[0009] FIG. 4 is a flowchart outlining the fault analysis process performed by the fault processing program 9. When a fault occurs in any of the CPU/IOC devices making up the system, all of the individual device fault information is sent from the faulty device to the centralized monitoring device 6 via the Ethernet LAN 5. On receiving fault information via the communication adapter 7, the centralized monitoring device 6 analyzes the fault content using the fault processing program 9 and identifies the device and part in which the fault occurred. It reads out the fault weight 11 and updates the current performance degradation value 12 of the corresponding system by adding up all the weights read out. It also reflects the occurrence of the fault for that device in the system configuration management table 10, updating that table. It then reads out the system performance degradation allowable value 13 and performs the performance degradation determination process, which compares the allowable value 13 with the current performance degradation value 12 of the system.
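
The analysis step of FIG. 4 might look roughly like the following Python sketch. The shape of a fault report (a dict with `device` and `parts`), the function name `analyze_fault`, the status strings, and the weights are all hypothetical; the publication does not define a message format.

```python
def analyze_fault(report: dict,
                  weights: dict,
                  current_value: int,
                  config_status: dict) -> int:
    """Return the updated current performance degradation value (12).

    report        -- assumed format: {"device": str, "parts": [str, ...]}
    weights       -- weight information 11, keyed by (device, part)
    config_status -- stand-in for system configuration management table 10
    """
    device = report["device"]
    for part in report["parts"]:
        # Parts with no registered weight contribute nothing (an assumption;
        # the publication only discusses predesignated fault parts).
        current_value += weights.get((device, part), 0)
    config_status[device] = "faulty"  # reflect the fault in table 10
    return current_value


weights = {("IOC#1", "path"): 5, ("CPU#1", "TLB"): 3}
status = {"CPU#1": "normal", "IOC#1": "normal"}
value = analyze_fault({"device": "IOC#1", "parts": ["path"]}, weights, 0, status)
value = analyze_fault({"device": "CPU#1", "parts": ["TLB"]}, weights, value, status)
print(value, status)  # 8 {'CPU#1': 'faulty', 'IOC#1': 'faulty'}
```

The returned value would then be passed to the determination process together with the allowable value 13.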

[0010] FIG. 5 is a flowchart outlining the performance degradation determination process. If the current performance degradation value 12 of the system is within the allowable value, the process ends without doing anything. If the allowable value has been exceeded, the centralized monitoring device 6 refers to the system configuration management table 10 and issues a performance degradation interrupt, via the Ethernet LAN 5, to CPU#1 to CPU#n of the system that exceeded the allowable value, reporting that the system is suffering a performance degradation fault. Through the fault report delivered by this interrupt, each system can always learn promptly of performance degradation in the related systems, including itself. It can then switch to the standby system and automatically perform the resulting IOC switching and hot-standby switching. This allows the system to secure stable processing capacity at all times.
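
A minimal sketch of the determination process of FIG. 5 follows. The function name `judge_degradation`, the callback `send_interrupt`, and the sample values are assumptions; in particular, the callback merely stands in for whatever mechanism delivers the interrupt over the Ethernet LAN 5. This sketch uses a strict "exceeds" comparison, as this paragraph and the claim do, whereas paragraph [0007] words the trigger as "equal to or greater than".

```python
from typing import Callable, Iterable


def judge_degradation(current: int,
                      allowable: int,
                      cpus: Iterable[str],
                      send_interrupt: Callable[[str], None]) -> list[str]:
    """Notify every CPU of the system if the allowable value is exceeded.

    Returns the list of CPUs that were sent a performance degradation
    interrupt (empty when the value is within the allowable range).
    """
    if current <= allowable:
        return []  # within tolerance: finish without doing anything
    notified = []
    for cpu in cpus:
        send_interrupt(cpu)  # hypothetical delivery mechanism
        notified.append(cpu)
    return notified


sent: list[str] = []
print(judge_degradation(5, 8, ["CPU#1", "CPU#2"], sent.append))  # []
print(judge_degradation(9, 8, ["CPU#1", "CPU#2"], sent.append))  # ['CPU#1', 'CPU#2']
```

Passing the delivery mechanism in as a callback keeps the comparison logic testable without any network access.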

[0011] The centralized monitoring device 6 can also mark, in the system configuration management table 10, each device of the system suffering the performance degradation fault, update the table contents, and display the state of the device faults, or sound an alarm at the moment a fault occurs to notify the system administrator. In addition, at system switchover, the centralized monitoring device 6 can receive from the system an indication of each device of the standby system that has become active, update the table contents, and display each device of the system that was switched to.

[0012]

[Effects of the Invention] According to the present invention, performance degradation is always learned of promptly, so switching to the standby system is possible. System performance can therefore be kept stable at all times, and system reliability can be increased. In addition, the display of fault parts and the state of devices before and after switching can be checked, which makes centralized monitoring of the hardware possible.

[Brief description of drawings]

FIG. 1 is an overall block diagram of a system showing one embodiment of the present invention.

FIG. 2 is a diagram showing the fault management table populated with one example of information.

FIG. 3 is a flowchart outlining the process of creating the setting information of the fault management table.

FIG. 4 is a flowchart outlining the fault analysis process performed by the fault processing program.

FIG. 5 is a flowchart outlining the performance degradation determination process.

[Explanation of symbols]

1, 2: Central processing unit (CPU)
3, 4: Input/output device (IOC)
5: Ethernet LAN
6: Centralized monitoring device
7: Communication adapter
8: Fault management table
9: Fault processing program
10: System configuration management table

Claims

1. An automatic system degeneration detection method in an information processing system comprising a main processing system, a standby processing system, and a centralized monitoring device connected to these systems, wherein the centralized monitoring device: is provided with a fault management table describing predesignated fault parts of each device of each of the processing systems and weight information given to those fault parts, a system current performance degradation value, and a system performance degradation allowable value; receives fault information from the main processing system and analyzes the fault information to identify the device in which the fault occurred and its fault part; reads the weight information given to that fault part from the fault management table and updates the system current performance degradation value using the weight information; compares the updated system current performance degradation value with the system performance degradation allowable value; and, when the updated system current performance degradation value exceeds the system performance degradation allowable value, issues to the main processing system a performance degradation interrupt instructing switching from the main processing system to the standby processing system.