JPH03250227A

JPH03250227A - Fault information control system for distributed system

Info

Publication number: JPH03250227A
Application number: JP2045399A
Authority: JP
Inventors: Reiji Hanawa (塙　礼司); Kazuyuki Nishikawa (西川　和幸)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1990-02-28
Filing date: 1990-02-28
Publication date: 1991-11-08

Abstract

PURPOSE: To reduce the load on the line by counting fault information and sending only fault information that exceeds a threshold to a centralized monitoring host. CONSTITUTION: When a no-response fault occurs at a terminal and its terminal address and status code are present in an accumulated record in the accumulation table 123, a fault information control processing unit 121 adds 1 to the fault count of that record and updates its polling transmission time. If no such record exists, an accumulated record is created from the terminal address, status code, and polling transmission time and written to the accumulation table 123. Next, the fault information control processing unit 121 compares the terminal address and status code of the accumulated record with those of each reference record in a reference table 122. When they match and the fault count of the accumulated record is larger than the threshold of the reference record, the accumulated record is sent to the centralized monitoring host and the fault count is reset. Thus, the load on the line is reduced.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

This method relates to controlling, in a distributed machine within a network system, the transmission of fault information to a centralized monitoring host.

[Prior Art]

As described in Japanese Patent Publication No. 62-217750, a conventional method provides the management center with means for accumulating polling result information from nodes and means for estimating the location of a fault. This reduces the operator's work of estimating the fault location when a terminal does not respond to polling.

[Problem to be solved by the invention]

In the conventional method above, when the distributed machine polls a terminal that is powered off or not connected, the terminal does not respond, and fault information is sent to the centralized monitoring host. Fault information caused by a terminal being powered off or disconnected is unnecessary for isolating faults, yet suppressing such information at the distributed machine was not considered. This increased both the load on the line and the load on the centralized monitoring host.

The object of this method is to suppress unnecessary no-response fault information at the distributed machine, thereby reducing the load on the line and the load on the centralized monitoring host.

[Means to solve the problem]

To achieve the above object, the distributed machine is provided with: a fault information control processing unit that controls the sending of fault information to the centralized monitoring host; an accumulation table holding multiple accumulated records, each consisting of the address of a terminal at which a no-response fault occurred, a status code, a fault count, and a polling transmission time; and a reference table holding multiple reference records, each consisting of a terminal address, a status code, and a threshold.

When the distributed machine sends a poll to a terminal and receives no response, it holds the terminal address, status code, and polling transmission time in memory. The fault information control processing unit then refers in turn to the terminal address and status code of each accumulated record in the accumulation table. If a record matches, its fault count is incremented by 1 and its polling transmission time is updated. If no record matches, an accumulated record consisting of the terminal address, the status code, and a fault count of 1 is written to the accumulation table. In addition, if the difference between the polling transmission time of the accumulated record and the immediately preceding polling transmission time is greater than the fixed polling interval, the fault count of that accumulated record is reset.

Next, to decide whether to send the accumulated record to the centralized monitoring host, each reference record in the reference table is referred to in turn. If a reference record's terminal address and status code match those of the accumulated record, the fault count is compared with that reference record's threshold; if the fault count is greater than the threshold, the accumulated record is sent to the centralized monitoring host and the fault count is reset. If no reference record matches the terminal address and status code of the accumulated record, the accumulated record is sent to the centralized monitoring host and its fault count is reset.

[Operation]

If a no-response fault occurs at a terminal and the terminal's address and status code are present in an accumulated record in the accumulation table, the fault information control processing unit adds 1 to that record's fault count and updates its polling transmission time. If no such record exists, it creates an accumulated record from the terminal address, status code, and polling transmission time and writes it to the accumulation table. Next, the unit compares the terminal address and status code of the accumulated record with those of each reference record in the reference table. If they match and the record's fault count is greater than the reference record's threshold, it sends the accumulated record to the centralized monitoring host and resets the fault count. If the terminal address and status code of the accumulated record are not present in the reference table, it immediately sends the accumulated record to the centralized monitoring host and resets the fault count of that record in the accumulation table.

[Example]

An embodiment of this method is described below with reference to the drawings.

FIG. 1 shows an embodiment of this method. The fault information control processing unit 121, the reference table 122, the accumulation table 123, and related mechanisms are the means required to implement this method.

First, the distributed machine 12, to which multiple terminals 124 are connected, sends a poll to each terminal 124 to ask whether it has a data transfer request. Polling is performed at fixed time intervals. Once a terminal responds, no further polls are sent to it until transmission and reception are complete.
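The polling rule above (poll every terminal at a fixed interval, but skip a terminal that has responded until its transfer ends) can be sketched as a small scheduler. The `PollScheduler` class and its method names are hypothetical; the patent does not say how the distributed machine tracks transfers in progress.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PollScheduler:
    """Hypothetical helper tracking which terminals are mid-transfer."""
    terminals: List[int]
    busy: Set[int] = field(default_factory=set)

    def targets(self) -> List[int]:
        # Terminals to poll this cycle: all except those currently transferring.
        return [t for t in self.terminals if t not in self.busy]

    def on_response(self, terminal: int) -> None:
        # A response to the poll starts a transfer; stop polling this terminal.
        self.busy.add(terminal)

    def on_transfer_complete(self, terminal: int) -> None:
        # Resume polling once transmission and reception have finished.
        self.busy.discard(terminal)
```

For example, with terminals 1, 2, and 3, a response from terminal 2 removes it from `targets()` until `on_transfer_complete(2)` is called.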

When the distributed machine detects that a terminal has not responded within a fixed time after the poll was sent, it holds in memory the terminal address and the status code and time at which the poll was sent. The fault information control processing unit 121 then refers in turn to the terminal address and status code of each accumulated record in the accumulation table 123 shown in FIG. 2. If a matching accumulated record exists, the unit increments its fault count by 1 and updates its polling transmission time. If none exists, the unit writes the terminal address, status code, and polling transmission time to the accumulation table 123 as an accumulated record with a fault count of 1. The unit also refers to each accumulated record and, if the difference between the record's polling transmission time and the immediately preceding polling transmission time is greater than the fixed time interval, resets that record's fault count.

Next, the fault information control processing unit 121 refers in turn to the terminal address and status code of each reference record in the reference table 122 shown in FIG. 3. If a reference record matching the terminal address and status code of the accumulated record exists and the accumulated record's fault count is greater than that reference record's threshold, the unit sends the accumulated record to the centralized monitoring host 11 and resets the fault count. If no matching reference record exists, the accumulated record is sent to the centralized monitoring host 11 immediately. The centralized monitoring host receives the accumulated record and shows it on the display device 111.

FIG. 4 is a time chart showing the polls (ENQ) sent from the distributed machine to a terminal, the normal responses (ACK) or abnormal responses (NAK), and the fault count values. After four polls are sent to the terminal, a normal response arrives and data transmission and reception with the terminal begin. An abnormal response (NAK) then occurs, the fault count is reset, and polling is resent; when the count exceeds the threshold (fault count value 5), fault information is sent to the centralized monitoring host.
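The FIG. 4 sequence can be replayed for a single terminal with the sketch below. The event labels ("TIMEOUT" for a poll with no response, "ACK", "NAK") and the `replay_time_chart` function are hypothetical. It assumes ACK leaves the count unchanged, NAK resets it as in the time chart, and a report is sent (and the count reset) when the count is strictly greater than the threshold. Because the text does not fix whether "fault count value 5" is the threshold itself or the first count above it, the threshold is a parameter.

```python
from typing import List, Tuple


def replay_time_chart(events: List[str],
                      threshold: int) -> List[Tuple[str, int, bool]]:
    """Return (event, fault count after the event, reported) for each event."""
    count = 0
    trace: List[Tuple[str, int, bool]] = []
    for event in events:
        reported = False
        if event == "TIMEOUT":
            count += 1  # poll sent, no response: one more fault
            reported = count > threshold
        elif event == "NAK":
            count = 0  # abnormal response: the chart resets the count
        elif event != "ACK":
            raise ValueError(f"unknown event: {event}")
        trace.append((event, count, reported))
        if reported:
            count = 0  # reset after sending the record to the host
    return trace
```

Replaying four timeouts, an ACK, a NAK, and then six timeouts with a threshold of 5 reports to the host only on the last timeout, when the count reaches 6.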

[Effect of the invention]

According to this method, when the distributed machine polls each terminal and receives no response, it does not immediately send fault information to the centralized monitoring host. Instead, it counts the fault information and sends only fault information that exceeds a threshold. This limits the amount of fault information sent to the centralized monitoring host, reducing the load on the line, and the terminal address, status code, and fault count sent to the host make fault isolation easy.

[Brief explanation of drawings]

FIG. 1 is a configuration diagram of a network in which faults are centrally managed from the centralized monitoring host, according to an embodiment of this method. FIG. 2 shows the format of the reference table referred to when deciding whether to send fault information to the centralized monitoring host. FIG. 3 shows the format of the accumulation table that counts fault information when a fault occurs. FIG. 4 is a time chart showing the fault count values when polling requests (ENQ) are sent and normal responses (ACK) and abnormal responses (NAK) are received.

Reference numerals: 11: centralized monitoring host; 12, 13: distributed machines; 14: line; 111: display device; 121: fault information control processing unit; 122: reference table; 123: accumulation table; 124: terminal.

Claims

1. A fault information control method for a distributed system, in a network system consisting of a centralized monitoring host, a plurality of distributed machines, and a plurality of terminals connected to the distributed machines, wherein: the distributed machine sends polls to the terminals to inquire whether they have data transfer requests; if there is no response after a fixed time, the address of the terminal, a status code indicating the state at the time of polling, and the polling transmission time are held in memory; a fault information control processing unit refers to an accumulation table having a plurality of accumulated records, each consisting of a terminal address, a status code, a fault count, and a polling transmission time, adds 1 to the fault count of the accumulated record whose terminal address and status code match, and updates its polling transmission time; if the difference between the polling transmission time of the accumulated record and the immediately preceding polling transmission time is greater than the fixed time interval, the fault count is reset; the unit then refers to a reference table having a plurality of reference records, each consisting of a terminal address, a status code, and a fault count threshold; and only when the terminal address and status code match and the fault count is greater than the threshold, the accumulated record is sent to the centralized monitoring host and the fault count of the accumulated record is reset.