JPH10240566A - Computer system

Info

Publication number: JPH10240566A
Application number: JP9045248A
Authority: JP
Inventors: Fujio Yokoyama; 不二夫横山; Masahito Ishii; 将人石井; Kenji Tsuji; 憲司辻
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-02-28
Filing date: 1997-02-28
Publication date: 1998-09-11

Abstract

PROBLEM TO BE SOLVED: To detect the location of a cable fault at an early stage, and to improve availability by reducing failure analysis man-hours and time and by reducing MTTR (mean time to repair), by providing means that records confirmation of data passage on the inter-processor communication network so that a fault location in the network can be identified. SOLUTION: A maintenance terminal 108, on receiving a report of an abnormality in the inter-PE network from any PE 101 during normal operation, instructs each PE via Ethernet 110 to perform one-to-one communication with itself. Each PE sends a packet addressed to itself. After receiving a packet transmission completion report from each PE, the maintenance terminal 108 checks whether there is a processor at which the packet did not arrive or a PE that received a packet with a wrong destination address. If there is such a processor or PE, the faulty communication path identification operation is carried out. When the faulty communication path has been identified, the maintenance terminal 108 performs the fault location identification process.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a computer system in which a plurality of processors perform processing in parallel.

[0002]

[Prior art] For conventional inter-processor communication networks, as described in Japanese Patent Application Laid-Open No. 5-81224, failure detection and the processing that follows failure detection have been discussed extensively, but nothing addresses identification of the failure location. As in general computer systems, the failure position is identified by tracing the records of machine check latches. As a result, locating failures in a communication network among many processors, in particular failures of the cables and connection points between processors, often relies on manual work, which increases failure analysis man-hours and time and lowers availability through an increase in MTTR (Mean Time To Repair).

[0003] Further, in a system in which the inter-processor network is multiplexed, when a failure occurs in one network it is necessary to identify the failure position in that network while the system as a whole keeps operating, to repair it early, and to recover from the degraded system performance. With the prior art, however, it is difficult to locate a failure of a cable or the like on the network, and it has therefore been difficult to locate a failure while the system is operating.

[0004]

[Problems to be solved by the invention] A first object of the present invention is to detect at an early stage the fault location of a failure in the inter-processor network, in particular a failure of a cable connecting processors, and thereby to improve availability by reducing failure analysis man-hours and time and by reducing MTTR.

[0005] A second object of the present invention is to identify the failure position without bringing down the entire system when a cable or the like fails.

[0006]

[Means for solving the problems] The above objects are achieved by providing: means for identifying the communication path whose failure position is to be located, by having each processor send and receive data addressed to itself; means for sending passage confirmation data onto that path; and means for recording passage only when the passage confirmation data passes.

[0007] With the above means, when a failure occurs, each processor is instructed to communicate with itself, and the approximate failed path is estimated from the positions of the processors that could not receive the data or that received data in error. Passage confirmation data is then sent onto that path, and the failure position can be identified by tracing the passage confirmation latches.

[0008] According to the above means, the man-hours for manual failure analysis can be greatly reduced, which reduces MTTR and in turn improves availability. In addition, in a multiprocessor system with a duplicated network, the failure position can be identified without bringing down the system, and the period of degraded system performance can be shortened.

[0009]

[Embodiments of the invention] An embodiment of the present invention is described below with reference to the drawings.

[0010] FIG. 1 is a configuration diagram of an embodiment of the present invention. Each processor (PE in the figure) 101 is connected through a network and can communicate with any PE, including itself.

[0011] The network 107 is of the crossbar type and is composed of crossbar switches (XB) in three dimensions, Z, Y, and X. Each dimension consists of a plurality of identical crossbar switches: Z-XB (102), Y-XB (103), and X-XB (104). Each switch receives eight communication paths 105 as inputs and outputs eight communication paths 105 to the next-stage switch. Each switch is also provided with one passage confirmation latch 106 for each input and each output.
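As an illustration only, the switch arrangement described above can be modelled as a small data structure. This is a minimal sketch: the class name `CrossbarSwitch`, the helper `build_network`, and the number of switches per dimension are hypothetical and do not come from the specification, which states only that each switch has eight inputs, eight outputs, and one passage confirmation latch per input and output.

```python
from dataclasses import dataclass, field

PORTS_PER_SWITCH = 8  # eight input and eight output communication paths per switch


@dataclass
class CrossbarSwitch:
    """One crossbar switch with a passage confirmation latch per input and output."""
    name: str
    input_latches: list = field(default_factory=lambda: [False] * PORTS_PER_SWITCH)
    output_latches: list = field(default_factory=lambda: [False] * PORTS_PER_SWITCH)

    def reset_latches(self) -> None:
        """Clear every latch, as would be done before a trace is started."""
        self.input_latches = [False] * PORTS_PER_SWITCH
        self.output_latches = [False] * PORTS_PER_SWITCH


def build_network(switches_per_dimension: int) -> dict:
    """Build the Z, Y and X stages; the count per stage is a hypothetical parameter."""
    return {
        dim: [CrossbarSwitch(f"{dim}-XB{i}") for i in range(switches_per_dimension)]
        for dim in ("Z", "Y", "X")
    }


network = build_network(switches_per_dimension=2)
latch_count = sum(
    len(sw.input_latches) + len(sw.output_latches)
    for stage in network.values()
    for sw in stage
)
print(latch_count)  # 3 dimensions x 2 switches x 16 latches
```

With two switches per dimension the model holds 96 latches, which gives a feel for how much state a maintenance terminal would have to scan.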

[0012] The switching method of this network is packet switching, and the passage confirmation data is also a packet. Hereinafter, the passage confirmation data is referred to as a trace packet.

[0013] The PEs 101 and the network 107 are connected to the maintenance terminal by Ethernet 110 and the scan-dedicated signal 112, respectively. The scan-dedicated signal 111 consists of a control signal, a data signal, and an address signal.

[0014] FIG. 2 outlines the failure analysis procedure. In steps 201 to 204, each PE performs one-to-one communication with itself in order to narrow down the path on which the fault occurred, and in step 205 the failure position identification process is performed (details are shown in FIG. 3). The failure analysis is controlled from the maintenance terminal, which collects failure analysis data from the PEs 101 and the network switches 102 to 104 using the Ethernet 110 and the scan signal 111 of FIG. 1. Failure analysis can be started automatically after the maintenance terminal receives an abnormality report, or it can be started by manual instruction.

[0015] FIG. 3 outlines the fault location identification process corresponding to 205 in FIG. 2. The left side of the dotted line shows the processing in the maintenance terminal 108, and the right side shows the processing in the PE 101. The operation of the maintenance terminal 108 consists of analysis operations such as scan control of the switches 102 to 104, instructing a PE to send a trace packet, identifying the communication path, tracing the passage confirmation latches, and displaying the fault location.

[0016] FIG. 4 is a diagram of the packet format. When the PE 101 is instructed by the maintenance terminal 108 to send a failure position trace packet, it sets the flag indicating a trace packet (the J flag) by setting the third bit of the second word in FIG. 4 to '1', and then sends the packet.
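The flag handling described above can be sketched as a simple bit operation. This is an illustrative sketch only: the specification says the flag is the third bit of the second word but does not state the bit-numbering convention or the word width, so the choice of bit index 3 counted from the least significant bit, the list-of-integers packet representation, and the function names are all assumptions.

```python
J_FLAG_BIT = 3                 # assumed: bit index 3, counted from the least significant bit
J_FLAG_MASK = 1 << J_FLAG_BIT  # mask for the J flag inside the second word


def mark_as_trace_packet(words):
    """Return a copy of the packet with the J flag set in the second word."""
    marked = list(words)
    marked[1] |= J_FLAG_MASK
    return marked


def is_trace_packet(words):
    """True when the J flag of the second word is set."""
    return bool(words[1] & J_FLAG_MASK)


# Hypothetical three-word packet: the word contents are placeholders.
packet = [0x00000001, 0x00000000, 0xDEADBEEF]
trace = mark_as_trace_packet(packet)
print(is_trace_packet(packet), is_trace_packet(trace))
```

Returning a copy rather than modifying the packet in place keeps the ordinary packet available for comparison; a real sender would more likely set the bit directly in its transmit buffer.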

[0017] Although not shown in the figures, when the switches 102 to 104 receive a packet, they check whether the J flag is set and set the passage confirmation latch on the input side accordingly. Setting of the passage confirmation latch can be realized by connecting the third bit of the second word (J flag 403) of the packet stored in the receive buffer to the Data terminal of the passage confirmation latch, and using the packet reception event as the clock enable condition. When a trace packet is sent from a switch to the next-stage switch or to the PE 101, the data of the J flag 403 in the transmit buffer is set in the passage confirmation latch on the output side.

[0018] As another method of setting the passage confirmation latch, a control program in the switch may check the packet data and set the latch. As a hardware implementation other than the one described above, there is also a method that uses Set/Reset terminals. The passage confirmation latch can also be set using data other than that stored in the transmit and receive buffers.
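The two hardware options mentioned in the preceding paragraphs, a latch fed through its Data terminal and a latch driven through Set/Reset terminals, can be contrasted with a small software model. This is a behavioural sketch under stated assumptions, not a description of the actual circuit: the class names, the J-flag bit position, and the two-word packets are hypothetical.

```python
J_FLAG_MASK = 1 << 3  # assumed J-flag position in the second packet word


class DTypeLatch:
    """Data terminal = J flag of the buffered packet, clock = packet-receive event."""

    def __init__(self):
        self.q = False

    def on_packet(self, words):
        # Every received packet clocks the current J-flag value into the latch.
        self.q = bool(words[1] & J_FLAG_MASK)


class SetResetLatch:
    """Set terminal driven by the J flag; cleared only by an explicit reset."""

    def __init__(self):
        self.q = False

    def on_packet(self, words):
        if words[1] & J_FLAG_MASK:
            self.q = True

    def reset(self):
        self.q = False


trace_packet = [0x1, J_FLAG_MASK]
normal_packet = [0x1, 0x0]

d_latch, sr_latch = DTypeLatch(), SetResetLatch()
for pkt in (trace_packet, normal_packet):
    d_latch.on_packet(pkt)
    sr_latch.on_packet(pkt)

print(d_latch.q, sr_latch.q)  # False True
```

In this model, an ordinary packet arriving after the trace packet overwrites the Data-terminal latch, whereas the Set/Reset variant holds its value until it is reset. Whether that difference matters in practice depends on traffic on the path during the trace, which the specification does not discuss.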

[0019] The failure analysis operation is described for the case in which the communication path 105 marked with an X in FIG. 1 is broken. First, the maintenance terminal, having received a report of an abnormality in the inter-PE network from any PE 101 during normal operation, instructs each PE via Ethernet 110, in accordance with FIG. 2, to perform one-to-one communication with itself. Each PE sends a packet addressed to itself. After receiving a packet transmission completion report from each PE, the maintenance terminal checks whether there is a processor at which the packet did not arrive or a PE that received a packet with a wrong destination (202 to 203). If there is such a processor or PE, the faulty communication path identification operation of 204 is performed. The communication path along which a packet is transmitted is not detailed here, but it is determined by a fixed algorithm and recorded in a routing table. As shown in FIG. 4, this route is also described in words 1 to 2 of the packet. If there is no processor at which the packet did not arrive and no PE that received a packet with a wrong destination, that is, if the one-to-one communication completed normally, failure analysis proceeds by other means not described in this specification.
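The check performed in steps 202 to 203 can be sketched as a classification over the reports collected from the PEs. This is a minimal sketch under stated assumptions: the function name `find_suspect_pes`, the shape of the report (a mapping from each PE to the destination identifiers of the packets that arrived there), and the sample data are hypothetical.

```python
def find_suspect_pes(all_pes, received):
    """Classify PEs after each one has sent a packet addressed to itself.

    `received` maps a PE id to the list of destination ids carried by the
    packets that actually arrived at that PE.
    """
    # A PE whose own self-addressed packet never came back.
    not_arrived = [pe for pe in all_pes if pe not in received.get(pe, [])]
    # A PE that received a packet addressed to some other PE.
    misaddressed = [
        pe for pe in all_pes
        if any(dest != pe for dest in received.get(pe, []))
    ]
    return not_arrived, misaddressed


pes = ["PE0", "PE1", "PE2", "PE3"]
# Hypothetical report: PE1's packet was misdelivered to PE3.
report = {"PE0": ["PE0"], "PE1": [], "PE2": ["PE2"], "PE3": ["PE3", "PE1"]}

not_arrived, misaddressed = find_suspect_pes(pes, report)
needs_path_identification = bool(not_arrived or misaddressed)
print(not_arrived, misaddressed, needs_path_identification)
```

If both lists are empty, the one-to-one communication completed normally and, as the text states, analysis would continue by other means.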

[0020] When the faulty communication path has been identified by 204, the maintenance terminal executes the fault location identification process of 205. The details are explained with reference to FIG. 3. First, all passage confirmation latches 106 are reset through the scan-dedicated signal 111 (301), and the PE on the faulty communication path is instructed via Ethernet 110 to send a trace packet (302). The PE that received the instruction sends a trace packet onto the faulty communication path concerned (303). After receiving the trace packet transmission completion report from that PE, the maintenance terminal 108 confirms that the PE has not received the packet, then searches the passage confirmation latches on the trace path through the scan-dedicated signal 111 and detects a latch that is not set (not lit) (304 to 305). In the case of the break shown in FIG. 1, the passage confirmation latch on the input side of the crossbar switch 104 (X-XB) is not set, while the passage confirmation latch on the output side of the source-side crossbar switch 103 (Y-XB1) is set. It is therefore found that there is a break between X-XB 104 and Y-XB 103.
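The latch trace of steps 304 to 305 amounts to walking the latches in path order and finding the first one that is not set. The following sketch illustrates that idea only: the function name `locate_break`, the latch labels, the Z to Y to X ordering of the path, and the scan result are assumptions made for the example.

```python
def locate_break(latches_on_path):
    """Return the pair of adjacent points between which the trace packet vanished.

    `latches_on_path` lists (name, is_set) for each passage confirmation latch
    in the order the trace packet should reach them.
    """
    previous = "source PE"
    for name, is_set in latches_on_path:
        if not is_set:
            return previous, name
        previous = name
    return None  # every latch is set: no break found on the switch path


# Hypothetical scan result for the break shown in FIG. 1.
scan = [
    ("Z-XB in", True), ("Z-XB out", True),
    ("Y-XB1 in", True), ("Y-XB1 out", True),
    ("X-XB in", False), ("X-XB out", False),
]
print(locate_break(scan))  # ('Y-XB1 out', 'X-XB in')
```

The returned pair names the last latch the packet reached and the first one it did not, which corresponds to the cable between Y-XB and X-XB in the example of the text.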

[0021] The result is shown as the error location on the display of the maintenance terminal (305), and the cable is replaced.

[0022]

[Effects of the invention] According to the above embodiment, the failure location can be identified without manually checking cables for breaks, so the man-hours for network failure analysis can be greatly reduced and the MTTR of the system can be shortened. Even in a multiplexed inter-processor network, the position of a broken cable can be detected remotely from the maintenance terminal, so failure analysis can be performed without physically interfering with the network that is operating normally, and the period of degraded system performance can be shortened.

[Brief description of the drawings]

FIG. 1 is a diagram showing the tracing method used when a cable is broken, according to an embodiment of the present invention.

FIG. 2 shows an outline of the failure analysis procedure.

FIG. 3 shows the fault location identification procedure.

FIG. 4 shows the packet format of the embodiment.

[Explanation of symbols]

101: Processor (PE)
102 to 104: Crossbar switches
105: Communication path
106: Passage confirmation latch
107: Inter-PE network
108: Maintenance terminal
109: Scan control circuit
110: Ethernet
111: Scan-dedicated signal
201 to 204: Fault path identification process using inter-processor communication
205: Fault location identification process
301 to 303: Trace packet transmission process for fault location identification
304: Passage confirmation latch trace process
305: Fault location display process
401: Packet ID
402: Routing information field
403: Trace packet flag

Claims

[Claims]

1. A computer system in which a plurality of processors perform processing in parallel, characterized by comprising means for recording confirmation of data passage on the network, in order to identify a fault location in the inter-processor communication network.

2. The computer system according to claim 1, further comprising means for designating data to be transmitted to the network as passage confirmation data, characterized in that passage is recorded in the recording means only when the designated data passes.