JP2004038654A

JP2004038654A - System for detecting abnormality in parallel computer, and method for detecting the same

Info

Publication number: JP2004038654A
Application number: JP2002195945A
Authority: JP
Inventors: Toshihiro Sato; 佐藤　敏浩; Hideo Fukuda; 福田　秀郎; Shigeyuki Nishijima; 西島　茂行; Takeshi Arikawa; 有川　毅; Yasuhiko Sato; 佐藤　康彦
Original assignee: RYOKEN TEKKU KK; SYSTEM FIVE KK; Mitsubishi Heavy Industries Ltd
Current assignee: RYOKEN TEKKU KK; SYSTEM FIVE KK; Mitsubishi Heavy Industries Ltd
Priority date: 2002-07-04
Filing date: 2002-07-04
Publication date: 2004-02-05

Abstract

PROBLEM TO BE SOLVED: To enable a prompt and appropriate response by identifying the abnormal part when an abnormality occurs in a parallel computer.

SOLUTION: An abnormality detection system for a parallel computer in which a plurality of nodes, each including a CPU board, are connected to a server via a network. The server has an operation status monitoring program that monitors the operation status of each node at a fixed period via the network, and is set to output an operation status monitoring result signal for each node. Each node has a function for constantly performing self-diagnosis, and is set to constantly output a self-diagnosis result signal of its own operation status at fixed time intervals. The system includes a switching device for switching between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of each individual node, and a monitor for displaying the output of the switching device.

COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、複数のＣＰＵ（　ｃｅｎｔｒａｌ　ｐｒｏｃｅｓｓｉｎｇ　ｕｎｉｔ：　中央処理装置）をネットワークでサーバに接続して用いる並列計算機において異状箇所を検出できる異常検出システムおよび異常検出方法に関する。
【０００２】
【従来の技術】
従来の並列計算機の異常検出システムについて、図２および図３に基づき説明する。図２は従来の並列計算機の異常検出システムの構成概要図であり、図３（ａ）は図２における異常検出用のモニタ画面の例、（ｂ）は（ａ）中の一部の拡大図である。
【０００３】
近年、科学技術計算用等において、従来用いられてきたスーパーコンピュータ等の大型計算機に代えて、複数の小型のコンピュータを並列に運転する並列計算機が用いられるようになった。
【０００４】
複数のＣＰＵをネットワークでサーバに接続して用いる並列計算機（Ｂｅｏｗｏｌｆ　型）は、一般市場に販売されているオフィス用途のパーソナルコンピュータをネットワーク接続し、管理サーバと共にシステム化して構成する場合、所要スペースが増大するという問題がある。
【０００５】
そこで、通常個別に組み上げられている工業用の単体のＣＰＵセグメントを複数、ハブ２を介して接続し、サーバ１で管理するように構成することが行なわれる。
【０００６】
図２はそのような並列計算機の構成概要を示すものであり、図２において３ａ〜３ｎは、並列計算機を構成する複数のノードであり、各ノード３ａ〜３ｎは、個々にＣＰＵボード、ＬＡＮ（ｌｏｃａｌ　ａｒｅａ　ｎｅｔｗｏｒｋ）信号ボード、ＨＤＤ（ハードディスクドライブ）等を有するＣＰＵセグメントである。また各ノード３ａ〜３ｎは、数個（通常４個程度まで）毎に１つのボックス内のバックプレーンに搭載され、総計数十個になることがある。
【０００７】
各ノード３ａ〜３ｎは、それぞれがネットワーク４によりハブ２と結ばれており、ハブ２を介してサーバ１に接続し、サーバ１の管理、指令下に置かれる。
【０００８】
並列計算機の異常検出については、一般に公開されているソフト的な管理ツールもあるが、ＣＰＵの個別状態を監視を実現するものは、その動作に各ＣＰＵ時間を使ってしまい、本来の計算などの主プログラムの動作を阻害する恐れもある。
【０００９】
そのため、通常図２に示されるような並列計算機における異常検出システムは、サーバ１の稼働状況監視プログラムにより、サーバ１からハブ２、ネットワーク４を介して各ノード３ａ〜３ｎに一定周期（例えば、数秒毎）でポーリング（稼働状況監視）を行い、サーバ１はポーリング結果信号Ｘを出力し、サーバ１に接続したモニタ５のモニタ画面５ａに各ノード３ａ〜３ｎの稼働状況を表示している。なお、このプログラムソフトは、例えばＷＷＷ　ブラウザで稼動する。
【００１０】
図３（ａ）はモニタ画面５ａの例であり、各ノード３ａ〜３ｎはＡ〜Ｐのボックス内に搭載されたことを示すように、ボックスＡ〜Ｐ毎に各ノードのポーリング（稼働状況監視）結果を示すノード稼動状況表示１３が設けられる。
【００１１】
図３（ｂ）に１つのボックスＡの４個のノード稼動状況表示１３を拡大して例示するように、各ノード毎に、ＣＰＵボード、ＬＡＮ信号ボード、ＨＤＤの稼動状況が、ＬＥＤ（ｌｉｇｈｔ　ｅｍｉｔｔｉｎｇ　ｄｉｏｄｅ：　発光ダイオード）により発光表示される。ＬＥＤの発光表示は、例えば、稼働率５０％未満は青、５０〜８０％は黄、８０％以上は赤、システムダウンは黒、というようになされる。なお、稼動状況表示１３の単位は、さらに細かくあるいは大きく１ノード全体で、といようにシステム構成レベルにしたがって設定される。
【００１２】
【発明が解決しようとする課題】
しかしながら、上記のような従来の並列計算機の異常検出システムは、サーバ１側からのアクセスによるものであるため、あるノードにシステムダウンした表示（例えばＬＥＤの発光表示：黒）があっても、そのノードのＣＰＵセグメントの各ボードやＨＤＤのダウンに起因するものか、そのノードとハブ２を結ぶネットワーク４の異常に起因するものかは、特定できない。そのため異常発生時の迅速、適切な対処が困難となるという問題があった。
【００１３】
本発明は、かかる従来の並列計算機の異常検出システムにおける問題を解消し、並列計算機の異常時に異常箇所を特定でき、迅速、適切な対処を可能とする並列計算機の異常検出システムおよび異常検出方法を提供することを課題とするものである。
【００１４】
【課題を解決するための手段】
（１）本発明はかかる課題を解決するためになされたものであり、その第１の手段として、個々にＣＰＵボードを有する複数のノードをネットワークでサーバに接続して用いる並列計算機における異常検出システムにおいて、前記サーバは前記ネットワークを介して前記各ノードに対し一定周期で稼働状況監視を行う稼働状況監視プログラムを備え同各ノードの稼働状況監視結果信号を出力するように設定され、前記ノードは常時自己診断を行なう機能を備え個々の同ノードの稼動状況の自己診断結果信号を常時一定時間毎に出力するように設定されるとともに、前記各ノードの稼働状況監視結果信号と前記個々のノードの稼動状況の自己診断結果信号とを切り換える切換装置と、同切換装置の出力を表示する前記モニタとを備えてなることを特徴とする並列計算機の異常検出システムを提供する。
【００１５】
第１の手段によれば、通常はサーバの稼働状況監視結果による全ノードの概括的な稼動状況を知ることができ、特定のノードを指定して切換装置で切り換えればその稼動状況の自己診断結果の具体的状態を知ることができるので、並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができる。
【００１６】
（２）第２の手段としては、第１の手段の並列計算機の異常検出システムを用い、通常は前記切換装置により前記各ノードの稼働状況監視結果信号を前記モニタに出力し、同各ノードの稼働状況監視結果信号において異常を示すノードが現れた場合、前記切換装置により同異常を示すノードの稼動状況の前記自己診断結果信号を前記モニタに出力することを特徴とする並列計算機の異常検出方法を提供する。
【００１７】
第２の手段によれば、第１の手段の作用を奏するとともに、サーバの稼働状況監視結果により特定のノードに異常が発見された時、切換装置によりそのノードの自己診断結果を見て比較することにより、異常箇所がノード自体にあるのか、ネットワークにあるのか特定することが可能となる。
【００１８】
【発明の実施の形態】
図１に基づき、本発明の実施の一形態にかかる並列計算機の異常検出システムおよび異常検出方法を説明する。図１は本実施の形態の並列計算機の異常検出システムの構成概要図である。
【００１９】
図１において、前述の従来例を説明する図２、図３と同様の部分には同じ符号を付して説明を省略し、異なる点を主に以下説明する。
【００２０】
本実施の形態の並列計算機の異常検出システムにおいては、ＣＰＵボード、ＬＡＮ（ｌｏｃａｌ　ａｒｅａ　ｎｅｔｗｏｒｋ）信号ボード、ＨＤＤ（ハードディスクドライブ）等を有するＣＰＵセグメントである各ノード３ａ〜３ｎは、工業用ＣＰＵボードに標準的に搭載されているウォッチドッグタイマ（Ｗａｔｃｈ　Ｄｏｇ　Ｔｉｍｅｒ　）機能（常時割り込みをかけて自己診断を行なう機能）を有しており、ウォッチドッグタイマ機能により個々のノード３ａ〜３ｎ（ＣＰＵセグメント）の稼動状況の自己診断結果を一定時間毎（例えば、数秒毎）に常時モニタ５に出力するように、割り込みプログラムを作成してある。
【００２１】
図１において、Ｙａ、Ｙｂ、Ｙｃ〜Ｙｎはそれぞれ、ノード３ａ、３ｂ、３ｃ〜３ｎがウォッチドッグタイマ機能により出力する個々の自己診断結果信号である。
【００２２】
Ｘは図１、図２で説明したサーバ１による各ノード３ａ、３ｂ、３ｃ〜３ｎに対するポーリング（稼動状況監視）の結果出力されるポーリング結果信号（稼動状況監視結果信号）である。なお、本実施の形態においてポーリングの監視用プロコトルには、ネットワーク上に負荷が比較的軽い　ＨＴＴＰ　を採用し、監視タイミングあたり１ショットでデータ通信を完了するように構成すると好ましい。
【００２３】
ポーリング結果信号Ｘと、自己診断結果信号Ｙａ、Ｙｂ、Ｙｃ〜Ｙｎとは、切換装置６に入力され、切換装置６は通常はポーリング結果信号Ｘを選択しモニタ５に出力Ｚし、モニタ５は前述の図２のモニタ画面５ａにより、全ノード３ａ、３ｂ、３ｃ〜３ｎの稼動状況をノード稼動状況表示１３のＬＥＤで発光表示する。
【００２４】
切換装置６は、特定のノード３ｉを指定し、出力Ｚの切り換えをおこなうものであり、ノード３ｉが指定され切換指示を受けると、ノード３ｉのウォッチドッグタイマ機能による自己診断結果信号Ｙｉが出力Ｚされ、モニタ画面５ａが切り換えられて、自己診断結果信号Ｙｉの内容がモニタ画面５ａに表示される。
【００２５】
自己診断結果信号Ｙｉの表示内容は各ノードのウォッチドッグタイマ機能によって設定されるものとなるが、単に概括的にＣＰＵボード、ＬＡＮ信号ボード、ＨＤＤ等の稼動状況を示すだけでなく、個々のより具体的な稼動状態、またはデータを表示するものとできる。
【００２６】
切換装置６は、いずれかのノードのポーリング結果信号Ｘが一定の範囲を越えた時等の設定条件により、そのノードに関して自己診断結果信号Ｙｉへ自動切り換えを行なうような自動切換装置でもよく、またモニタ画面５ａを見たオペレータが随時操作できるモニタ５近傍の、ないしはモニタ５付属の切換スイッチでもよく、異常検出システムの制御レベルによって設定すればよい。
【００２７】
上記のような本実施の形態の体の並列計算機の異常検出システムによれば、通常はモニタ画面５ａには、サーバ１のポーリング結果による全ノード３ａ〜３ｎのノード稼動状況表示１３が表示され、全ノード３ａ〜３ｎの概括的な稼動状況を知ることができ、特定のノード３ｉを指定して切換装置６を切換操作すれば、その自己診断結果Ｙｉの具体的状態を知ることができる。
【００２８】
そして、サーバ１のポーリング結果により特定のノード３ｉのノード稼動状況表示１３に異常が発見された時、切換装置６によりノード３ｉの自己診断結果Ｙｉを見て比較することにより、異常箇所を特定することが可能となる。
【００２９】
すなわち、特定のノード３ｉのポーリング結果が異常を示した時、ノード３ｉの自己診断結果Ｙｉが正常であれば、ネットワーク４に異常が発生したことが分かる。
【００３０】
特定のノード３ｉのポーリング結果が異常を示し、ノード３ｉの自己診断結果Ｙｉも異常であれば、先ず当該ノード３ｉが異常であることが分かる。
【００３１】
したがって、全ノードのポーリング結果のノード稼動状況表示１３により並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができ、異常に対する対処を迅速に且つ適切に行なうことができる。
【００３２】
以上、本発明の実施の形態を説明したが、上記の実施の形態に限定されるものではなく、本発明の範囲内でその具体的構造、構成に種々の変更を加えてもよいことは勿論である。
【００３３】
【発明の効果】
（１）請求項１の発明によれば、並列計算機の異常検出システムを、個々にＣＰＵボードを有する複数のノードをネットワークでサーバに接続して用いる並列計算機における異常検出システムにおいて、前記サーバは前記ネットワークを介して前記各ノードに対し一定周期で稼働状況監視を行う稼働状況監視プログラムを備え同各ノードの稼働状況監視結果信号を出力するように設定され、前記ノードは常時自己診断を行なう機能を備え個々の同ノードの稼動状況の自己診断結果信号を常時一定時間毎に出力するように設定されるとともに、前記各ノードの稼働状況監視結果信号と前記個々のノードの稼動状況の自己診断結果信号とを切り換える切換装置と、同切換装置の出力を表示する前記モニタとを備えてなるように構成したので、通常はサーバの稼働状況監視結果による全ノードの概括的な稼動状況を知ることができ、特定のノードを指定して切換装置で切り換えればその稼動状況の自己診断結果の具体的状態を知ることができるため、並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができ、異常に対する対処を迅速に且つ適切に行なうことができる。
【００３４】
（２）請求項２の発明によれば、並列計算機の異常検出方法を、請求項１に記載の並列計算機の異常検出システムを用い、通常は前記切換装置により前記各ノードの稼働状況監視結果信号を前記モニタに出力し、同各ノードの稼働状況監視結果信号において異常を示すノードが現れた場合、前記切換装置により同異常を示すノードの稼動状況の前記自己診断結果信号を前記モニタに出力するように構成したので、請求項１の効果を奏するとともに、サーバの稼働状況監視結果により特定のノードに異常が発見された時、切換装置によりそのノードの自己診断結果を見て、比較することにより、異常箇所がノード自体にあるのか、ネットワークにあるのか特定することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の一形態に係る並列計算機の異常検出システムの構成概要図である。
【図２】従来の並列計算機の異常検出システムの構成概要図である。
【図３】（ａ）は図２における異常検出用のモニタ画面の例であり、（ｂ）は（ａ）中の一部の拡大図である。
【符号の説明】
１　　　　　　　　　　　　サーバ
２　　　　　　　　　　　　ハブ
３ａ、３ｂ、３ｃ〜３ｎ　　ノード
３ｉ　　　　　　　　　　　ノード
４　　　　　　　　　　　　ネットワーク
５　　　　　　　　　　　　モニタ
５ａ　　　　　　　　　　　モニタ画面
６　　　　　　　　　　　　切換装置
１３　　　　　　　　　　　　ノード稼動状況表示
[0001]
[Technical field of the invention]
The present invention relates to an abnormality detection system and an abnormality detection method capable of detecting an abnormal part in a parallel computer using a plurality of CPUs (central processing units) connected to a server via a network.
[0002]
[Prior art]
A conventional abnormality detection system for a parallel computer will be described with reference to FIGS. 2 and 3. FIG. 2 is a schematic configuration diagram of a conventional abnormality detection system for a parallel computer. FIG. 3A is an example of a monitor screen for abnormality detection in FIG. 2, and FIG. 3B is an enlarged view of a part of FIG. 3A.
[0003]
In recent years, for scientific and technical computing and the like, parallel computers that run a plurality of small computers in parallel have come to be used in place of the large computers conventionally used, such as supercomputers.
[0004]
When a parallel computer (Beowulf type) using a plurality of CPUs connected to a server via a network is built by networking office-use personal computers sold on the general market and systemizing them together with a management server, there is a problem that the required space increases.
[0005]
Therefore, a configuration is adopted in which a plurality of single CPU segments for industrial use, which are usually individually assembled, are connected via the hub 2 and managed by the server 1.
[0006]
FIG. 2 shows an outline of the configuration of such a parallel computer. In FIG. 2, reference numerals 3a to 3n denote a plurality of nodes constituting the parallel computer, and each of the nodes 3a to 3n is a CPU segment individually having a CPU board, a LAN (local area network) signal board, an HDD (hard disk drive), and the like. The nodes 3a to 3n are mounted on a backplane in one box in groups of several nodes (usually up to about four), and the total number may reach several tens.
[0007]
Each of the nodes 3a to 3n is connected to the hub 2 by a network 4, and is connected to the server 1 via the hub 2 and is placed under management and control of the server 1.
[0008]
For detecting abnormalities in parallel computers, there are also publicly available software management tools, but those that monitor the individual status of each CPU consume CPU time on each CPU for their own operation, and may hinder the operation of the main program, such as the actual computation.
[0009]
For this reason, in an abnormality detection system for a parallel computer as shown in FIG. 2, the operation status monitoring program of the server 1 normally polls (monitors the operation status of) each of the nodes 3a to 3n from the server 1 via the hub 2 and the network 4 at a fixed period (for example, every few seconds). The server 1 outputs a polling result signal X, and the operation status of each of the nodes 3a to 3n is displayed on a monitor screen 5a of a monitor 5 connected to the server 1. This program software runs on, for example, a WWW browser.
[0010]
FIG. 3A shows an example of the monitor screen 5a. To indicate that the nodes 3a to 3n are mounted in boxes A to P, a node operation status display 13 showing the polling (operation status monitoring) result of each node is provided for each of the boxes A to P.
[0011]
As illustrated in FIG. 3B, which enlarges the four node operation status displays 13 of one box A, the operation status of the CPU board, the LAN signal board, and the HDD is displayed for each node by LED (light emitting diode) illumination. The LED display is, for example, blue when the operation rate is less than 50%, yellow when it is 50 to 80%, red when it is 80% or more, and black when the system is down. The unit of the node operation status display 13 is set according to the system configuration level, for example finer, or coarser so that one display covers an entire node.
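The example color rule above can be sketched as a small function. This is only an illustrative sketch: the function name `led_color` and the use of `None` to represent a system-down node are assumptions made here, and the thresholds are the example values given in the text (50% falls in the yellow band, 80% in the red band).

```python
from typing import Optional


def led_color(operation_rate_percent: Optional[float]) -> str:
    """Map an operation rate (%) to the example LED color on the monitor screen.

    None stands for "system down" (an assumption made for this sketch).
    """
    if operation_rate_percent is None:
        return "black"   # system down
    if operation_rate_percent < 50:
        return "blue"    # less than 50%
    if operation_rate_percent < 80:
        return "yellow"  # 50% up to, but not including, 80%
    return "red"         # 80% or more
```

For instance, `led_color(65)` falls in the 50 to 80% band and yields yellow.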
[0012]
[Problems to be solved by the invention]
However, because the conventional abnormality detection system for a parallel computer described above relies on access from the server 1 side, even if a certain node shows a system-down indication (for example, LED display: black), it cannot be determined whether this is caused by a failure of a board or the HDD in the CPU segment of that node, or by an abnormality of the network 4 connecting that node and the hub 2. Therefore, there has been a problem that it is difficult to deal with an abnormality quickly and appropriately when it occurs.
[0013]
An object of the present invention is to solve this problem in the conventional abnormality detection system for a parallel computer, and to provide an abnormality detection system and an abnormality detection method for a parallel computer that can identify the abnormal part when the parallel computer is abnormal and enable a prompt and appropriate response.
[0014]
[Means for Solving the Problems]
(1) The present invention has been made to solve this problem. As a first means, it provides an abnormality detection system for a parallel computer in which a plurality of nodes each having a CPU board are connected to a server via a network, wherein the server includes an operation status monitoring program that monitors the operation status of each of the nodes at a fixed period via the network and is set to output an operation status monitoring result signal of each node; each node has a function of constantly performing self-diagnosis and is set to constantly output a self-diagnosis result signal of its own operation status at fixed time intervals; and the system comprises a switching device for switching between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of each individual node, and the monitor for displaying the output of the switching device.
[0015]
According to the first means, the general operation status of all nodes can normally be known from the operation status monitoring result of the server, and if a specific node is designated and the switching device is switched, the specific state of the self-diagnosis result of that node's operation status can be known. Therefore, while the operating state of the entire parallel computer is monitored, once an abnormality is detected, its state can be grasped in more detail and the abnormal part can be identified easily.
[0016]
(2) As a second means, the invention provides an abnormality detection method for a parallel computer using the abnormality detection system of the first means, wherein the switching device normally outputs the operation status monitoring result signal of each node to the monitor, and when a node indicating an abnormality appears in the operation status monitoring result signal of each node, the switching device outputs the self-diagnosis result signal of the operation status of the node indicating the abnormality to the monitor.
[0017]
According to the second means, the effect of the first means is obtained, and in addition, when an abnormality is found in a specific node based on the operation status monitoring result of the server, the self-diagnosis result of that node is viewed through the switching device and compared with it, which makes it possible to determine whether the abnormal part is in the node itself or in the network.
[0018]
[Embodiments of the invention]
An abnormality detection system and an abnormality detection method for a parallel computer according to an embodiment of the present invention will be described with reference to FIG. FIG. 1 is a schematic configuration diagram of the abnormality detection system for a parallel computer according to the present embodiment.
[0019]
In FIG. 1, the same parts as those in FIGS. 2 and 3 for explaining the above-described conventional example are denoted by the same reference numerals, and description thereof will be omitted. Differences will be mainly described below.
[0020]
In the abnormality detection system for a parallel computer according to the present embodiment, each of the nodes 3a to 3n, which are CPU segments having a CPU board, a LAN (local area network) signal board, an HDD (hard disk drive), and the like, has the watchdog timer function (a function of performing self-diagnosis by constantly issuing interrupts) that is provided as standard on industrial CPU boards. An interrupt program is written so that, using the watchdog timer function, the self-diagnosis result of the operation status of each individual node 3a to 3n (CPU segment) is constantly output to the monitor 5 at regular intervals (for example, every few seconds).
[0021]
In FIG. 1, Ya, Yb, and Yc to Yn are individual self-diagnosis result signals output from the nodes 3a, 3b, 3c to 3n by the watchdog timer function.
[0022]
X is the polling result signal (operation status monitoring result signal) output as a result of polling (operation status monitoring) of each of the nodes 3a, 3b, 3c to 3n by the server 1, as described with reference to FIGS. 1 and 2. In the present embodiment, it is preferable to adopt HTTP, which places a relatively light load on the network, as the protocol for monitoring by polling, and to configure the system so that data communication is completed in one shot per monitoring timing.
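One way to picture one-shot HTTP polling is a single GET request per node per monitoring cycle, with any failure recorded as "no response". The sketch below is a minimal illustration under stated assumptions, not the program described in the patent: the function names, the status URLs, and the idea that each node answers a plain HTTP GET are all hypothetical.

```python
import http.client
import urllib.request
from typing import Callable, Dict, Optional


def http_get_once(url: str, timeout_s: float = 2.0) -> Optional[str]:
    """Issue exactly one HTTP GET; return the body, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException, ValueError):
        # Covers URLError/HTTPError, timeouts, refused connections, bad URLs.
        return None


def poll_once(
    node_urls: Dict[str, str],
    fetch: Callable[[str], Optional[str]] = http_get_once,
) -> Dict[str, Optional[str]]:
    """One monitoring cycle: one request per node (hypothetical status URLs)."""
    return {node: fetch(url) for node, url in node_urls.items()}
```

A server-side loop would call `poll_once` every few seconds. Note that a `None` entry alone cannot tell a failed node from a failed network path, which is the limitation the description goes on to address.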
[0023]
The polling result signal X and the self-diagnosis result signals Ya, Yb, Yc to Yn are input to the switching device 6. The switching device 6 normally selects the polling result signal X and passes it to the monitor 5 as the output Z, and the monitor 5 displays the operation status of all the nodes 3a, 3b, 3c to 3n by LED illumination in the node operation status display 13 on the monitor screen 5a of FIG. 2 described above.
[0024]
The switching device 6 designates a specific node 3i and switches the output Z. When the node 3i is designated and a switching instruction is received, the self-diagnosis result signal Yi produced by the watchdog timer function of the node 3i is passed on as the output Z, the monitor screen 5a is switched, and the content of the self-diagnosis result signal Yi is displayed on the monitor screen 5a.
[0025]
The display contents of the self-diagnosis result signal Yi are determined by the watchdog timer function of each node. They need not merely indicate the general operation status of the CPU board, the LAN signal board, the HDD, and the like, but can also show more specific individual operating states or data.
[0026]
The switching device 6 may be an automatic switching device that automatically switches to the self-diagnosis result signal Yi for a node according to a set condition such as when the polling result signal X of any node exceeds a certain range. It may be a switch near the monitor 5 which can be operated by the operator at any time while watching the monitor screen 5a, or a changeover switch attached to the monitor 5, and may be set according to the control level of the abnormality detection system.
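The selection behavior of the switching device 6 (normally pass X, pass Yi when a node is designated, optionally switch automatically when X leaves a set range) can be sketched as follows. This is an illustrative sketch only: the function `select_output`, the representation of X as a mapping from node name to a numeric value, and the single upper threshold standing in for "a certain range" are assumptions made here.

```python
from typing import Dict, Optional, Union

PollingResult = Dict[str, float]  # signal X: node name -> polled value (assumed numeric)
SelfDiagnosis = Dict[str, str]    # signals Ya..Yn: node name -> self-diagnosis report


def select_output(
    x: PollingResult,
    y: SelfDiagnosis,
    selected_node: Optional[str] = None,
    auto_threshold: Optional[float] = None,
) -> Union[PollingResult, str]:
    """Return the output Z that would be passed to the monitor."""
    # Manual switching: the operator designated a specific node 3i.
    if selected_node is not None:
        return y[selected_node]
    # Automatic switching: first node whose polled value exceeds the set limit.
    if auto_threshold is not None:
        for node, value in x.items():
            if value > auto_threshold:
                return y[node]
    # Normal case: show the overview from the polling result signal X.
    return x
```

With no node designated and no threshold set, the function returns X unchanged, which corresponds to the normal overview display.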
[0027]
According to the abnormality detection system for a parallel computer according to the present embodiment as described above, the node operating status displays 13 of all the nodes 3a to 3n based on the polling result of the server 1 are normally displayed on the monitor screen 5a. It is possible to know the general operating status of all the nodes 3a to 3n, and to know the specific state of the self-diagnosis result Yi by specifying the specific node 3i and switching the switching device 6.
[0028]
Then, when an abnormality is found in the node operation status display 13 of a specific node 3i as a result of the polling by the server 1, the abnormal part can be identified by viewing the self-diagnosis result Yi of the node 3i through the switching device 6 and comparing the two.
[0029]
That is, when the polling result of the specific node 3i indicates an abnormality but the self-diagnosis result Yi of the node 3i is normal, it can be understood that an abnormality has occurred in the network 4.
[0030]
If the polling result of the specific node 3i indicates an abnormality and the self-diagnosis result Yi of the node 3i is also abnormal, it can be understood, first of all, that the node 3i itself is abnormal.
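The two cases in paragraphs [0029] and [0030] amount to a small decision rule, sketched below. The names `Fault` and `locate_fault` are assumptions made for illustration. The case in which polling is normal but the self-diagnosis is abnormal is not discussed in the description, so the sketch marks it as not covered rather than guessing.

```python
from enum import Enum


class Fault(Enum):
    NONE = "no abnormality indicated"
    NETWORK = "abnormality in network 4"
    NODE = "abnormality in the node itself"
    NOT_COVERED = "combination not covered by the description"


def locate_fault(polling_abnormal: bool, self_diag_abnormal: bool) -> Fault:
    """Compare the polling result for node 3i with its self-diagnosis result Yi."""
    if polling_abnormal:
        # Polling abnormal + self-diagnosis abnormal -> node; otherwise network.
        return Fault.NODE if self_diag_abnormal else Fault.NETWORK
    # Polling normal: the description only addresses the all-normal case.
    return Fault.NOT_COVERED if self_diag_abnormal else Fault.NONE
```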
[0031]
Therefore, while the operating state of the entire parallel computer is monitored by the node operation status display 13 of the polling results of all nodes, once an abnormality is detected, its state can be grasped in more detail and the abnormal part can be identified easily, so that the abnormality can be dealt with quickly and appropriately.
[0032]
The embodiment of the present invention has been described above. However, the present invention is not limited to the above-described embodiment, and it goes without saying that various changes may be made to the specific structure and configuration within the scope of the present invention.
[0033]
[Effects of the invention]
(1) According to the invention of claim 1, the abnormality detection system is configured for a parallel computer in which a plurality of nodes each having a CPU board are connected to a server via a network, such that the server includes an operation status monitoring program that monitors the operation status of each of the nodes at a fixed period via the network and is set to output an operation status monitoring result signal of each node; each node has a function of constantly performing self-diagnosis and is set to constantly output a self-diagnosis result signal of its own operation status at fixed time intervals; and the system comprises a switching device for switching between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of each individual node, and the monitor for displaying the output of the switching device. As a result, the general operation status of all nodes can normally be known from the operation status monitoring result of the server, and if a specific node is designated and the switching device is switched, the specific state of the self-diagnosis result of that node's operation status can be known. Therefore, while the operating state of the entire parallel computer is monitored, once an abnormality is detected, its state can be grasped in more detail and the abnormal part can be identified easily, so that the abnormality can be dealt with quickly and appropriately.
[0034]
(2) According to the invention of claim 2, the abnormality detection method for a parallel computer uses the abnormality detection system for a parallel computer according to claim 1 and is configured such that the switching device normally outputs the operation status monitoring result signal of each node to the monitor, and when a node indicating an abnormality appears in the operation status monitoring result signal of each node, the switching device outputs the self-diagnosis result signal of the operation status of the node indicating the abnormality to the monitor. As a result, the effect of claim 1 is obtained, and in addition, when an abnormality is found in a specific node based on the operation status monitoring result of the server, the self-diagnosis result of that node is viewed through the switching device and compared with it, which makes it possible to determine whether the abnormal part is in the node itself or in the network.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of an abnormality detection system for a parallel computer according to an embodiment of the present invention.
FIG. 2 is a configuration schematic diagram of a conventional parallel computer abnormality detection system.
FIG. 3A is an example of a monitor screen for abnormality detection in FIG. 2, and FIG. 3B is an enlarged view of a part of FIG. 3A.
[Explanation of symbols]
1 server
2 hub
3a, 3b, 3c to 3n node
3i node
4 network
5 monitor
5a monitor screen
6 switching device
13 node operation status display

Claims

1. An abnormality detection system for a parallel computer in which a plurality of nodes each having a CPU board are connected to a server via a network, wherein the server includes an operation status monitoring program that monitors the operation status of each of the nodes at a fixed period via the network and is set to output an operation status monitoring result signal of each node; each node has a function of constantly performing self-diagnosis and is set to constantly output a self-diagnosis result signal of its own operation status at fixed time intervals; and the system comprises a switching device for switching between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of each individual node, and the monitor for displaying an output of the switching device.

2. An abnormality detection method for a parallel computer using the abnormality detection system for a parallel computer according to claim 1, wherein the switching device normally outputs the operation status monitoring result signal of each node to the monitor, and when a node indicating an abnormality appears in the operation status monitoring result signal of each node, the switching device outputs the self-diagnosis result signal of the operation status of the node indicating the abnormality to the monitor.