JP2004038654A - System for detecting abnormality in parallel computer, and method for detecting the same

System for detecting abnormality in parallel computer, and method for detecting the same

Info

Publication number
JP2004038654A
Authority
JP
Japan
Prior art keywords
node
result signal
operation status
parallel computer
abnormality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
JP2002195945A
Other languages
Japanese (ja)
Inventor
Toshihiro Sato
佐藤 敏浩
Hideo Fukuda
福田 秀郎
Shigeyuki Nishijima
西島 茂行
Takeshi Arikawa
有川 毅
Yasuhiko Sato
佐藤 康彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RYOKEN TEKKU KK
SYSTEM FIVE KK
Mitsubishi Heavy Industries Ltd
Original Assignee
RYOKEN TEKKU KK
SYSTEM FIVE KK
Mitsubishi Heavy Industries Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2002-07-04
Filing date
2002-07-04
Publication date
2004-02-05
Application filed by RYOKEN TEKKU KK, SYSTEM FIVE KK, Mitsubishi Heavy Industries Ltd
Priority to JP2002195945A
Publication of JP2004038654A
Legal status: Withdrawn

Landscapes

  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)
  • Multi Processors (AREA)

Abstract

PROBLEM TO BE SOLVED: To enable a prompt and appropriate response by identifying the abnormal part when an abnormality occurs in a parallel computer.

SOLUTION: In an abnormality detection system for a parallel computer in which a plurality of nodes, each having a CPU board, are connected to a server via a network, the server includes an operation status monitoring program that monitors the operation status of each node at a fixed period via the network, and is set to output an operation status monitoring result signal for each node. Each node has a function of constantly performing self-diagnosis and is set to constantly output a self-diagnosis result signal of its own operation status at fixed time intervals. The system includes a switching device that switches between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of an individual node, and a monitor that displays the output of the switching device.

COPYRIGHT: (C)2004,JPO

Description

【0001】
【発明の属する技術分野】
本発明は、複数のCPU( central processing unit: 中央処理装置)をネットワークでサーバに接続して用いる並列計算機において異状箇所を検出できる異常検出システムおよび異常検出方法に関する。
【0002】
【従来の技術】
従来の並列計算機の異常検出システムについて、図2および図3に基づき説明する。図2は従来の並列計算機の異常検出システムの構成概要図であり、図3(a)は図2における異常検出用のモニタ画面の例、(b)は(a)中の一部の拡大図である。
【0003】
近年、科学技術計算用等において、従来用いられてきたスーパーコンピュータ等の大型計算機に代えて、複数の小型のコンピュータを並列に運転する並列計算機が用いられるようになった。
【0004】
複数のCPUをネットワークでサーバに接続して用いる並列計算機(Beowolf 型)は、一般市場に販売されているオフィス用途のパーソナルコンピュータをネットワーク接続し、管理サーバと共にシステム化して構成する場合、所要スペースが増大するという問題がある。
【0005】
そこで、通常個別に組み上げられている工業用の単体のCPUセグメントを複数、ハブ2を介して接続し、サーバ1で管理するように構成することが行なわれる。
【0006】
図2はそのような並列計算機の構成概要を示すものであり、図2において3a〜3nは、並列計算機を構成する複数のノードであり、各ノード3a〜3nは、個々にCPUボード、LAN(local area network)信号ボード、HDD(ハードディスクドライブ)等を有するCPUセグメントである。また各ノード3a〜3nは、数個(通常4個程度まで)毎に1つのボックス内のバックプレーンに搭載され、総計数十個になることがある。
【0007】
各ノード3a〜3nは、それぞれがネットワーク4によりハブ2と結ばれており、ハブ2を介してサーバ1に接続し、サーバ1の管理、指令下に置かれる。
【0008】
並列計算機の異常検出については、一般に公開されているソフト的な管理ツールもあるが、CPUの個別状態を監視を実現するものは、その動作に各CPU時間を使ってしまい、本来の計算などの主プログラムの動作を阻害する恐れもある。
【0009】
そのため、通常図2に示されるような並列計算機における異常検出システムは、サーバ1の稼働状況監視プログラムにより、サーバ1からハブ2、ネットワーク4を介して各ノード3a〜3nに一定周期(例えば、数秒毎)でポーリング(稼働状況監視)を行い、サーバ1はポーリング結果信号Xを出力し、サーバ1に接続したモニタ5のモニタ画面5aに各ノード3a〜3nの稼働状況を表示している。なお、このプログラムソフトは、例えばWWW ブラウザで稼動する。
【0010】
図3(a)はモニタ画面5aの例であり、各ノード3a〜3nはA〜Pのボックス内に搭載されたことを示すように、ボックスA〜P毎に各ノードのポーリング(稼働状況監視)結果を示すノード稼動状況表示13が設けられる。
【0011】
図3(b)に1つのボックスAの4個のノード稼動状況表示13を拡大して例示するように、各ノード毎に、CPUボード、LAN信号ボード、HDDの稼動状況が、LED(light emitting diode: 発光ダイオード)により発光表示される。LEDの発光表示は、例えば、稼働率50%未満は青、50〜80%は黄、80%以上は赤、システムダウンは黒、というようになされる。なお、稼動状況表示13の単位は、さらに細かくあるいは大きく1ノード全体で、といようにシステム構成レベルにしたがって設定される。
【0012】
【発明が解決しようとする課題】
しかしながら、上記のような従来の並列計算機の異常検出システムは、サーバ1側からのアクセスによるものであるため、あるノードにシステムダウンした表示(例えばLEDの発光表示:黒)があっても、そのノードのCPUセグメントの各ボードやHDDのダウンに起因するものか、そのノードとハブ2を結ぶネットワーク4の異常に起因するものかは、特定できない。そのため異常発生時の迅速、適切な対処が困難となるという問題があった。
【0013】
本発明は、かかる従来の並列計算機の異常検出システムにおける問題を解消し、並列計算機の異常時に異常箇所を特定でき、迅速、適切な対処を可能とする並列計算機の異常検出システムおよび異常検出方法を提供することを課題とするものである。
【0014】
【課題を解決するための手段】
(1)本発明はかかる課題を解決するためになされたものであり、その第1の手段として、個々にCPUボードを有する複数のノードをネットワークでサーバに接続して用いる並列計算機における異常検出システムにおいて、前記サーバは前記ネットワークを介して前記各ノードに対し一定周期で稼働状況監視を行う稼働状況監視プログラムを備え同各ノードの稼働状況監視結果信号を出力するように設定され、前記ノードは常時自己診断を行なう機能を備え個々の同ノードの稼動状況の自己診断結果信号を常時一定時間毎に出力するように設定されるとともに、前記各ノードの稼働状況監視結果信号と前記個々のノードの稼動状況の自己診断結果信号とを切り換える切換装置と、同切換装置の出力を表示する前記モニタとを備えてなることを特徴とする並列計算機の異常検出システムを提供する。
【0015】
第1の手段によれば、通常はサーバの稼働状況監視結果による全ノードの概括的な稼動状況を知ることができ、特定のノードを指定して切換装置で切り換えればその稼動状況の自己診断結果の具体的状態を知ることができるので、並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができる。
【0016】
(2)第2の手段としては、第1の手段の並列計算機の異常検出システムを用い、通常は前記切換装置により前記各ノードの稼働状況監視結果信号を前記モニタに出力し、同各ノードの稼働状況監視結果信号において異常を示すノードが現れた場合、前記切換装置により同異常を示すノードの稼動状況の前記自己診断結果信号を前記モニタに出力することを特徴とする並列計算機の異常検出方法を提供する。
【0017】
第2の手段によれば、第1の手段の作用を奏するとともに、サーバの稼働状況監視結果により特定のノードに異常が発見された時、切換装置によりそのノードの自己診断結果を見て比較することにより、異常箇所がノード自体にあるのか、ネットワークにあるのか特定することが可能となる。
【0018】
【発明の実施の形態】
図1に基づき、本発明の実施の一形態にかかる並列計算機の異常検出システムおよび異常検出方法を説明する。図1は本実施の形態の並列計算機の異常検出システムの構成概要図である。
【0019】
図1において、前述の従来例を説明する図2、図3と同様の部分には同じ符号を付して説明を省略し、異なる点を主に以下説明する。
【0020】
本実施の形態の並列計算機の異常検出システムにおいては、CPUボード、LAN(local area network)信号ボード、HDD(ハードディスクドライブ)等を有するCPUセグメントである各ノード3a〜3nは、工業用CPUボードに標準的に搭載されているウォッチドッグタイマ(Watch Dog Timer )機能(常時割り込みをかけて自己診断を行なう機能)を有しており、ウォッチドッグタイマ機能により個々のノード3a〜3n(CPUセグメント)の稼動状況の自己診断結果を一定時間毎(例えば、数秒毎)に常時モニタ5に出力するように、割り込みプログラムを作成してある。
【0021】
図1において、Ya、Yb、Yc〜Ynはそれぞれ、ノード3a、3b、3c〜3nがウォッチドッグタイマ機能により出力する個々の自己診断結果信号である。
【0022】
Xは図1、図2で説明したサーバ1による各ノード3a、3b、3c〜3nに対するポーリング(稼動状況監視)の結果出力されるポーリング結果信号(稼動状況監視結果信号)である。なお、本実施の形態においてポーリングの監視用プロコトルには、ネットワーク上に負荷が比較的軽い HTTP を採用し、監視タイミングあたり1ショットでデータ通信を完了するように構成すると好ましい。
【0023】
ポーリング結果信号Xと、自己診断結果信号Ya、Yb、Yc〜Ynとは、切換装置6に入力され、切換装置6は通常はポーリング結果信号Xを選択しモニタ5に出力Zし、モニタ5は前述の図2のモニタ画面5aにより、全ノード3a、3b、3c〜3nの稼動状況をノード稼動状況表示13のLEDで発光表示する。
【0024】
切換装置6は、特定のノード3iを指定し、出力Zの切り換えをおこなうものであり、ノード3iが指定され切換指示を受けると、ノード3iのウォッチドッグタイマ機能による自己診断結果信号Yiが出力Zされ、モニタ画面5aが切り換えられて、自己診断結果信号Yiの内容がモニタ画面5aに表示される。
【0025】
自己診断結果信号Yiの表示内容は各ノードのウォッチドッグタイマ機能によって設定されるものとなるが、単に概括的にCPUボード、LAN信号ボード、HDD等の稼動状況を示すだけでなく、個々のより具体的な稼動状態、またはデータを表示するものとできる。
【0026】
切換装置6は、いずれかのノードのポーリング結果信号Xが一定の範囲を越えた時等の設定条件により、そのノードに関して自己診断結果信号Yiへ自動切り換えを行なうような自動切換装置でもよく、またモニタ画面5aを見たオペレータが随時操作できるモニタ5近傍の、ないしはモニタ5付属の切換スイッチでもよく、異常検出システムの制御レベルによって設定すればよい。
【0027】
上記のような本実施の形態の体の並列計算機の異常検出システムによれば、通常はモニタ画面5aには、サーバ1のポーリング結果による全ノード3a〜3nのノード稼動状況表示13が表示され、全ノード3a〜3nの概括的な稼動状況を知ることができ、特定のノード3iを指定して切換装置6を切換操作すれば、その自己診断結果Yiの具体的状態を知ることができる。
【0028】
そして、サーバ1のポーリング結果により特定のノード3iのノード稼動状況表示13に異常が発見された時、切換装置6によりノード3iの自己診断結果Yiを見て比較することにより、異常箇所を特定することが可能となる。
【0029】
すなわち、特定のノード3iのポーリング結果が異常を示した時、ノード3iの自己診断結果Yiが正常であれば、ネットワーク4に異常が発生したことが分かる。
【0030】
特定のノード3iのポーリング結果が異常を示し、ノード3iの自己診断結果Yiも異常であれば、先ず当該ノード3iが異常であることが分かる。
【0031】
したがって、全ノードのポーリング結果のノード稼動状況表示13により並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができ、異常に対する対処を迅速に且つ適切に行なうことができる。
【0032】
以上、本発明の実施の形態を説明したが、上記の実施の形態に限定されるものではなく、本発明の範囲内でその具体的構造、構成に種々の変更を加えてもよいことは勿論である。
【0033】
【発明の効果】
(1)請求項1の発明によれば、並列計算機の異常検出システムを、個々にCPUボードを有する複数のノードをネットワークでサーバに接続して用いる並列計算機における異常検出システムにおいて、前記サーバは前記ネットワークを介して前記各ノードに対し一定周期で稼働状況監視を行う稼働状況監視プログラムを備え同各ノードの稼働状況監視結果信号を出力するように設定され、前記ノードは常時自己診断を行なう機能を備え個々の同ノードの稼動状況の自己診断結果信号を常時一定時間毎に出力するように設定されるとともに、前記各ノードの稼働状況監視結果信号と前記個々のノードの稼動状況の自己診断結果信号とを切り換える切換装置と、同切換装置の出力を表示する前記モニタとを備えてなるように構成したので、通常はサーバの稼働状況監視結果による全ノードの概括的な稼動状況を知ることができ、特定のノードを指定して切換装置で切り換えればその稼動状況の自己診断結果の具体的状態を知ることができるため、並列計算機全体の稼動状態を監視しつつ、一旦異常を検知したときは、そのより詳しい状態の把握と、異常箇所の特定を容易にすることができ、異常に対する対処を迅速に且つ適切に行なうことができる。
【0034】
(2)請求項2の発明によれば、並列計算機の異常検出方法を、請求項1に記載の並列計算機の異常検出システムを用い、通常は前記切換装置により前記各ノードの稼働状況監視結果信号を前記モニタに出力し、同各ノードの稼働状況監視結果信号において異常を示すノードが現れた場合、前記切換装置により同異常を示すノードの稼動状況の前記自己診断結果信号を前記モニタに出力するように構成したので、請求項1の効果を奏するとともに、サーバの稼働状況監視結果により特定のノードに異常が発見された時、切換装置によりそのノードの自己診断結果を見て、比較することにより、異常箇所がノード自体にあるのか、ネットワークにあるのか特定することが可能となる。
【図面の簡単な説明】
【図1】本発明の実施の一形態に係る並列計算機の異常検出システムの構成概要図である。
【図2】従来の並列計算機の異常検出システムの構成概要図である。
【図3】(a)は図2における異常検出用のモニタ画面の例であり、(b)は(a)中の一部の拡大図である。
【符号の説明】
1            サーバ
2            ハブ
3a、3b、3c〜3n  ノード
3i           ノード
4            ネットワーク
5            モニタ
5a           モニタ画面
6            切換装置
13           ノード稼動状況表示
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an abnormality detection system and an abnormality detection method capable of detecting an abnormal part in a parallel computer using a plurality of CPUs (central processing units) connected to a server via a network.
[0002]
[Prior art]
A conventional abnormality detection system for a parallel computer will be described with reference to FIGS. 2 and 3. FIG. 2 is a schematic configuration diagram of a conventional abnormality detection system for a parallel computer. FIG. 3(a) is an example of the monitor screen for abnormality detection in FIG. 2, and FIG. 3(b) is an enlarged view of part of FIG. 3(a).
[0003]
In recent years, for scientific and engineering computation and the like, parallel computers that run a plurality of small computers in parallel have come to be used in place of the large computers, such as supercomputers, that were used conventionally.
[0004]
A parallel computer (Beowulf type) that uses a plurality of CPUs connected to a server via a network has the problem that the required space increases when it is built by networking office-use personal computers sold on the general market and integrating them into a system together with a management server.
[0005]
Therefore, a configuration is adopted in which a plurality of single CPU segments for industrial use, which are usually individually assembled, are connected via the hub 2 and managed by the server 1.
[0006]
FIG. 2 shows an outline of the configuration of such a parallel computer. In FIG. 2, reference numerals 3a to 3n denote a plurality of nodes constituting the parallel computer, and each of the nodes 3a to 3n is a CPU segment that individually has a CPU board, a LAN (local area network) signal board, an HDD (hard disk drive), and the like. The nodes 3a to 3n are mounted on a backplane inside a box in groups of several (usually up to about four) per box, and may total several tens.
[0007]
Each of the nodes 3a to 3n is connected to the hub 2 by a network 4, and is connected to the server 1 via the hub 2 and is placed under management and control of the server 1.
[0008]
Publicly available software management tools exist for detecting abnormalities in parallel computers, but those that monitor the individual state of each CPU consume that CPU's time for their own operation and may hinder the operation of the main program, such as the actual computation.
[0009]
For this reason, in an abnormality detection system for a parallel computer as shown in FIG. 2, the operation status monitoring program of the server 1 normally polls (monitors the operation status of) each of the nodes 3a to 3n from the server 1 via the hub 2 and the network 4 at a fixed period (for example, every few seconds). The server 1 outputs a polling result signal X, and the operation status of each of the nodes 3a to 3n is displayed on a monitor screen 5a of a monitor 5 connected to the server 1. This program software runs, for example, in a WWW browser.
[0010]
FIG. 3(a) shows an example of the monitor screen 5a. To indicate that the nodes 3a to 3n are mounted in boxes A to P, a node operation status display 13 showing the polling (operation status monitoring) result of each node is provided for each of the boxes A to P.
[0011]
As illustrated in FIG. 3(b), which enlarges the four node operation status displays 13 of one box A, the operation status of the CPU board, the LAN signal board, and the HDD of each node is displayed by LEDs (light emitting diodes). The LED display is, for example, blue for an operating rate below 50%, yellow for 50 to 80%, red for 80% or more, and black for system down. The unit of the operation status display 13 is set according to the system configuration level, for example finer than this or as coarse as one display for an entire node.
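The example colour thresholds above can be sketched as a small lookup function. This is an illustrative sketch only: the function name `led_color`, the use of `None` to represent a system-down node, and the choice to treat exactly 80% as red are assumptions not stated in the text.

```python
from typing import Optional


def led_color(operating_rate: Optional[float]) -> str:
    """Map an operating rate in percent to the example LED colours.

    None stands for "system down" (no response); that encoding is an
    assumption made for this sketch.
    """
    if operating_rate is None:
        return "black"   # system down
    if not 0.0 <= operating_rate <= 100.0:
        raise ValueError("operating rate must be between 0 and 100 percent")
    if operating_rate < 50.0:
        return "blue"    # below 50%
    if operating_rate < 80.0:
        return "yellow"  # 50% up to, but not including, 80%
    return "red"         # 80% or more (boundary choice is an assumption)
```

A display layer could call this once per component (CPU board, LAN signal board, HDD) of each node.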
[0012]
[Problems to be solved by the invention]
However, because the conventional abnormality detection system for a parallel computer described above relies on access from the server 1 side, even if a node shows a system-down indication (for example, LED display: black), it cannot be determined whether this is caused by a failure of a board or the HDD in that node's CPU segment or by an abnormality in the network 4 connecting that node to the hub 2. This makes it difficult to respond quickly and appropriately when an abnormality occurs.
[0013]
An object of the present invention is to solve this problem in the conventional abnormality detection system for a parallel computer and to provide an abnormality detection system and an abnormality detection method for a parallel computer that can identify the abnormal part when the parallel computer is abnormal, enabling a quick and appropriate response.
[0014]
[Means for Solving the Problems]
(1) The present invention has been made to solve this problem. As a first means, it provides an abnormality detection system for a parallel computer in which a plurality of nodes, each having a CPU board, are connected to a server via a network, wherein: the server includes an operation status monitoring program that monitors the operation status of each node at a fixed period via the network, and is set to output an operation status monitoring result signal for each node; each node has a function of constantly performing self-diagnosis and is set to constantly output, at fixed time intervals, a self-diagnosis result signal of its own operation status; and the system includes a switching device that switches between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of an individual node, and a monitor that displays the output of the switching device.
[0015]
According to the first means, the general operation status of all nodes can normally be known from the server's operation status monitoring result, and the specific state given by the self-diagnosis result of a particular node can be known by designating that node and switching with the switching device. Therefore, while the operating state of the entire parallel computer is monitored, once an abnormality is detected it becomes easy to grasp the state in more detail and identify the abnormal part.
[0016]
(2) As a second means, the invention provides an abnormality detection method for a parallel computer that uses the abnormality detection system of the first means, wherein the switching device normally outputs the operation status monitoring result signal of each node to the monitor and, when a node indicating an abnormality appears in the operation status monitoring result signals, the switching device outputs the self-diagnosis result signal of the operation status of that node to the monitor.
[0017]
According to the second means, the effect of the first means is obtained, and in addition, when an abnormality is found in a specific node from the server's operation status monitoring result, viewing that node's self-diagnosis result through the switching device and comparing the two makes it possible to determine whether the abnormal part is in the node itself or in the network.
[0018]
BEST MODE FOR CARRYING OUT THE INVENTION
An abnormality detection system and an abnormality detection method for a parallel computer according to an embodiment of the present invention will be described with reference to FIG. FIG. 1 is a schematic configuration diagram of the abnormality detection system for a parallel computer according to the present embodiment.
[0019]
In FIG. 1, the same parts as those in FIGS. 2 and 3 for explaining the above-described conventional example are denoted by the same reference numerals, and description thereof will be omitted. Differences will be mainly described below.
[0020]
In the abnormality detection system for a parallel computer according to the present embodiment, each of the nodes 3a to 3n, which is a CPU segment having a CPU board, a LAN (local area network) signal board, an HDD (hard disk drive), and the like, has a watchdog timer function (a function of performing self-diagnosis by constantly issuing interrupts) that is provided as standard on industrial CPU boards. An interrupt program is written so that the watchdog timer function constantly outputs the self-diagnosis result of the operation status of each individual node 3a to 3n (CPU segment) to the monitor 5 at fixed time intervals (for example, every few seconds).
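The periodic self-diagnosis output can be pictured with the following sketch. It is a software stand-in only: the embodiment relies on the watchdog timer of the CPU board and an interrupt program, whereas this sketch uses an ordinary loop. The names `self_diagnose` and `report_periodically`, the record layout, and the per-component check callables are all hypothetical.

```python
import time
from typing import Callable, Dict, Iterator, Optional

Check = Callable[[], bool]  # returns True when the component looks healthy


def self_diagnose(node_id: str, checks: Dict[str, Check]) -> dict:
    """Run each component check once and build one self-diagnosis record (Yi)."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = bool(check())
        except Exception:
            components[name] = False  # a check that crashes counts as a failure
    return {
        "node": node_id,
        "components": components,
        "healthy": all(components.values()),
    }


def report_periodically(
    node_id: str,
    checks: Dict[str, Check],
    interval_s: float = 5.0,
    count: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield one record every interval_s seconds; count=None runs forever."""
    emitted = 0
    while count is None or emitted < count:
        yield self_diagnose(node_id, checks)
        emitted += 1
        if count is None or emitted < count:
            sleep(interval_s)
```

Because the records are produced on the node itself, they do not depend on the server's polling path, which is the property the embodiment uses later to tell node faults from network faults.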
[0021]
In FIG. 1, Ya, Yb, and Yc to Yn are individual self-diagnosis result signals output from the nodes 3a, 3b, 3c to 3n by the watchdog timer function.
[0022]
X is a polling result signal (operation status monitoring result signal) output as a result of polling (operation status monitoring) of each of the nodes 3a, 3b, 3c to 3n by the server 1 described with reference to FIGS. In this embodiment, it is preferable that the protocol for monitoring polling employs HTTP, which has a relatively light load on the network, so that data communication is completed in one shot per monitoring timing.
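A one-shot HTTP poll of each node might look like the sketch below, which uses only the Python standard library. The node URLs, the rule that only HTTP 200 counts as normal, and the injectable `fetch` parameter are assumptions for illustration; the text specifies only that HTTP is used and that each monitoring cycle completes in a single exchange.

```python
import urllib.request
from typing import Callable, Dict


def http_fetch(url: str, timeout_s: float) -> int:
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def poll_nodes(
    node_urls: Dict[str, str],
    timeout_s: float = 2.0,
    fetch: Callable[[str, float], int] = http_fetch,
) -> Dict[str, bool]:
    """Poll every node once; True means the node answered with HTTP 200."""
    result = {}
    for node_id, url in node_urls.items():
        try:
            result[node_id] = fetch(url, timeout_s) == 200
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            result[node_id] = False
    return result
```

Note that a `False` entry here cannot by itself distinguish a failed node from a failed network path, which is exactly the limitation the self-diagnosis signal is meant to address.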
[0023]
The polling result signal X and the self-diagnosis result signals Ya, Yb, Yc to Yn are input to the switching device 6. The switching device 6 normally selects the polling result signal X and passes it to the monitor 5 as output Z, and the monitor 5 displays the operation status of all the nodes 3a, 3b, 3c to 3n with the LEDs of the node operation status display 13 on the monitor screen 5a of FIG. 2 described above.
[0024]
The switching device 6 switches the output Z for a designated node 3i. When the node 3i is designated and a switching instruction is received, the self-diagnosis result signal Yi produced by the watchdog timer function of the node 3i is passed through as output Z, the monitor screen 5a is switched, and the content of the self-diagnosis result signal Yi is displayed on the monitor screen 5a.
[0025]
The display content of the self-diagnosis result signal Yi is determined by the watchdog timer function of each node. It need not merely indicate the general operation status of the CPU board, the LAN signal board, the HDD, and so on; it can also show more specific individual operating states or data.
[0026]
The switching device 6 may be an automatic switching device that automatically switches to the self-diagnosis result signal Yi for a node under a set condition, such as when the polling result signal X of that node exceeds a certain range. Alternatively, it may be a changeover switch near or attached to the monitor 5 that an operator watching the monitor screen 5a can operate at any time. The choice may be made according to the control level of the abnormality detection system.
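The selection behaviour of the switching device, covering both the manual and the automatic variant, can be sketched as follows. The class name `SwitchingDevice`, its methods, and the simplification of the "certain range" condition to a boolean per node (True meaning normal) are assumptions made for this sketch.

```python
from typing import Any, Dict, Optional


class SwitchingDevice:
    """Selects either polling result X (all nodes) or self-diagnosis Yi (one node)."""

    def __init__(self) -> None:
        # None means the overview (polling result X) is shown.
        self.selected_node: Optional[str] = None

    def select(self, node_id: str) -> None:
        """Manual switch: an operator designates node i."""
        self.selected_node = node_id

    def reset(self) -> None:
        """Return to the overview of all nodes."""
        self.selected_node = None

    def auto_switch(self, polling_x: Dict[str, bool]) -> None:
        """Automatic switch: pick the first node whose polling result is abnormal."""
        if self.selected_node is not None:
            return  # keep an existing selection
        for node_id, ok in polling_x.items():
            if not ok:
                self.selected_node = node_id
                return

    def output(self, polling_x: Dict[str, bool], self_diag_y: Dict[str, Any]) -> dict:
        """Produce output Z for the monitor."""
        if self.selected_node is None:
            return {"source": "X", "data": polling_x}
        node = self.selected_node
        return {"source": "Y", "node": node, "data": self_diag_y.get(node)}
```

In use, the monitor would render whatever `output` returns on each refresh, calling `auto_switch` first when automatic switching is enabled.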
[0027]
According to the abnormality detection system for a parallel computer according to the present embodiment as described above, the node operating status displays 13 of all the nodes 3a to 3n based on the polling result of the server 1 are normally displayed on the monitor screen 5a. It is possible to know the general operating status of all the nodes 3a to 3n, and to know the specific state of the self-diagnosis result Yi by specifying the specific node 3i and switching the switching device 6.
[0028]
Then, when an abnormality is found in the node operation status display 13 of a specific node 3i from the polling result of the server 1, the abnormal part can be identified by using the switching device 6 to view the self-diagnosis result Yi of the node 3i and comparing it with the polling result.
[0029]
That is, if the polling result of a specific node 3i indicates an abnormality while the self-diagnosis result Yi of the node 3i is normal, it can be concluded that an abnormality has occurred in the network 4.
[0030]
If the polling result of the specific node 3i indicates an abnormality and the self-diagnosis result Yi of the node 3i is also abnormal, it is first known that the node 3i is abnormal.
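The comparison described in the two preceding paragraphs amounts to a small decision table, sketched here. The function name `locate_fault` and the returned labels are hypothetical, and the case where polling is normal but self-diagnosis is abnormal is not covered by the text; its handling below is an assumption.

```python
def locate_fault(polling_ok: bool, self_diagnosis_ok: bool) -> str:
    """Combine polling result X and self-diagnosis result Yi for one node."""
    if polling_ok and self_diagnosis_ok:
        return "none"     # both views agree the node is fine
    if not polling_ok and self_diagnosis_ok:
        return "network"  # node reports healthy but the server cannot reach it
    if not polling_ok and not self_diagnosis_ok:
        return "node"     # both views report trouble: suspect the node first
    # Polling normal but self-diagnosis abnormal: not described in the text.
    # Treating it as a node-internal problem is an assumption of this sketch.
    return "node (reported by self-diagnosis only)"
```

The "network" result is only a valid conclusion because the self-diagnosis signal reaches the switching device without depending on the polled path.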
[0031]
Therefore, while monitoring the operation status of the entire parallel computer by the node operation status display 13 of the polling result of all nodes, once an abnormality is detected, it is easy to grasp the more detailed status and easily identify the location of the abnormality. Thus, it is possible to quickly and appropriately deal with the abnormality.
[0032]
An embodiment of the present invention has been described above. The present invention is not limited to this embodiment, however, and various changes may of course be made to its specific structure and configuration within the scope of the present invention.
[0033]
[Effects of the Invention]
(1) According to the invention of claim 1, the abnormality detection system for a parallel computer, in which a plurality of nodes each having a CPU board are connected to a server via a network, is configured such that: the server includes an operation status monitoring program that monitors the operation status of each node at a fixed period via the network, and is set to output an operation status monitoring result signal for each node; each node has a function of constantly performing self-diagnosis and is set to constantly output, at fixed time intervals, a self-diagnosis result signal of its own operation status; and the system includes a switching device that switches between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of an individual node, and a monitor that displays the output of the switching device. As a result, the general operation status of all nodes can normally be known from the server's operation status monitoring result, and the specific state given by the self-diagnosis result of a particular node can be known by designating that node and switching with the switching device. Therefore, while the operating state of the entire parallel computer is monitored, once an abnormality is detected it becomes easy to grasp the state in more detail and identify the abnormal part, and the abnormality can be dealt with quickly and appropriately.
[0034]
(2) According to the invention of claim 2, the abnormality detection method for a parallel computer uses the abnormality detection system according to claim 1 and is configured such that the switching device normally outputs the operation status monitoring result signal of each node to the monitor and, when a node indicating an abnormality appears in the operation status monitoring result signals, the switching device outputs the self-diagnosis result signal of the operation status of that node to the monitor. As a result, the effect of claim 1 is obtained, and in addition, when an abnormality is found in a specific node from the server's operation status monitoring result, viewing that node's self-diagnosis result through the switching device and comparing the two makes it possible to determine whether the abnormal part is in the node itself or in the network.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of an abnormality detection system for a parallel computer according to an embodiment of the present invention.
FIG. 2 is a configuration schematic diagram of a conventional parallel computer abnormality detection system.
FIG. 3(a) is an example of the monitor screen for abnormality detection in FIG. 2, and FIG. 3(b) is an enlarged view of part of FIG. 3(a).
[Explanation of symbols]
1            server
2            hub
3a, 3b, 3c to 3n  nodes
3i           node
4            network
5            monitor
5a           monitor screen
6            switching device
13           node operation status display

Claims (2)

1. 個々にCPUボードを有する複数のノードをネットワークでサーバに接続して用いる並列計算機における異常検出システムにおいて、前記サーバは前記ネットワークを介して前記各ノードに対し一定周期で稼働状況監視を行う稼働状況監視プログラムを備え同各ノードの稼働状況監視結果信号を出力するように設定され、前記ノードは常時自己診断を行なう機能を備え個々の同ノードの稼動状況の自己診断結果信号を常時一定時間毎に出力するように設定されるとともに、前記各ノードの稼働状況監視結果信号と前記個々のノードの稼動状況の自己診断結果信号とを切り換える切換装置と、同切換装置の出力を表示する前記モニタとを備えてなることを特徴とする並列計算機の異常検出システム。

1. An abnormality detection system for a parallel computer in which a plurality of nodes, each having a CPU board, are connected to a server via a network, wherein the server includes an operation status monitoring program that monitors the operation status of each of the nodes at a fixed period via the network and is set to output an operation status monitoring result signal for each node; each node has a function of constantly performing self-diagnosis and is set to constantly output, at fixed time intervals, a self-diagnosis result signal of its own operation status; and the system comprises a switching device that switches between the operation status monitoring result signal of each node and the self-diagnosis result signal of the operation status of an individual node, and a monitor that displays the output of the switching device.

2. 請求項1に記載の並列計算機の異常検出システムを用い、通常は前記切換装置により前記各ノードの稼働状況監視結果信号を前記モニタに出力し、同各ノードの稼働状況監視結果信号において異常を示すノードが現れた場合、前記切換装置により同異常を示すノードの稼動状況の前記自己診断結果信号を前記モニタに出力することを特徴とする並列計算機の異常検出方法。

2. An abnormality detection method for a parallel computer using the abnormality detection system for a parallel computer according to claim 1, wherein the switching device normally outputs the operation status monitoring result signal of each node to the monitor and, when a node indicating an abnormality appears in the operation status monitoring result signals, the switching device outputs the self-diagnosis result signal of the operation status of the node indicating the abnormality to the monitor.
JP2002195945A 2002-07-04 2002-07-04 System for detecting abnormality in parallel computer, and method for detecting the same Withdrawn JP2004038654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2002195945A JP2004038654A (en) 2002-07-04 2002-07-04 System for detecting abnormality in parallel computer, and method for detecting the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2002195945A JP2004038654A (en) 2002-07-04 2002-07-04 System for detecting abnormality in parallel computer, and method for detecting the same

Publications (1)

Publication Number Publication Date
JP2004038654A true JP2004038654A (en) 2004-02-05

Family

ID=31704186

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2002195945A Withdrawn JP2004038654A (en) 2002-07-04 2002-07-04 System for detecting abnormality in parallel computer, and method for detecting the same

Country Status (1)

Country Link
JP (1) JP2004038654A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7870424B2 (en) 2006-11-14 2011-01-11 Honda Motor Co., Ltd. Parallel computer system
WO2016020815A1 (en) * 2014-08-04 2016-02-11 Yogitech S.P.A. Method of executing programs in an electronic system for applications with functional safety comprising a plurality of processors, corresponding system and computer program product
US10248492B2 (en) 2014-08-04 2019-04-02 Intel Corporation Method of executing programs in an electronic system for applications with functional safety comprising a plurality of processors, corresponding system and computer program product
ITUB20154590A1 (en) * 2015-10-13 2017-04-13 Yogitech S P A PROCEDURE FOR THE EXECUTION OF PROGRAMS IN AN ELECTRONIC SYSTEM FOR FUNCTIONAL SAFETY APPLICATIONS INCLUDING A PLURALITY OF PROCESSORS, ITS RELATED SYSTEM AND IT PRODUCT
WO2017064623A1 (en) * 2015-10-13 2017-04-20 Yogitech S.P.A. Method for executing programs in an electronic system for applications with functional safety comprising a plurality of processors, corresponding system and computer program product
US10761916B2 (en) 2015-10-13 2020-09-01 Intel Corporation Method for executing programs in an electronic system for applications with functional safety comprising a plurality of processors, corresponding system and computer program product

Similar Documents

Publication Publication Date Title
US6532151B2 (en) Method and apparatus for clearing obstructions from computer system cooling fans
US10698788B2 (en) Method for monitoring server, and monitoring device and monitoring system using the same
US10430260B2 (en) Troubleshooting method, computer system, baseboard management controller, and system
JP2006277696A (en) Job execution monitoring system, job control device and program, and job execution method
JP4655718B2 (en) Computer system and control method thereof
JP2004038654A (en) System for detecting abnormality in parallel computer, and method for detecting the same
JP2010231293A (en) Monitoring device
CN112131048A (en) Control method and device for server indicator lamp
TWM509371U (en) Monitoring apparatus and computer apparatus
JP2010003092A (en) Screen display system
KR102137891B1 (en) Server managing Method, Server, and Recording medium using User Specialized Operating Mechanism on BMC environment
JP2006285321A (en) Safe instrumentation system
JP2006072545A (en) Power supply control method, power supply control device, and information processor
JPH09288601A (en) System monitoring device
KR100750955B1 (en) System for managing a projector remotely and method therefor
JP7034975B2 (en) Monitoring control system and monitoring control device
JP2008293441A (en) Method and apparatus for predicting device fault
JP2008306449A (en) Apparatus and method for monitoring network device
JP2008059531A (en) Computer system failure notification method
JP2007122505A (en) Server with state display function
JPH04332227A (en) Failure information transmission destination control system
JP2000076570A (en) Video alarm display device and video alarm display method
CN103810081A (en) Abnormal condition warning method
US20190171593A1 (en) Method for remotely triggered reset of a baseboard management controller of a computer system, and computer system using the same
JP2006113773A (en) Console switcher

Legal Events

Date Code Title Description
2004-09-29 A711 Notification of change in applicant (Free format text: JAPANESE INTERMEDIATE CODE: A711)
2004-09-29 A521 Written amendment (Free format text: JAPANESE INTERMEDIATE CODE: A821)
2005-05-13 A521 Written amendment (Free format text: JAPANESE INTERMEDIATE CODE: A523)
2005-06-23 A521 Written amendment (Free format text: JAPANESE INTERMEDIATE CODE: A523)
2005-09-06 A300 Application deemed to be withdrawn because no request for examination was validly filed (Free format text: JAPANESE INTERMEDIATE CODE: A300)
Effective date: 20050906