JP2008165531A

JP2008165531A - Method for failover (restoration) of defective node in computer system having a plurality of nodes

Info

Publication number: JP2008165531A
Application number: JP2006355054A
Authority: JP
Inventors: Yoichi Miwa; 洋一三輪; Aya Minami; 彩南; Takeshi Inagaki; 猛稲垣
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-12-28
Filing date: 2006-12-28
Publication date: 2008-07-17
Anticipated expiration: 2026-12-28
Also published as: CN101211282A; JP5078347B2; CN101211282B

Abstract

PROBLEM TO BE SOLVED: To provide a method for recovering from a failed node in a torus-type computer system. SOLUTION: The computer system includes a torus network and a tree network made up of a plurality of IO nodes, and each compute node forms a link with a terminal IO node of the tree network. When one of the compute nodes fails during a computation, the IO node linked to the failed node is announced to the nodes adjacent to the failed node as an alternate node, identified by the address of the failed node extended by one dimension. When an adjacent node receives a packet addressed to the failed node, it routes the packet to the alternate node, which processes the job designated by the packet. The alternate node then checks the address of the compute node to which the packet containing the job result is to be sent, and sends that packet to whichever compute node connected to the alternate node is closest to that address. COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a computer system having a plurality of nodes. In particular, it relates to a method for recovering from a failed node in a torus network (so-called failover).
Here, "failover" refers to a function by which an alternate node takes over processing when a node fails.

Conventionally, torus networks (hereinafter also simply "torus") have been used in large-scale parallel computing systems (computer systems). A torus network is a communication network in which communication nodes (hereinafter also called "compute nodes" or "IO nodes") are placed at the lattice points of a three-dimensional space of some solid shape, and nodes at adjacent lattice points are connected to each other. A torus network is most desirably configured as a square in the two-dimensional case and as a cube in the three-dimensional case.

FIG. 1 shows the overall configuration of the computer system 10 and the information processing apparatus 20. The computer system 10 has a plurality of communication nodes (in the example described later, the communication nodes are placed at three-dimensional lattice points (x, y, z)). The computer system 10 executes a program, for example for numerical computation, on each of the communication nodes. The information processing apparatus 20 sends a program execution request to each communication node in the computer system 10. Besides the processing to be executed, the request specifies which other communication node's results are to be used as input and to which other communication node the results are to be passed. In other words, the computer system 10 executes the program in parallel in response to requests from the information processing apparatus 20 and returns the results to the information processing apparatus 20. The program can thus be executed far more efficiently than on a single communication node.

FIG. 2 shows the torus network portion of the computer system 10. The computer system 10 has communication nodes 12 (hereinafter "compute nodes" or "torus nodes") and communication links 13 (hereinafter "links"). Each communication node 12 executes a program in parallel with the other communication nodes. Each communication node 12 is typically a processor (CPU, MPU, or central processing unit). A communication node 12 may also be a storage device such as DRAM, and may additionally include a processor.

The information processing apparatus 20 has a CPU and a hard disk. A conventional multi-node computer system 10 includes a torus network 10 made up of a plurality of compute nodes 12, and a single IO node. Apart from the torus network links 13 (FIG. 2), the compute nodes that make up the main part of the computer system are joined by tree network links 15 (see the connections among the compute nodes 12 in FIG. 3) to form a tree network together with the single top-level IO node 14. The computer system 10 is connected to the information processing apparatus 20 through this tree network.

Because each node in a torus network is connected only to its adjacent compute nodes (the nodes at the nearest lattice points), the routing overhead at each node is small and the configuration is simple, so the hardware is easy to implement. The network itself is also scalable, so torus networks are widely used in massively parallel computer systems such as IBM Blue Gene/L. However, because each compute node is connected only to its neighbors, it is difficult to provide an alternate node when a compute node fails.

In general, in a system designed with redundancy, when a node fails, a node is assigned to take its place, and subsequent processing is carried out on the alternate node instead of the failed node. FIG. 4 shows a case in which one of the nodes placed at two-dimensional lattice positions (x, y) has failed. Because the torus network connects only adjacent nodes, an alternate node cannot be placed at the logically identical three-dimensional lattice position (x, y, z) of the torus. Either no alternate node can be assigned, or, if one can, routing to it becomes very complicated, and the resulting overhead significantly degrades performance. The problem is especially pronounced when an alternate must be provided for a failed torus node (compute node) placed at a three-dimensional lattice point.

To cope with this problem, IBM's parallel computer systems are operated as follows. IBM Blue Gene/L, for example, is a large-scale integrated system with many nodes. The hardware of such a parallel computer system allows nodes to be added scalably, but the more nodes there are, the higher the probability that a failure occurs. When a particular node fails, the power is turned off, the node is replaced, and the computation is resumed from the checkpoint most recently written (backed up) to the hard disk (HDD). The failure rate rises as the number of nodes grows, and this substantially lowers the throughput of the whole system.
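To make the scaling argument concrete, the sketch below assumes that nodes fail independently and with the same per-node probability over some interval. That independence assumption, the function name, and the sample values are illustrative and do not come from the source. Under the assumption, the chance of at least one failure among N nodes is 1 - (1 - p)^N, which grows quickly with N.

```python
def prob_any_failure(num_nodes: int, p_node: float) -> float:
    """Probability that at least one of num_nodes fails.

    Assumes independent failures with identical per-node probability p_node
    (an illustrative assumption, not a claim about any real machine).
    """
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be in [0, 1]")
    return 1.0 - (1.0 - p_node) ** num_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability for one checkpoint interval.
    for n in (64, 1024, 65536):
        print(n, prob_any_failure(n, 1e-5))
```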

Patent Document 1 provides a method for reconfiguring a multiprocessor parallel network when a hardware failure occurs, with the aim of recovering the network from that failure. In a parallel computer system consisting of a large number of nodes, a group containing the failed processor is replaced by a group containing redundant processors so that the system can recover from the hardware failure. To this end, Patent Document 1 uses switch modules to partition the torus dynamically without rewiring. For example, when an error occurs at one node of a 4 × 4 × 4 three-dimensional torus, the torus is split into 1 × 4 × 4 and 3 × 4 × 4 (so that the 1 × 4 × 4 part contains the failed node), and the computation is redone on the 3 × 4 × 4 part. With this approach the number of nodes changes. Alternatively, a 5 × 4 × 4 torus is split from the outset into 1 × 4 × 4 + 4 × 4 × 4, and when an error occurs it is repartitioned so that the failed node falls in the 1 × 4 × 4 part. Both approaches recover from a node that fails in the middle of a computation, but the computation performed up to that point is wasted.
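The node counts in the partitioning example can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration of the slab split described above; it is not code from Patent Document 1. It removes a one-node-thick slab along one axis and reports the dimensions of both parts.

```python
from math import prod


def split_off_slab(dims, axis=0):
    """Split a torus of size dims into a 1-thick slab and the remainder.

    Returns (slab_dims, rest_dims). The slab is the part assumed to contain
    the failed node; which slab index is chosen does not affect the sizes.
    """
    if dims[axis] < 2:
        raise ValueError("cannot split an axis of length < 2")
    slab = tuple(1 if i == axis else d for i, d in enumerate(dims))
    rest = tuple(d - 1 if i == axis else d for i, d in enumerate(dims))
    return slab, rest


if __name__ == "__main__":
    # 4x4x4 without a spare slab: the usable partition shrinks.
    slab, rest = split_off_slab((4, 4, 4))
    print(slab, rest, prod(rest))
    # 5x4x4 with one spare slab: a full 4x4x4 partition remains.
    slab, rest = split_off_slab((5, 4, 4))
    print(slab, rest, prod(rest))
```

In the first case the usable partition drops from 64 to 48 nodes, which is the change in node count noted above; in the second, the spare slab keeps the usable partition at 64 nodes.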

Patent Document 1: Japanese Patent Publication No. 2004-532447

As described above, the existing approaches do not prevent the computation in progress from being wasted when a hardware failure such as a node failure occurs in a parallel system having a plurality of nodes. Moreover, because they require substantial changes to the configuration of the existing torus network, they cannot improve computation performance. In fields such as scientific computing and financial engineering, where computations run for a long time and users want to keep obtaining results continuously without stopping, the loss to users and system operators is particularly large.

Accordingly, an object of the present invention is to provide a torus network (computer system) that can solve the above problems.
A further object of the present invention is to provide a method for recovering from (failing over) a failed node in a computer system (torus network) having a plurality of nodes, which can solve the above problems.

To achieve these objects, the present invention provides a failover method for a computer system that has a torus network, made up of a plurality of compute nodes placed at three-dimensional lattice points (addresses) and linked between adjacent lattice points, and a tree network made up of a plurality of IO nodes, each compute node forming a link with a terminal IO node of the tree network. The method applies when one compute node fails while a computation is running. It comprises the steps of: detecting a failed compute node during execution of the computation; designating the IO node linked to the failed compute node (failed node) as an alternate node identified by the address of the failed node extended by one dimension; and, when a compute node adjacent to the failed node (adjacent node) receives a packet addressed to the failed node, routing the packet to the alternate node.
In this method, the compute nodes of the computer system form an a × b × c array connected as a three-dimensional torus, and each compute node forms six links to its adjacent compute nodes in the positive and negative x, y, and z directions.
Each terminal IO node of the computer system forms links with a predetermined number of compute nodes in the a × b array of a z-plane of the three-dimensional torus, so that each compute node has seven links in total.
In this method, the step of designating the IO node as the alternate node includes notifying the compute nodes adjacent to the failed node of the address (x, y, z, 1) of the alternate node, the alternate node being the IO node that forms a link with the failed node (x, y, z).
The method further includes the step of processing, at the alternate node, the job designated by the packet that has reached the alternate node.
The method further includes the step in which the alternate node checks the address of the compute node to which the packet containing the job result is to be sent, and sends that packet to whichever compute node connected to the alternate node is closest to that address.
In this method, when the adjacent compute node is connected to the alternate node, the routing step sends the packet directly to the alternate node.
In this method, when the adjacent node is connected to an IO node other than the alternate node, the routing step sends the packet that has reached the adjacent node to that other IO node, from which it is sent to the alternate node via the tree network.
In this method, each compute node and each IO node includes at least one CPU and a memory.

To achieve these objects, the present invention also provides a program for (a) failing over when one of the compute nodes fails during execution of a computation in a computer system that has a torus network, made up of a plurality of compute nodes placed at three-dimensional lattice points (addresses) and linked between adjacent lattice points, and a tree network made up of a plurality of IO nodes, each compute node forming a link with a terminal IO node of the tree network. The program causes the computer to execute the steps of:
(b) detecting a failed compute node (failed node);
(c) notifying the nodes adjacent to the failed node (adjacent nodes) that the IO node forming the link with the failed node is an alternate node identified by the address of the failed node extended by one dimension;
(d) when an adjacent node receives a packet addressed to the failed node, routing the packet to the alternate node;
(g) processing, at the alternate node, the job designated by the packet that has reached the alternate node; and
(h) having the alternate node check the address of the compute node to which the packet containing the job result is to be sent, and send that packet to whichever compute node connected to the alternate node is closest to that address.

To achieve these objects, the present invention further provides a computer system comprising a torus network made up of a plurality of compute nodes placed at three-dimensional lattice points and linked between adjacent lattice points, and a tree network made up of a plurality of IO nodes, each compute node forming a link with a terminal IO node of the tree network. The computer system further comprises means for designating, when a compute node fails during execution of a computation, the IO node forming the link with the failed node as an alternate node identified by the address of the failed node extended by one dimension.
The computer system further comprises: means by which an adjacent compute node, on receiving a packet addressed to the failed node, routes the packet to the alternate node; and means by which the alternate node checks the address to which the packet containing the result of the job designated by the packet is to be sent, selects from the addresses of the compute nodes connected to the alternate node the one closest to the destination, and sends the packet to the compute node at that address.

According to the present invention, an alternate node can be assigned at the time of a failure without changing the torus configuration of a computer system made up of a plurality of compute nodes.
Further, because the torus configuration is substantially unchanged, changes to the torus-type computer system are kept to a minimum.
Further, according to the present invention, the state of the computation up to the point at which the failed node was detected (the checkpoint) can be restored quickly in a torus-type computer system, and the subsequent computation can be resumed from that checkpoint.

The present invention is described below through embodiments. The following embodiments (examples) do not limit the invention as set out in the claims, and not all combinations of the features described in the embodiments are necessarily essential to the solution provided by the invention.

The following method makes it possible to assign an alternate node at the time of a failure without substantially changing the configuration of the nodes at the three-dimensional lattice points (x, y, z) of the torus.
1. At least one link (connection) is added to each node making up the torus network (a "compute node").
2. The added link is connected to an IO node outside the torus network. The IO node that forms a link with a failed compute node (failed node) serves as the alternate node for the failed node, and the translation rule described below is applied.
3. Each IO node 14 outside the torus network is connected in a star arrangement to a plurality of compute nodes 12 on the torus network. As in FIG. 3, the IO nodes outside the torus network are connected to each other in a tree.
4. Routing from a compute node on the torus network to an IO node outside the torus is handled essentially as torus routing, as detailed below. This routing method is the characteristic feature of the present invention. It keeps changes to the torus node (compute node) configuration of an existing parallel network system to a minimum. In other words, the method preserves, in a virtual sense, the existing torus node configuration of the computer system in which the failure occurred, and performs failover (replacement of the failed node) automatically.

FIG. 5 shows an example hardware configuration of a computer system including a three-dimensional torus according to the present invention. In a three-dimensional torus, each node (x, y, z) has a link in the positive and negative direction of each axis, for six links in total. The three-dimensional torus is connected to the information processing apparatus 20 (FIG. 3) via a tree network made up of a plurality of IO nodes 14.
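The six torus links of a node can be enumerated as in the sketch below. It assumes 0-based coordinates and wraparound at the edges of each axis, which is the usual definition of a torus but is not spelled out in the text; the function name is hypothetical. It also assumes every dimension is at least 3, so that the six neighbors are distinct.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of node in a 3D torus of size dims.

    node is (x, y, z) with 0 <= coordinate < dims[axis]; edges wrap around.
    """
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            coords = list(node)
            # Modulo arithmetic implements the wraparound link.
            coords[axis] = (coords[axis] + step) % dims[axis]
            neighbors.append(tuple(coords))
    return neighbors


if __name__ == "__main__":
    # A corner node of a 4x4x4 torus still has six distinct neighbors.
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

The seventh link described in this embodiment leads outside the torus to an IO node, so it is not part of this neighbor list.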

As shown in FIG. 3, in conventional IBM computer systems the tree network consists of torus nodes (compute nodes) 12 except for the top-level IO node 14. In the computer system of the present invention, by contrast, the tree network is built from the IO nodes 14 alone, without using the torus nodes (compute nodes) 12.

In this embodiment one link is added to each compute node 12, so each compute node has seven links. Six of the seven links connect to torus nodes (the adjacent compute nodes 12) as before. The added link connects the torus node (compute node) 12 to a terminal IO node 14 of the tree network, as shown in FIG. 5. The terminal IO nodes function as the alternate nodes S1, S2, S3, and S4 (see FIG. 6). If the IO nodes 14 use the same hardware as the compute nodes 12 on the torus, they may be given seven links, as in FIGS. 5 and 6, for design efficiency. As with the IO nodes of IBM Blue Gene/L, the IO nodes carry more memory than the compute nodes on the torus. Of the seven links of an IO node connected to nodes on the torus, six connect to nodes on the torus and the remaining one connects to a terminal IO node. As FIG. 5 shows, an IO node 14 that is not connected to any node (compute node) on the torus has all seven links connected to other IO nodes, and the IO nodes 14 together form a tree network structure. Each compute node 12 may also form links with two or more terminal IO nodes 14.

FIG. 6 shows the three-dimensional torus cut along the plane z = 2. With reference to FIG. 6, the following sequence explains how a terminal IO node (alternate node) is assigned when one compute node fails, how packets addressed to the failed node are routed to the alternate node, and how the alternate node routes the results of the job.

In normal operation (no failed node), all processing stays within the compute nodes placed at the three-dimensional lattice points (x, y, z) of the torus network. The IO nodes and the tree network are not used for normal processing. In existing massively parallel systems such as IBM Blue Gene/L, however, the IO nodes of the tree network are used to send the results from each compute node to the information processing apparatus 20 (CPU and HDD).

Each IO node has as many virtual torus addresses as it has torus nodes (compute nodes) connected to it. Referring to FIG. 6, when the IO node S2 acts as an alternate node, its virtual torus addresses are obtained by adding one dimension to the address (x, y, z) of each compute node linked to it, giving (x, y, z, 1). For example, the alternate node S2 in FIG. 6 stands in for the compute nodes through six addresses: (6, 5, 2, 1), (6, 6, 2, 1), (6, 7, 2, 1), (6, 8, 2, 1), (6, 9, 2, 1), and (6, 10, 2, 1). Because the terminal IO node S2 holds multiple virtual torus addresses in this way, routing proceeds smoothly when a node fails.
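The address rule is simple enough to state as code. The sketch below derives the virtual torus addresses of S2 from the compute nodes attached to it in the FIG. 6 example; the function and variable names are hypothetical.

```python
def virtual_address(compute_addr):
    """Extend a compute-node address (x, y, z) by one dimension: (x, y, z, 1)."""
    x, y, z = compute_addr
    return (x, y, z, 1)


def virtual_addresses(attached_nodes):
    """Virtual torus addresses held by an IO node, one per attached compute node."""
    return [virtual_address(addr) for addr in attached_nodes]


# Compute nodes linked to S2 in the FIG. 6 example: (6, 5, 2) through (6, 10, 2).
S2_ATTACHED = [(6, y, 2) for y in range(5, 11)]

if __name__ == "__main__":
    print(virtual_addresses(S2_ATTACHED))
```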

During a computation, the computer system of the present invention backs up the processing results of these six compute nodes to the storage unit of the alternate node S2 (see FIG. 7) at fixed intervals. A conventional torus-type computer system instead backs up the processing results of each compute node, via the IO node, to the HDD of the information processing apparatus 20 (see FIGS. 1 and 3). When a hardware failure occurs, the backed-up results are used, once the failure has been repaired, to identify the checkpoint from which the rest of the computation should resume.

In normal operation of the conventional IBM Blue Gene/L as well, each node periodically sends the minimum information needed to resume processing to the IO node it is connected to, in case the system fails. The terminal IO node keeps the information sent by each torus node (compute node) in its own memory. In the conventional system this information was then written to the HDD, and processing had to be suspended until the information for every node had been written out. After a failure, the checkpoint from which processing can resume is determined from the information stored on the HDD, and once the failed node has been replaced, all compute nodes restart from that checkpoint. Note that a conventional torus-type parallel computer system has a torus network made up of compute nodes placed at three-dimensional lattice points, and a tree network made up of those compute nodes and a single IO node. As shown in FIG. 3, this tree network is formed by tree links among the compute nodes 12 and the single dedicated top-level IO node 14.

In the present invention, processing is suspended only while the information is written to the memory of the alternate node (terminal IO node), so the interruption is much shorter. The IO nodes 14 of the tree network can then write the information from each compute node (torus node) to the HDD at a longer interval. Because the IO nodes 14 can write to the HDD asynchronously with normal processing, the computation as a whole does not need to stop in the meantime.
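The two-tier checkpointing described here, with a fast write to IO-node memory and a slower write to disk that does not block the computation, might be sketched as follows. This is a minimal illustration under assumed names (`IONodeCheckpointStore`, `checkpoint`, `restore`, `flush_to_disk`) and an assumed JSON file format; the source does not specify any of these details.

```python
import json
import os
import tempfile
import threading


class IONodeCheckpointStore:
    """In-memory checkpoint store held by a terminal IO node (illustrative only)."""

    def __init__(self):
        self._latest = {}              # compute-node address -> (step, state)
        self._lock = threading.Lock()

    def checkpoint(self, addr, step, state):
        """Fast path: keep the newest state of one compute node in memory."""
        with self._lock:
            self._latest[addr] = (step, state)

    def restore(self, addr):
        """Return the newest (step, state) saved for a compute node."""
        with self._lock:
            return self._latest[addr]

    def flush_to_disk(self, path):
        """Slow path: write a snapshot to disk; JSON requires string keys."""
        with self._lock:
            snapshot = {
                ",".join(map(str, addr)): {"step": step, "state": state}
                for addr, (step, state) in self._latest.items()
            }
        # The lock is released before the file write, so checkpoints can continue.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(snapshot, f)


if __name__ == "__main__":
    store = IONodeCheckpointStore()
    store.checkpoint((6, 7, 2), step=100, state={"partial_sum": 3.5})
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    # Flush in a background thread so the caller is not blocked by disk I/O.
    flusher = threading.Thread(target=store.flush_to_disk, args=(path,))
    flusher.start()
    flusher.join()
    print(store.restore((6, 7, 2)))
```

Taking the snapshot under the lock and writing the file outside it is one way to keep the in-memory path short; a real system would choose its own consistency and durability policy.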

In FIG. 6, adjacent nodes (nearest nodes) also exist above and below the failed node in the z-axis direction, but they are omitted here to simplify the explanation.
1. Assume that the node (6, 7, 2) has failed.
2. When the system detects the failure of node (6, 7, 2), it informs the nodes adjacent to the failed node, (6, 6, 2), (6, 8, 2), (7, 7, 2), and (5, 7, 2), that IO node S2 becomes the alternative node for (6, 7, 2), and they store this information.
3. Processing returns to the checkpoint last written to the terminal IO node (alternative node) and is resumed from there.
4. Nodes other than the failed node and its adjacent nodes operate exactly as in normal operation.
5. A packet addressed to the failed node reaches an adjacent node by normal routing.
6. When an adjacent node receives a packet addressed to the failed node (6, 7, 2), it increases the address by one dimension and routes the packet to the alternative node S2 at (6, 7, 2, 1).
7. A packet addressed to the alternative node (6, 7, 2, 1) is sent out from the adjacent node on its seventh link, in accordance with normal torus routing, and reaches the alternative node S2. The terminal IO node S2, addressed as the alternative node (6, 7, 2, 1), is connected to the failed node by the seventh link, either directly or via another IO node. IO node S2 also forms the seventh link of the calculation nodes adjacent to the failed node (adjacent nodes). Because the alternative node has multiple torus addresses, packets can be sent to the star-connected IO node S2 as if it were part of the torus.
8. An adjacent node that forms a direct link with the same alternative node S2 as the failed node, here (6, 8, 2) or (6, 6, 2), sends the packet directly to S2.
9. An adjacent node connected to a terminal IO node different from that of the failed node (S1 or S3), here (5, 7, 2) or (7, 7, 2), cannot send the packet directly to the alternative node, so the following routing is performed.
(9-1) A packet addressed to (6, 7, 2) that has reached (5, 7, 2) or (7, 7, 2) is sent to S1 or S3 as a packet addressed to (6, 7, 2, 1).
(9-2) The packet sent to S1 or S3 is forwarded via the tree network (S5) to S2: (6, 7, 2, 1), the alternative node for the failed node.
10. A packet that reaches the alternative node is processed there.
11. The alternative node S2 checks the destination address of the packet containing the processing result, selects the address closest to that destination from the six addresses of the torus nodes (calculation nodes) connected to S2, and sends the packet to that address. If the destination address is (5, 9, 2), S2 selects (6, 9, 2) from the nodes linked to it, (6, 10, 2), (6, 9, 2), (6, 8, 2), (6, 6, 2), and (6, 5, 2), and sends the packet there.
12. The torus node (calculation node) that receives the packet from the alternative node S2 handles it by normal routing.
With this method of routing to an alternative node, failover of a node on a torus network becomes possible; previously it was either impossible or impractical because of very large overhead.
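Steps 6 through 9 can be traced with a small model. The sketch below hard-codes the wiring of the FIG. 6 example; the dictionaries `IO_LINK`, `IO_ADDRESS`, and `TREE_PARENT` and both function names are hypothetical, and S5 is assumed to be the direct parent of S1, S2, and S3 in the tree network.

```python
FAILED = (6, 7, 2)
ALTERNATIVE_ADDR = FAILED + (1,)     # (6, 7, 2, 1): address increased by one dimension

# Seventh link of each adjacent node (assumed wiring, following the FIG. 6 example).
IO_LINK = {(6, 6, 2): "S2", (6, 8, 2): "S2", (5, 7, 2): "S1", (7, 7, 2): "S3"}
# IO node designated by each four-dimensional address.
IO_ADDRESS = {ALTERNATIVE_ADDR: "S2"}
# Assumption: S5 is the common parent of the terminal IO nodes.
TREE_PARENT = {"S1": "S5", "S2": "S5", "S3": "S5"}


def extend_address(dest, conversion_rules):
    """Step 6: replace a failed node's address with its alternative address."""
    return conversion_rules.get(dest, dest)


def route_from_adjacent(adjacent, dest, conversion_rules):
    """Return the hops from an adjacent node to the alternative node."""
    new_dest = extend_address(dest, conversion_rules)
    if new_dest == dest:
        return [adjacent]            # not a failed destination; normal routing is not modelled
    target_io = IO_ADDRESS[new_dest]
    first_io = IO_LINK[adjacent]     # step 7: leave the torus on the seventh link
    hops = [adjacent, first_io]
    if first_io != target_io:        # step 9: reach S2 through the tree network
        hops += [TREE_PARENT[first_io], target_io]
    return hops


rules = {FAILED: ALTERNATIVE_ADDR}
print(route_from_adjacent((6, 8, 2), FAILED, rules))
print(route_from_adjacent((5, 7, 2), FAILED, rules))
```

The first call corresponds to step 8 (a direct hop to S2) and the second to steps 9-1 and 9-2 (S1, then S5, then S2).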

FIG. 7 shows the functional configuration of the calculation node 12 and the IO node 14 (both also referred to simply as nodes). A communication node (calculation node 12 or IO node 14) has a storage unit 300, a receiving unit 310, a selection unit 320, and a transmission unit 330. The storage unit 300 stores information indicating the topology of the communication paths from the communication node 12, via the torus network 13 and the tree network 15, to each of the other links and communication nodes. A more detailed example is shown in FIG. 8.

FIG. 8 shows an example of the data structure of the storage unit 300. For each node (lattice point) that can be the destination of a packet, the storage unit 300 stores a conversion rule that converts the destination when the optimal next node is selected from among the six adjacent nodes.

In the storage unit 300 of each node adjacent to the failed node (x, y, z), a conversion rule is applied so that, when the adjacent node receives a packet destined for the failed node (x, y, z), the destination becomes the address (x, y, z, 1), obtained by increasing (x, y, z) by one dimension. This address specifies the IO node that forms a link with the failed node. In the present invention, this IO node is assigned as the alternative node for the failed node. The conversion rule for the failed node (x, y, z) thus uses the four-dimensional notation (x, y, z, 1) to designate the IO node that forms a direct link with it.

As a specific example, FIG. 8 shows the case where a failed node (6, 7, 2) exists: a conversion table indicating that IO node S2 has been assigned as the alternative node (see FIG. 6) is sent to the calculation nodes adjacent to the failed node, (6, 8, 2), (6, 6, 2), (7, 7, 2), and (5, 7, 2), and each of them holds the conversion table designating S2 in its storage unit 300.
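As a rough illustration of how such a conversion table could be distributed, the sketch below computes the six torus neighbours of a failed node and records the rule (x, y, z) → (x, y, z, 1) in each neighbour's table. The function names, the dictionary layout, and the torus dimensions `(8, 16, 8)` are assumptions made for this example; the specification does not give the array size.

```python
def torus_neighbors(addr, dims):
    """Six adjacent lattice points of addr in an a x b x c torus (with wraparound)."""
    x, y, z = addr
    a, b, c = dims
    return [
        ((x + 1) % a, y, z), ((x - 1) % a, y, z),
        (x, (y + 1) % b, z), (x, (y - 1) % b, z),
        (x, y, (z + 1) % c), (x, y, (z - 1) % c),
    ]


def install_conversion_rule(tables, failed, dims):
    """Store the rule failed -> failed + (1,) in every adjacent node's table."""
    alternative = failed + (1,)      # address increased by one dimension
    for node in torus_neighbors(failed, dims):
        tables.setdefault(node, {})[failed] = alternative
    return alternative


tables = {}                          # node address -> that node's conversion table
install_conversion_rule(tables, (6, 7, 2), dims=(8, 16, 8))
print(tables[(6, 8, 2)])
```

Only the adjacent nodes receive an entry, which matches the variation mentioned in the text in which conversion rules are stored only for coordinates that need conversion.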

The conversion rule in this figure is only an example, and many variations of the data structure of the storage unit 300 are conceivable. For example, the storage unit 300 may store conversion rules only for coordinates that need conversion. Nor is the four-dimensional notation required, as long as the IO node directly linked to the node (x, y, z) can be designated. For instance, the storage unit 300 may simply hold a flag indicating whether the node (x, y, z) is normal (0) or failed (1). When an adjacent node receives a packet destined for node (x, y, z) and this flag is 1 (on), node (x, y, z) is judged to have failed. When a calculation node adjacent to the failed node receives a packet addressed to the failed node, it forwards the packet to IO node S2, which is linked to the failed node, as the alternative node.

Returning to FIG. 7, the receiving unit 310 receives a communication packet in association with its destination. As the destination, the receiving unit 310 receives the coordinate values (x, y, z) of the destination node as placed in the three-dimensional lattice space. For example, for a communication packet destined for node (6, 7, 2) at the adjacent node (6, 8, 2), the receiving unit 310 receives the coordinates (6, 7, 2, 1). Based on the conversion rules stored in the storage unit 300, the selection unit 320 selects the node to which the communication packet is to be forwarded next on the communication path to the received destination.

FIG. 9 shows a flowchart of packet routing, using the conversion rule described above, when a node fails during execution of a large-scale calculation such as a scientific or engineering computation. Consider the typical case of a computer system, as shown in FIG. 6, that has a torus network with a plurality of calculation nodes and a plurality of IO nodes arranged as a tree network, each calculation node having at least one link to a terminal IO node. The following describes how IO node S2 acts as the alternative node for the failed calculation node (6, 7, 2) shown in FIG. 6.
(A) One calculation node (6, 7, 2) fails while the computer system is executing a calculation.
(B) The information processing apparatus 20 (FIG. 1), or a monitoring system resident in that apparatus, detects the failed node (6, 7, 2). The monitoring system can detect the failure through the tree network formed by the IO nodes.
(C) The monitoring system informs the adjacent nodes that IO node S2 is the alternative node.
Of the four adjacent nodes, (6, 8, 2) and (6, 6, 2) are notified of the conversion rule (failed node (6, 7, 2) → alternative node (6, 7, 2, 1)) through IO node S2 of the tree network, and each stores it in its storage unit 300. The same rule is sent via S2 → S5 → S3 or S1 to the other two adjacent nodes, (7, 7, 2) or (5, 7, 2), and stored in their respective storage units. Routing between the IO nodes 14 (S1, S2, S3, ...) constituting the tree network 15 is set in advance in the hardware of each IO node.
(D) On receiving a packet addressed to the failed node (x, y, z), each of the six adjacent nodes routes it to the alternative node according to the conversion rule, following case (E) or case (F).
(E) Adjacent nodes (6, 8, 2) and (6, 6, 2), which are connected to the alternative node S2 (6, 7, 2, 1): the packet is sent directly to the alternative node S2.
(F) Adjacent node (7, 7, 2), which is connected to an IO node S3 other than the alternative node S2: a packet addressed to the failed node (6, 7, 2) that has reached the adjacent node (7, 7, 2) is sent to S3 as a packet addressed to the alternative node S2 (6, 7, 2, 1). The packet sent to S3 finally reaches the alternative node S2 (6, 7, 2, 1) via the tree network.
(G) A packet that reaches the alternative node S2 (6, 7, 2, 1) is processed by S2.
(H) The alternative node S2 checks the address to which the result of the job specified by the packet is to be sent as a new packet.
For example, if the packet is addressed to calculation node (5, 9, 2), S2 selects the calculation node closest to the destination, (6, 9, 2), from the addresses of the six torus nodes connected to S2, and sends the packet via that calculation node to the target torus node (5, 9, 2).
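The selection in step (H) can be sketched as a nearest-address search. The specification does not define how "closest" is measured; the sketch below assumes hop count on the torus (per-axis distance with wraparound, summed over the axes). The function names and the dimensions `(8, 16, 8)` are hypothetical.

```python
def torus_distance(p, q, dims):
    """Hop count between two lattice points, allowing wraparound on each axis."""
    return sum(
        min(abs(a - b), size - abs(a - b))
        for a, b, size in zip(p, q, dims)
    )


def pick_exit_node(linked_nodes, dest, dims, failed=()):
    """Choose the working node linked to the IO node that is closest to dest."""
    candidates = [n for n in linked_nodes if n not in failed]
    return min(candidates, key=lambda n: torus_distance(n, dest, dims))


# Nodes linked to S2 in the example of the text.
linked_to_s2 = [(6, 10, 2), (6, 9, 2), (6, 8, 2), (6, 6, 2), (6, 5, 2)]
exit_node = pick_exit_node(linked_to_s2, dest=(5, 9, 2), dims=(8, 16, 8))
print(exit_node)   # (6, 9, 2), one hop from the destination (5, 9, 2)
```

With this metric the example reproduces the choice given in the text: (6, 9, 2) is one hop from (5, 9, 2), while every other node linked to S2 is at least two hops away.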

As described above, according to this embodiment and its modifications, an alternative node can be assigned when a failure occurs without changing the torus configuration of a computer system composed of a plurality of calculation nodes.
Furthermore, even when a node fails, processing can continue because an alternative node is assigned automatically with minimal overhead, which minimizes the impact on the throughput of a parallel network system having multiple nodes.

Although the present invention has been described using an embodiment, the technical scope of the present invention is not limited to the scope described in that embodiment. It will be apparent to those skilled in the art that various modifications or improvements can be made to the embodiment. It is apparent from the claims that forms incorporating such modifications or improvements can also fall within the technical scope of the present invention.

FIG. 1 shows the overall configuration of the computer system 10 and the information processing apparatus 20.
FIG. 2 shows the components of the torus network of the computer system 10.
FIG. 3 shows that the IO nodes outside the torus network are connected to one another in a tree.
FIG. 4 shows a case where one of the nodes arranged at two-dimensional lattice points (x, y) has failed.
FIG. 5 shows the hardware configuration of the three-dimensional torus of the present invention.
FIG. 6 shows the three-dimensional torus network of the present invention cut along the z = 2 plane.
FIG. 7 shows the functional configuration of the calculation node 12 and the IO node 14.
FIG. 8 shows an example of the data structure of the storage unit 300.
FIG. 9 shows a flowchart for failing over a failed node by routing, according to the conversion rule, packets addressed to a node that failed in the middle of a calculation.

Explanation of symbols

10 Computer system
20 Information processing apparatus
12 Calculation node
13 Torus network link
14 IO node
15 Tree network link
300 Storage unit
310 Receiving unit
320 Selection unit
330 Transmission unit

Claims

A method for failing over when one calculation node fails during execution of a calculation, in a computer system having a torus network composed of a plurality of calculation nodes arranged at three-dimensional lattice points (addresses) and forming links between adjacent lattice points, and a tree network composed of a plurality of IO nodes, each of the calculation nodes forming a link with an IO node at the end of the tree network, the method comprising:
detecting a failed calculation node during the calculation;
using the IO node linked to the failed calculation node (failed node) as an alternative node specified by an address obtained by increasing the address of the failed node by one dimension; and
when a calculation node adjacent to the failed node (adjacent node) receives a packet addressed to the failed node, routing the packet to the alternative node.

The method of claim 1, wherein the plurality of calculation nodes of the computer system form an a × b × c array connected as a three-dimensional torus, each calculation node forming six links to the adjacent calculation nodes in the x, y, and z directions;
the terminal IO node of the computer system forms a link with a predetermined number of calculation nodes in an a × b array in a z plane of the three-dimensional torus; and
each calculation node has a total of seven links.

The method according to claim 1 or 2, wherein the step of using the IO node as an alternative node includes:
taking the IO node that forms a link with the failed node (x, y, z) as the alternative node, and informing the calculation nodes adjacent to the failed node of the address (x, y, z, 1) of the alternative node.

The method according to claim 3, further comprising: processing at the alternative node a job specified by the packet that has reached the alternative node.

The method of claim 4, further comprising a step in which the alternative node confirms the address of the calculation node to which a packet including the processing result of the job is to be sent, and sends the packet to the calculation node, among those connected to the alternative node, that is closest to that address.

The method according to claim 4, wherein the routing step is a step of sending the packet to the alternative node when the adjacent computing node is connected to the alternative node.

The method of claim 4, wherein, when the adjacent node is connected to an IO node other than the alternative node, the routing step is a step of sending the packet that has reached the adjacent node to that other IO node and then, via the tree network, to the alternative node.

The method of claim 1, wherein the compute node and the IO node include at least one CPU and memory.

A program for causing a computer system to fail over when one of the calculation nodes fails during execution of a calculation, the computer system having a torus network composed of a plurality of calculation nodes arranged at three-dimensional lattice points (addresses) and forming links between adjacent lattice points, and a tree network composed of a plurality of IO nodes, each of the calculation nodes forming a link with an IO node at the end of the tree network, the program causing the computer system to execute:
(B) detecting a failed calculation node (failed node);
(C) informing the nodes adjacent to the failed node (adjacent nodes) that the IO node forming a link with the failed node is an alternative node specified by an address obtained by increasing the address of the failed node by one dimension;
(D) when an adjacent node receives a packet addressed to the failed node, routing the packet to the alternative node;
(G) processing, at the alternative node, the job specified by the packet that has reached the alternative node; and
(H) confirming, at the alternative node, the address of the calculation node to which a packet including the processing result of the job is to be sent, and sending the packet to the calculation node, among those connected to the alternative node, that is closest to that address.

The program according to claim 9, wherein the routing step is a step of sending the packet to the alternative node when the adjacent computing node is connected to the alternative node.

The program according to claim 9, wherein, when the adjacent node is connected to an IO node other than the alternative node, the routing step is a step of sending the packet that has reached the adjacent node to that other IO node and then, via the tree network, to the alternative node.

A computer system comprising:
a torus network composed of a plurality of calculation nodes arranged at three-dimensional lattice points and forming links between adjacent lattice points; and
a tree network composed of a plurality of IO nodes,
wherein each of the calculation nodes forms a link with an IO node at the end of the tree network,
the computer system further comprising means for, when a calculation node fails during execution of a calculation, using the IO node that forms a link with the failed node as an alternative node specified by an address obtained by increasing the address of the failed node by one dimension.

The computer system of claim 12, further comprising:
means by which the adjacent calculation node, upon receiving a packet addressed to the failed node, routes the packet to the alternative node; and
means by which the alternative node confirms the address to which a packet containing the processing result of the job specified by the packet is to be sent, selects, from the addresses of the plurality of calculation nodes connected to the alternative node, the calculation node whose address is closest to that destination, and sends the packet to the calculation node at that address.

The computer system according to claim 13, wherein the routing means sends the packet to the alternative node when the adjacent calculation node is connected to the alternative node.

The computer system of claim 13, wherein, when the adjacent node is connected to an IO node other than the alternative node, the routing means sends the packet that has reached the adjacent node to that other IO node and then, via the tree network, to the alternative node.