JP2006277635A

JP2006277635A - Information processing system and job execution method

Info

Publication number: JP2006277635A
Application number: JP2005099577A
Authority: JP
Inventors: Katsuhiko Okada; 克彦岡田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-03-30
Filing date: 2005-03-30
Publication date: 2006-10-12
Anticipated expiration: 2025-03-30
Also published as: JP4336894B2

Abstract

PROBLEM TO BE SOLVED: To provide a system and a method that allow high-speed execution of a multi-node job. SOLUTION: Each node is provided with a CPU 11 and a transfer control part 12. In the request source node, the transfer control part receives an MPI dedicated instruction from the CPU, stores the information in a dedicated instruction waiting buffer 13, and notifies mask information to a crossbar switch 20. In the crossbar switch, the mask information is set in a data notification flag register 22, and a computation execution instruction is notified to all the nodes via broadcast communication. When computation across all the participating nodes is finished, an end notification part 26 notifies the operation result to all the nodes via broadcast. In each node, a return data JID comparison part 18 checks whether the operation result corresponds to an instruction the node issued. If so, the data are returned to the CPU; otherwise the received operation result is discarded. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an information processing apparatus, and more particularly to a multi-node system, in which a plurality of computers together constitute one system, and to a job execution method.

Multi-node job execution using MPI (Message Passing Interface) functions in a multi-node system, in which a plurality of computers together constitute one system, is described with reference to FIG. 4, taking the MPI transfer functions MPI_REDUCE and MPI_ALL_REDUCE as examples. FIG. 4(A) shows the processing flow, and FIG. 4(B) schematically shows the data communication in the multi-node system.

In FIG. 4, this MPI transfer instruction is shown to be broadly divided into three phases:
(1) the instruction distribution phase,
(2) the computation phase at each node, and
(3) the result transfer phase.

In the instruction distribution phase (1), when the above instruction is executed, the CPU of the request source node sends each node a computation instruction telling it to perform the Max/Sum operation within the node (see steps S21 and S22 in FIG. 4(A) and "(1) distribution" in FIG. 4(B)).

The computation instruction is sent together with information specifying the node to which the result should be sent when the computation finishes. The instruction therefore differs for each node, and the issuing node sends the notifications to the nodes one-to-one (1:1).

Because the data communication time for transferring these notifications is not negligible, implementations of the above instructions do not have the issuing node notify every node in the distribution phase (1). Instead, the work of issuing instructions is shared, for example along a binary tree, so that multiple nodes send the distribution notifications and the transfer load is reduced.

In the computation phase (2), each node performs the Max/Sum operation within the node and obtains the node's final value (step S23).

In the transfer phase (3), the Max/Sum results are transferred in the direction opposite to (1) and intermediate Max/Sum values are computed (step S24). When all the Max/Sum values have arrived at the request source node (the "yes" branch of step S25), the request source node process combines them to obtain the final Max/Sum value (step S26). With the MPI_ALL_REDUCE instruction, every node needs this result value, so it is then delivered to all nodes, for example by broadcast.
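The conventional transfer phase described above can be sketched as follows. This is only an illustration under stated assumptions: the nodes are numbered 0 to N-1 with node 0 as the request source, the tree is a binary tree in which the parent of node i is node (i - 1) // 2, and the function name `tree_reduce` is hypothetical rather than taken from the specification.

```python
from typing import Callable, List, Tuple


def tree_reduce(
    local_values: List[float], op: Callable[[float, float], float]
) -> Tuple[float, int]:
    """Fold per-node values toward node 0 along a binary tree.

    Returns (final value at node 0, number of point-to-point transfers).
    """
    values = list(local_values)  # values[i] is the partial result held by node i
    transfers = 0
    # Visit children from the highest index down, so that every subtree is
    # already folded into a child before that child sends to its parent.
    for child in range(len(values) - 1, 0, -1):
        parent = (child - 1) // 2  # assumed tree layout, not from the source
        values[parent] = op(values[parent], values[child])
        transfers += 1  # each fold is one connection-oriented transfer
    return values[0], transfers
```

Under these assumptions every one of the N - 1 folds is a separate node-to-node transfer that needs its own connection, and an MPI_ALL_REDUCE would still need a further broadcast of the final value.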

Performance is critical for multi-node job execution using MPI functions (execution of a job across multiple nodes), and the transfer time between nodes becomes the performance bottleneck. In particular, a multi-node job frequently needs the sum or the maximum of data distributed across the nodes, and the transfer time incurred by these operations becomes too large relative to the computation time to be ignored.

Conventional multi-node systems have the following problems in implementing the MPI transfer functions.

In the phase that transfers the computation results to the request source node, the Max/Sum results are transferred in the direction opposite to the distribution phase, and the final Max/Sum value is obtained at the request source node. In a connection-type network, delivering data from all nodes (let the number of nodes be N) to one node normally requires N:1 communication (log N:1 even when inter-node transfers are made more efficient with a binary tree). In a network that re-establishes a connection for every transfer, this takes a long time (first problem).

Because the route (node) along which the return value is to be returned must be conveyed in the distribution phase, the instruction to execute the computation cannot be issued to all nodes at once by broadcast communication (second problem).

Accordingly, an object of the present invention is to provide an information processing system and method that eliminate the above problems, which arise from the connection-type network, in the transfer of per-node computation results and in the distribution phase of computation instructions, and thereby enable high-speed multi-node job execution.

To achieve the above object, the invention disclosed in the present application is generally configured as follows.

A system according to one aspect of the present invention includes a plurality of nodes and a crossbar switch. The crossbar switch receives a computation request from one request source node and distributes the request to the other nodes by broadcast communication. The crossbar switch then collects the computation results from the nodes to which the request was distributed, performs a computation on them, and notifies the plurality of nodes of the result by broadcast. In the present invention, each node checks whether the computation result received from the crossbar switch corresponds to a request made by that node, and discards the result if it does not.

In the present invention, the crossbar switch includes a computation unit that computes on the results collected from the nodes until they are reduced to a single result, and multiple-execution means, managed by a system-unique ID designated for each job, that allows the computation unit to operate in parallel in response to requests from multiple nodes while that reduction is in progress.

A method according to another aspect of the present invention is a job execution method for a multi-node system including a plurality of nodes and a crossbar switch, wherein each of the nodes has a CPU and a transfer control unit, the method including:
a step in which the transfer control unit of the request source node receives an MPI (Message Passing Interface) dedicated instruction issued from the CPU and stores the information in a dedicated instruction waiting buffer;
a step in which the transfer control unit of the request source node notifies the crossbar switch of mask information designating the nodes to be treated as invalid;
a step in which the crossbar switch sets the mask information in a data notification flag register;
a step in which the crossbar switch notifies all nodes of a computation execution instruction by broadcast communication;
a step in which each node that has executed the computation notifies the computation unit of the crossbar switch of its computation result when a notification dedicated instruction is sent from the CPU; and
a step in which the computation unit of the crossbar switch receives the notifications from the notification instruction creation units of the nodes, performs the computation across the nodes according to the setting of the data notification flag register, and, when the computation across all participating nodes is finished, broadcasts the computation result to all nodes.

A node according to another aspect of the present invention is a node connected to a crossbar switch and includes a CPU and a transfer control unit. The transfer control unit includes: a dedicated instruction waiting buffer that stores a transfer dedicated instruction issued from the CPU and its related information; a mask information creation unit that notifies the crossbar switch of mask information designating the nodes whose computation results are not to be waited for; a notification instruction creation unit that receives a notification dedicated instruction from the CPU and generates a notification instruction to be sent to the crossbar switch; and a return data comparison unit that determines whether a computation result notified from the crossbar switch corresponds to an instruction issued by the CPU of its own node and discards the result if it does not.

A crossbar switch according to another aspect of the present invention includes a data notification flag register, a computation unit, an end notification unit, and a data storage buffer consisting of a notification reception buffer that stores notification data from the nodes and a computation result storage buffer that stores computation result data. The crossbar switch receives mask information from the request source node and sets it in the data notification flag register, and, on receiving a request from the request source node, notifies a plurality of nodes by broadcast communication to execute the computation. At each node that has executed the computation, the computation result is notified to the computation unit of the crossbar switch when a notification dedicated instruction is sent from the CPU of the node. The computation unit receives the notifications from the notification instruction creation units of the nodes and performs the computation across the nodes according to the setting of the data notification flag register. When the computation across all participating nodes is finished, the end notification unit broadcasts the computation result to all nodes.

According to the present invention, the MPI transfer functions MPI_ALL_REDUCE and MPI_REDUCE can be executed at high speed because transfers are performed without taking a connection lock.

Further, according to the present invention, the instruction to execute the computation can be issued to all nodes at once by broadcast communication.

[Principle of the Invention]
The present invention concerns a multi-node system that has a connection-type inter-node data transfer network, realized with a crossbar switch, in which data is transferred only after the transfer destination is locked (only one transfer source node can hold the lock) so that the destination node does not receive data from two or more nodes at the same time. In such a system, the invention enables high-speed execution of the MPI transfer functions MPI_ALL_REDUCE and MPI_REDUCE, which are used as means of inter-node transfer.

In a connection-type network, the transfer destination node must be locked (referred to as a "connection lock") so that transfers from other nodes do not reach it at the same time. A transfer that goes only as far as the crossbar switch, however, needs no connection lock, and a node can perform it at any time.

The present invention focuses on this point and shortens the transfer time by performing transfers without taking a connection lock.

Specifically:
- A transfer control mechanism sends each node's transfer data into the crossbar switch rather than to another node.

- A computation unit is provided in the crossbar switch so that the computation can be carried out within the switch until a single result remains.

- A multiple-execution mechanism, managed by a system-unique ID designated for each job (abbreviated "JOBID" or "JID"), allows the computation unit in the crossbar switch to operate in parallel in response to requests from multiple nodes while the results are being reduced to one.

- A mechanism returns the result, depending on the instruction type, either to the CPU of the request source node or to the CPUs of all nodes.

In the request source node 10_1 in FIG. 1, the CPU 11 issues an MPI dedicated instruction for processing an MPI_ALL_REDUCE/MPI_REDUCE instruction.

Similarly, the CPU 11 of each node 10_i issues a notification dedicated instruction to notify the crossbar switch 20, which interconnects the nodes, of the Max/Sum (maximum/sum) result it has computed.

The transfer control unit 12 of the request source node 10_1 receives the MPI dedicated instruction or notification dedicated instruction issued from the CPU 11. For an MPI dedicated instruction, the dedicated instruction waiting buffer 13 in the transfer control unit 12 stores the information. At the same time, the mask information creation unit 14 in the transfer control unit 12 notifies the data notification flag register 22 in the crossbar switch 20 of the mask information, which is then set in the data notification flag register 22 (the setting of the mask information is described in detail later).

In addition, a Max/Sum computation execution instruction is notified to all nodes by broadcast communication through the crossbar switch 20. At that time, the CPU 11 of each node 10_i is informed whether the instruction being executed is MPI_ALL_REDUCE or MPI_REDUCE.

Likewise, when a notification dedicated instruction (an instruction for reporting the result of the computation executed by the CPU 11) is sent from the CPU 11 of a node 10_i, the notification instruction creation unit 15 of that node notifies the Max/Sum computation unit 24 in the crossbar switch 20 of the Max/Sum result. At the time of this notification, when an MPI_ALL_REDUCE instruction is being executed, the CPU 11 of the node 10_i sets the instruction issuer flag held in the dedicated instruction waiting buffer 13 and stores the information on the CPU to which the result value is to be returned.

In the crossbar switch 20, the data storage buffer manages its internal addresses by the system-unique ID (JID) that identifies the job, so that data does not accumulate and cause a deadlock.

In the crossbar switch 20, the data notification flag register 22:
- stores computation flags that track whether computation data has arrived from each node and whether the computation is complete, and
- sets the flags of the nodes not used in the computation, as notified by the mask information creation unit 14 of the node 10_1.
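The role of the data notification flag register can be illustrated with a small bit-flag sketch. The encoding is an assumption made for illustration only (one bit per node, with masked nodes treated as if they had already reported), and the class and method names are hypothetical; the specification does not define the register layout.

```python
class DataNotificationFlags:
    """One flag word per job: bit i corresponds to node i (assumed layout)."""

    def __init__(self, num_nodes: int, mask: int = 0) -> None:
        self.all_nodes = (1 << num_nodes) - 1
        # Bits set in the mask mark nodes that are not used in the computation.
        self.mask = mask & self.all_nodes
        self.arrived = 0  # bits set as each node's computation data arrives

    def mark_arrived(self, node: int) -> None:
        """Record that computation data from `node` has been received."""
        self.arrived |= 1 << node

    def complete(self) -> bool:
        """True once every node that is not masked has reported."""
        return (self.arrived | self.mask) == self.all_nodes
```

With this layout, completion detection is a single comparison, and masked nodes never have to be waited for.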

In the crossbar switch 20, the Max/Sum computation unit 24 receives the notifications from the notification instruction creation unit 15 of each node 10_i and performs the Max/Sum operation across the nodes according to the setting of the data notification flag register 22.

Each time it does so, the Max/Sum computation unit 24 stores the data at the address in the computation result storage buffer 25 that corresponds to the designated job ID (JID).

In the crossbar switch 20, when the computation across all participating nodes has finished, the end notification unit 26 detects the completion and broadcasts the computation result to all nodes.
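The switch-side behaviour described above (accumulating into a slot selected by JID and reporting the result once the last participating node has notified) can be sketched in software as follows. This is a simplified model, not the hardware design: the class `CrossbarReducer`, its methods, and the use of dictionaries and sets in place of the result buffer and flag register are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Optional, Set


class CrossbarReducer:
    """Software model of switch-side accumulation, one slot per job ID (JID)."""

    def __init__(self) -> None:
        self._pending: Dict[int, Set[int]] = {}  # JID -> nodes still awaited
        self._partial: Dict[int, float] = {}     # JID -> running result slot
        self._op: Dict[int, Callable[[float, float], float]] = {}

    def start_job(
        self,
        jid: int,
        participating: Iterable[int],
        op: Callable[[float, float], float],
    ) -> None:
        """Register a job, the nodes it waits for, and its Max or Sum operator."""
        self._pending[jid] = set(participating)
        self._op[jid] = op

    def notify(self, jid: int, node: int, value: float) -> Optional[float]:
        """Fold one node's value in; return the final result on the last arrival."""
        self._pending[jid].discard(node)
        if jid in self._partial:
            self._partial[jid] = self._op[jid](self._partial[jid], value)
        else:
            self._partial[jid] = value
        if self._pending[jid]:
            return None  # still waiting for other nodes of this job
        del self._pending[jid], self._op[jid]
        return self._partial.pop(jid)  # the value that would be broadcast
```

Because each job has its own slot, notifications belonging to different jobs can be interleaved in any order without one job blocking another, which is the property the JID-managed buffer is meant to provide.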

Back at the nodes, the return data JID comparison unit 18 of each node 10_1, 10_i refers to the instruction issuer flag stored in the dedicated instruction waiting buffer 13 at the address of the ID designated for the job, and checks whether the node is an instruction issuer. If the instruction issuer flag does not indicate an issuer, the return data JID comparison unit 18 discards the information returned from the crossbar switch 20; if it does, the unit returns the information to the CPU 11.

In this way, when MPI_ALL_REDUCE and MPI_REDUCE instructions are issued by multiple jobs, each node 10_i is instructed to have its CPU perform the Max/Sum operation within the node. The results are then combined by Max/Sum on the crossbar switch 20, using a hardware mechanism designed not to cause deadlock, and the Max/Sum result covering all nodes can be returned to all nodes (or only to the issuing node) without establishing connections from the individual nodes 10_i.

This avoids the bottleneck of connection-type networks, namely that data can be transferred only over a 1:1 connection, so that the MPI dedicated instructions MPI_ALL_REDUCE and MPI_REDUCE can be computed at high speed and the computing performance of the multi-node system is improved.

The present invention focuses on the fact that, although a connection-type network must take a connection lock so that transfers from other nodes do not reach the data transfer destination node at the same time, a transfer that goes only as far as the crossbar switch needs no connection lock and can be performed from a node at any time. Accordingly, the transfer data from within each node is sent to the crossbar switch, and a computation unit is provided in the crossbar switch so that the computation can be carried out within the switch until a single result remains.

The present invention also provides a multiple-execution mechanism, managed by job ID (JOBID), so that the computation mechanism can operate in parallel in response to requests from multiple nodes until the results are reduced to one.

Further, in the present invention, the computation result is broadcast from the crossbar switch 20 to all nodes, so that the transfer phase is carried out without a connection lock.

Furthermore, the present invention includes the transfer control unit 12, which receives the result broadcast from the crossbar switch 20 to all nodes and, depending on the instruction type, selectively passes that notification to the CPU 11. The route therefore no longer needs to be conveyed in the distribution phase, and the nodes can be told to start computing by broadcast in the distribution phase as well (the distribution phase is simplified).

As described above, in the present invention each node includes the CPU 11 and the transfer control unit 12. When MPI_ALL_REDUCE and MPI_REDUCE instructions of multiple jobs are issued, each node is instructed through the crossbar switch 20 to perform the Max/Sum operation within the node, the results are combined by Max/Sum on the crossbar switch 20 using a hardware mechanism designed not to cause deadlock, and the Max/Sum result covering all nodes can be returned to all nodes (or only to the issuing node) without establishing connections from the individual nodes. This configuration avoids the bottleneck of connection-type networks, namely that data can be transferred only over a 1:1 connection, and greatly reduces the transfer time needed to process MPI_ALL_REDUCE/MPI_REDUCE instructions, thereby achieving high-speed multi-node jobs. The invention is described below with reference to an embodiment.

FIG. 1 is a diagram showing the configuration of a multi-node system according to one embodiment of the present invention. FIG. 1 shows a multi-node system consisting of a plurality of nodes 10_1, 10_i (where the number of nodes is N, N being an integer of 2 or more, and i is an integer from 2 to N) and one crossbar switch 20. The nodes 10_1, 10_i have the same configuration; each has at least one CPU 11 that performs computation, and the nodes execute a multi-node job with each handling its own part.

Every node is capable of issuing and processing the same instructions, but for the sake of explanation the configuration is described on the assumption that instructions are issued only from the request source node 10_1.

Each of the nodes 10_1, 10_i includes one or more CPUs 11, a memory, and a transfer control unit 12.

In the request source node 10_1 in FIG. 1, the CPU 11 issues an MPI dedicated instruction for processing an MPI_ALL_REDUCE/MPI_REDUCE instruction to the transfer control unit 12.

The MPI dedicated instruction carries:
- an ID unique within the system for the multi-node job (JOBID),
- information on the array over which the instruction is to be computed and on the nodes that execute the instruction,
- the instruction type (instruction type 1: MPI_ALL_REDUCE or MPI_REDUCE; instruction type 2: Max operation or Sum operation), and
- the return destination of the instruction.
This information is notified to the transfer control unit 12.
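The fields listed above could be modelled as a simple record. The sketch below is hypothetical throughout: the specification names the pieces of information but not their encoding, so the field names, the `(offset, length)` form of the array information, and the use of an integer CPU index as the return destination are assumptions made only to show how the pieces fit together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Collective(Enum):
    """Instruction type 1."""
    ALL_REDUCE = "MPI_ALL_REDUCE"
    REDUCE = "MPI_REDUCE"


class Operation(Enum):
    """Instruction type 2."""
    MAX = "max"
    SUM = "sum"


@dataclass(frozen=True)
class MpiDedicatedInstruction:
    job_id: int                  # system-unique JOBID (JID)
    array_info: Tuple[int, int]  # assumed form: (offset, length) of the array
    nodes: Tuple[int, ...]       # nodes that execute the instruction
    collective: Collective       # MPI_ALL_REDUCE or MPI_REDUCE
    operation: Operation         # Max or Sum
    return_to: int               # assumed form: index of the CPU to return to
```

A request for the maximum over nodes 0 to 2 in job 7 would then be written as `MpiDedicatedInstruction(7, (0, 1024), (0, 1, 2), Collective.REDUCE, Operation.MAX, 0)`.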

The CPU 11 in each of the other nodes 10_i (i is an integer of 2 or more) has a function of issuing a notification dedicated instruction to notify the crossbar switch 20, which interconnects the nodes, of the Max/Sum result computed by the CPU 11. Any known configuration may be used for the CPU itself; because the CPU is to hold the final return value as its computation result, it waits for the computation to complete. Any known configuration may likewise be used for the memory (not shown) and other components that the CPU refers to.

The transfer control unit 12 includes the dedicated instruction waiting buffer 13, the mask information creation unit 14, the notification instruction creation unit 15, a data reception unit 16, a data transmission unit 17, the return data JID comparison unit 18, and an existing other-instruction issue mechanism 19. The existing other-instruction issue mechanism 19 is a unit that controls the issuing of instructions other than the MPI dedicated instruction and the notification dedicated instruction; it is not directly related to the subject of the present invention, and its description is omitted.

In FIG. 1, to simplify the explanation of the operation, the transfer control unit 12 of the request source node 10_1 is drawn with the mask information creation unit 14 and the transfer control units 12 of the other nodes 10_i are drawn with the notification instruction creation unit 15. In fact all nodes have the same configuration, and each includes both the mask information creation unit 14 and the notification instruction creation unit 15.

The transfer control unit 12 of the request source node 10_1 receives the MPI dedicated instruction issued from the CPU 11 of the node 10_1. For an MPI dedicated instruction, it stores the JOBID (JID) and the return destination information in the dedicated instruction waiting buffer 13 provided in the transfer control unit 12, and saves them with the instruction issuer flag, which indicates that the own node is the issuer of the MPI dedicated instruction, set to ON.

In the transfer control unit 12 of the request source node 10_1, the mask information creation unit 14 creates the mask information to be notified to the data notification flag register 22 in the crossbar switch 20 from the node information (information on the nodes that execute the computation) supplied by the CPU 11 together with the MPI dedicated instruction, and the data transmission unit 17 notifies the crossbar switch 20 of the created mask information.
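Deriving the mask from the list of executing nodes can be pictured as follows. The bit encoding (bit i set means node i is excluded from the computation) and the function name `make_mask` are assumptions for illustration; the specification states only that the mask is created from the node information.

```python
from typing import Iterable


def make_mask(num_nodes: int, participating: Iterable[int]) -> int:
    """Return a bit mask in which bit i is set when node i is NOT used."""
    mask = (1 << num_nodes) - 1  # start with every node masked out
    for node in participating:
        if not 0 <= node < num_nodes:
            raise ValueError(f"node index {node} out of range")
        mask &= ~(1 << node)  # clear the bit of each executing node
    return mask
```

For example, in a four-node system where nodes 0, 1, and 2 execute the computation, `make_mask(4, [0, 1, 2])` yields `0b1000`, marking node 3 as a node whose result is not waited for.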

The mask information creation unit 14 of the transfer control unit 12 of the request source node 10_1 also creates a broadcast communication instruction whose contents are instruction types 1 and 2, the information on the array to be computed, and the JOBID, and the data transmission unit 17 notifies the crossbar switch 20 of it.

Broadcast communication is a known function of crossbar switches; the instruction is delivered to the CPUs of all nodes via the crossbar switch 20.

Meanwhile, in the transfer control unit 12 of a node 10_i, when a notification dedicated instruction arrives from the CPU 11 in the node 10_i, the notification command creation unit 15 converts it into a transmittable format, and the data transmission unit 17 sends the Max/Sum calculation result to the Max/Sum calculation unit 24 in the crossbar switch 20.

At the time of this notification, when an MPI_ALL_REDUCE instruction is being executed, the CPU 11 stores the JOBID and the return destination information in the dedicated instruction waiting buffer 13 in the node 10_i, with the instruction issuer flag set to ON.

In the transfer control unit 12 of each node (the request source node 10_1 and the other nodes 10_i), the Max/Sum calculation result from the crossbar switch 20 is passed to the return data JID comparison unit 18 via the data receiving unit 16.

The return data JID comparison unit 18 compares the Max/Sum calculation result with the information in the dedicated instruction waiting buffer 13, and either passes the Max/Sum calculation result to the CPU 11 or discards the data. Specifically, it checks whether the dedicated instruction waiting buffer 13 holds a per-job ID (JOBID) equal to the JOBID received from the crossbar switch 20 through the data receiving unit 16. If a matching entry exists, it then refers to the instruction issuer flag stored in the dedicated instruction waiting buffer 13 (the flag is ON when the node is the issuer of the MPI dedicated instruction). If the flag shows that the node is not the instruction issuer, both the returned information from the crossbar switch 20 and the entry stored in the dedicated instruction waiting buffer 13 are discarded. If the node is the instruction issuer (the request source node), the information (the Max/Sum calculation result) is returned to the corresponding CPU 11 and only the entry in the dedicated instruction waiting buffer 13 is discarded.
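The decision made by the return data JID comparison unit 18 can be sketched as follows. This is a minimal illustration, not the patented circuit: the names `WaitEntry` and `handle_result` and the dictionary layout of the waiting buffer are hypothetical, and treating a result with no matching JOBID as discarded is an assumption based on the abstract.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class WaitEntry:
    """One entry of the dedicated instruction waiting buffer (hypothetical layout)."""
    return_dest: str   # return destination information
    is_issuer: bool    # instruction issuer flag (ON = True)


def handle_result(buffer: Dict[int, WaitEntry], job_id: int,
                  result: float) -> Optional[Tuple[str, float]]:
    """Return (return_dest, result) if the result goes to the CPU, else None."""
    # The matching buffer entry is discarded in every case.
    entry = buffer.pop(job_id, None)
    if entry is None:
        return None  # assumption: no matching JOBID means the result is dropped
    if entry.is_issuer:
        return (entry.return_dest, result)  # issuer: hand the result to the CPU
    return None  # not the issuer: the returned data is dropped as well
```

With MPI_ALL_REDUCE every participating node stores its entry with the flag ON, so in this sketch every such node would receive the result, while with MPI_REDUCE only the request source node would.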

When the data receiving unit 16 of a node receives communication from the crossbar switch 20 and it is a broadcast communication, the transfer control unit 12 passes it to the CPU 11 unconditionally.

The broadcast communication command (carrying instruction types 1 and 2, the array information on which the operation is executed, and the JOBID) delivers the Max/Sum execution request to the CPU 11 of each node 10_i via the crossbar switch 20, and the CPU 11 of each node executes the Max/Sum operation. After executing it, the CPU 11 of each node sends the resulting Max/Sum calculation result, together with the JOBID and the instruction type, to the transfer control unit 12 by means of a notification dedicated instruction.

Next, the crossbar switch 20 of this embodiment is described. The crossbar switch 20 includes a broadcast execution control unit and a connection-type transfer execution unit (neither shown), the data notification flag register 22, data storage buffers (23, 25), the Max/Sum calculation unit 24, and the end notification unit 26.

The data notification flag register 22 in the crossbar switch 20 has, for each JOBID, a register with N bits, one per node. As long as any of these bits is 0, the Max/Sum calculation unit 24 is controlled so that it waits for data from the corresponding node and continues the Max/Sum operation.

Before the operation at each node starts, the request source node 10_1 notifies the data notification flag register 22 of the mask setting, so that the switch does not wait for Max/Sum calculation results from nodes that are not used in the JOB. The mask information is set accordingly in the register 22 corresponding to the JOBID (the bit at the position of each invalid node is set to 1). In addition, when the Max/Sum operation data from a node has finished arriving, the bit at the position of that node is set to 1.
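The per-JOBID bit handling described above can be sketched in software as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, nodes are numbered from 0, and a Python integer stands in for the N-bit hardware register.

```python
from typing import Dict, List


class DataNotificationFlagRegister:
    """Per-JOBID N-bit flags; a 1 bit means the node need not be waited for."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.all_ones = (1 << num_nodes) - 1
        self.flags: Dict[int, int] = {}  # JOBID -> N-bit value

    def set_mask(self, job_id: int, unused_nodes: List[int]) -> None:
        """Mask setting from the request source: bits of unused nodes become 1."""
        bits = 0
        for node in unused_nodes:
            bits |= 1 << node
        self.flags[job_id] = bits

    def mark_arrived(self, job_id: int, node: int) -> None:
        """Set the node's bit to 1 once its Max/Sum data has arrived."""
        self.flags[job_id] |= 1 << node

    def pending_nodes(self, job_id: int) -> List[int]:
        """Nodes whose bit is still 0, i.e. nodes the switch still waits for."""
        value = self.flags[job_id]
        return [n for n in range(self.num_nodes) if not (value >> n) & 1]

    def all_done(self, job_id: int) -> bool:
        """True when every bit is 1 (the AND of all bits is 1)."""
        return self.flags[job_id] == self.all_ones
```

Because masked nodes and nodes that have already reported share the same bit value, a single all-ones check is enough to detect that nothing remains to wait for.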

The data storage buffer receives and manages the notifications from the notification command creation unit 15 of each node 10_i. It comprises the notification reception storage buffer 23 and the calculation result storage buffer 25. In each buffer, the address at which data is stored is managed by an ID that identifies the JOB and is unique within the system (JOBID); this serves as a multiple-execution mechanism.

Data notified from each node is first stored in the notification reception storage buffer 23. Even when the same node sends calculation results for several JOBIDs, data for different JOBIDs is stored at different addresses and is therefore not lost. Because the Max/Sum calculation unit 24 can read the data in a variable order according to the calculation priority, earlier calculation data causes neither deadlock nor data loss.

The Max/Sum calculation unit 24 determines the priority from the nodes and the settings of the data notification flag register 22, and executes the Max/Sum operation across the nodes. Each time, it passes the calculation result together with the JOBID to the calculation result storage buffer 25 of the data storage buffer, which stores the data at the address for the specified ID. Any method may be used to determine the calculation priority, for example a fixed order that starts from the lowest JOBID and node number, or a variable priority that balances the load.
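The interplay of the two buffers and the fixed "lowest JOBID and node number first" priority mentioned above can be sketched as follows. The class `SwitchBuffers`, its methods, and the simplification that one operation type applies to all jobs are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple


class SwitchBuffers:
    """Software stand-in for buffers 23 and 25 plus the Max/Sum calculation unit."""

    def __init__(self, op: str) -> None:
        # Simplification: a single operation ("max" or "sum") for all jobs.
        self.combine = max if op == "max" else (lambda a, b: a + b)
        # Notification reception storage buffer: (JOBID, node) -> value.
        self.received: Dict[Tuple[int, int], float] = {}
        # Calculation result storage buffer: JOBID -> running result.
        self.results: Dict[int, float] = {}

    def receive(self, job_id: int, node: int, value: float) -> None:
        """Store a node's result at an address determined by its JOBID."""
        self.received[(job_id, node)] = value

    def step(self) -> Optional[Tuple[int, int]]:
        """Fold one pending value into its job's running result.

        Fixed priority: the lowest (JOBID, node) pair is processed first.
        """
        if not self.received:
            return None
        key = min(self.received)
        value = self.received.pop(key)
        job_id = key[0]
        if job_id in self.results:
            value = self.combine(self.results[job_id], value)
        self.results[job_id] = value
        return key
```

Keying both buffers by JOBID is what keeps results from different jobs apart, so a value that arrives early for one job does not block or overwrite another job's data.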

終了通知部２６は、演算実行する全てのノード間の演算が終了すると、演算完了を検出し、演算結果を、全ノードにブロードキャスト通知する。 When the calculation between all the nodes that perform the calculation is completed, the end notification unit 26 detects the completion of the calculation and broadcasts the calculation result to all the nodes.

Specifically, when every bit of the N-bit register that the data notification flag register 22 holds for a JOBID is 1, the AND of the bits becomes 1 in the end notification unit 26 and the output notification becomes valid (Valid). Once the calculation is complete, the final result is broadcast to all nodes.

In the above embodiment the return destination is the CPU. Alternatively, the return destination of the data communication may be a memory (not shown), and the CPU may learn that the instruction has finished by referring to that memory area.

The type of operation is not particularly limited. The unit may support only the Max operation, which finds the maximum, or only the Sum operation, which finds the total, and it may also perform other operations such as the four basic arithmetic operations or functions such as square root (SQRT).

Next, the operation of the multi-node system of FIG. 1 is described with reference to the operation flow shown in FIG. 2. FIGS. 1 and 2 show a multi-node system consisting of a plurality of nodes and one crossbar switch. Every node has at least one CPU, and the nodes execute a multi-node JOB by each processing its share of the work. Any node is capable of issuing and processing such instructions, but in the illustrated example the instruction is assumed, for the sake of explanation, to be issued only from the request source node.

In the (1) distribution phase of FIGS. 2(A) and 2(B), the master process (the CPU (process) of the request source node) performs broadcast communication to each process (CPU) via the crossbar switch 20 (step S11). When notification to all processes (CPUs) is complete (the yes branch of step S12), the operation is executed at each node.

More specifically, in the request source node 10_1 of FIG. 1, the CPU 11 issues to the transfer control unit 12 an MPI dedicated instruction for processing an MPI_ALL_REDUCE or MPI_REDUCE instruction.

An MPI dedicated instruction carries:
- an ID that is unique within the system for the multi-node JOB (JOBID),
- the array information on which the operation is executed and the information on the nodes that execute the instruction,
- the instruction type (instruction type 1: MPI_ALL_REDUCE or MPI_REDUCE; instruction type 2: Max operation, Sum operation, etc.), and
- the return destination information.
The MPI dedicated instruction is sent to the transfer control unit 12.
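The fields listed above can be pictured as a small record, together with the mask that the mask information creation unit 14 would derive from the node information. The field names, the string used for the array information, and the helper `unused_node_mask` are hypothetical; the patent does not specify an encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Collective(Enum):
    """Instruction type 1."""
    MPI_REDUCE = 1
    MPI_ALL_REDUCE = 2


class Operation(Enum):
    """Instruction type 2."""
    MAX = 1
    SUM = 2


@dataclass(frozen=True)
class MpiDedicatedInstruction:
    job_id: int               # JOBID, unique within the system
    array_info: str           # array on which the operation is executed (placeholder)
    nodes: FrozenSet[int]     # nodes that execute the instruction
    collective: Collective    # instruction type 1
    operation: Operation      # instruction type 2
    return_dest: str          # return destination information

    def unused_node_mask(self, num_nodes: int) -> int:
        """Mask with a 1 bit for every node that does not take part."""
        mask = 0
        for node in range(num_nodes):
            if node not in self.nodes:
                mask |= 1 << node
        return mask
```

For example, with four nodes numbered 0 to 3 and nodes 0, 1 and 3 taking part, `unused_node_mask(4)` yields `0b0100`, marking node 2 as a node whose result is not awaited.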

Next, the transfer control unit 12 of the request source node receives the MPI dedicated instruction issued by the CPU 11, stores the JOBID and the return destination information in the dedicated instruction waiting buffer 13 provided in the transfer control unit 12, and saves them with the instruction issuer flag set to ON.

From the information on the executing nodes, the mask information creation unit 14 creates the mask information for the data notification flag register 22 of the crossbar switch 20, and the data transmission unit 17 sends it to the crossbar switch 20.

Similarly, the mask information creation unit 14 creates a broadcast communication command carrying instruction types 1 and 2, the array information on which the operation is executed, and the JOBID, and the data transmission unit 17 sends it to the crossbar switch 20.

Based on the notified mask information, the data notification flag register 22 of the crossbar switch 20 sets the mask in the bits (bit flags) of the corresponding nodes for the corresponding JOBID.

Broadcast communication uses the communication mechanism that crossbar switches already provide as conventional technology; the command is delivered to the CPUs 11 of all nodes via the crossbar switch 20.

In the (2) calculation phase of FIGS. 2(A) and 2(B), the Max/Sum execution request arrives at each node by broadcast communication through the crossbar and is delivered to the data receiving unit 16 of the transfer control unit 12.

これを、データ受信部１６はＣＰＵ１１に通知し、ＣＰＵ１１にてＭａｘ／Ｓｕｍの演算が実行される。各ノードのＣＰＵ１１でＭａｘ／Ｓｕｍの演算実行後、得られたＭａｘ／Ｓｕｍの演算結果を、ＪＯＢＩＤや命令種別とともに、各ノードのＣＰＵ１１から転送制御部１２に通知専用命令として通知する。 The data receiving unit 16 notifies the CPU 11 of this, and the CPU 11 executes Max / Sum calculation. After the Max / Sum operation is executed by the CPU 11 of each node, the obtained Max / Sum operation result is notified from the CPU 11 of each node to the transfer control unit 12 as a notification-only instruction along with the JOBID and the instruction type.

図２（Ａ）及び図２（Ｂ）の（３）転送フェーズにおいて、クロスバースイッチ２０に各ノードから非ブロードキャスト通信で演算結果を通知する（ステップＳ１４）。各ノードの出力がそろったところで、各ノードのＭａｘ／Ｓｕｍをクロスバースイッチ２０で一括演算する（ステップＳ１６）。リクエスト元ノードを含め、全ノードにＭａｘ／Ｓｕｍ演算結果を通知し、リクエスト元ノードはＭａｘ／Ｓｕｍ演算結果を得る（ステップＳ１７）。 In the (3) transfer phase of FIGS. 2A and 2B, the calculation result is notified to the crossbar switch 20 from each node by non-broadcast communication (step S14). When the output of each node is complete, the Max / Sum of each node is collectively calculated by the crossbar switch 20 (step S16). All nodes including the request source node are notified of the Max / Sum operation result, and the request source node obtains the Max / Sum operation result (step S17).

More specifically, in each node that has received the execution instruction and executed the operation, the notification command creation unit 15 in the transfer control unit 12 creates the notification information when the notification dedicated instruction arrives from the CPU 11, and the data transmission unit 17 sends the Max/Sum result to the Max/Sum calculation unit 24 of the crossbar switch 20. At the time of this notification, when an MPI_ALL_REDUCE instruction is being executed, the CPU 11 stores the JOBID and the return destination information in the dedicated instruction waiting buffer 13 in the node, with the instruction issuer flag set to ON.

To keep data from accumulating in the data storage buffer and causing deadlock, the crossbar switch 20 manages the addresses within the buffer by an ID that identifies the JOB and is unique within the system (JOBID). The data notification flag register 22 stores calculation flags (the bits, one per node, of the N-bit register corresponding to a JOBID) that record the arrival of calculation data from each node and manage the completion of the calculation. On notification from the mask information creation unit 14, it also sets the flags of the nodes not used in the calculation.

The Max/Sum calculation unit 24 of the crossbar switch 20 receives the notification from the notification command creation unit of each node via the data transmission unit, and stores it at the address for the JOBID in the calculation result storage buffer (data storage buffer) 25.

According to the settings of the data notification flag register 22, the Max/Sum calculation unit 24 of the crossbar switch 20 takes the Max/Sum value in progress and the Max/Sum value of each node from the data storage buffer, executes the Max/Sum operation across the nodes, and stores the result back at the address for the JOBID in the calculation result storage buffer (data storage buffer) 25. This is repeated for the number of nodes until the Max/Sum operation has covered all target nodes.

この際、Ｍａｘ／Ｓｕｍ演算部２４は、毎回、演算結果格納バッファ（データ格納バッファ）２５の指定されたＩＤのアドレスにデータを格納する。 At this time, the Max / Sum operation unit 24 stores data at the address of the designated ID in the operation result storage buffer (data storage buffer) 25 each time.

終了通知部２６は、演算実行する全てのノード間の演算が終了すると、演算完了を検出し、演算結果を全ノードに通知する（終了通知が出力される際に該当ＪＯＢＩＤアドレスのデータ格納バッファの値はクリアされる）。 When the calculation between all the nodes that perform the calculation is completed, the end notification unit 26 detects the completion of the calculation and notifies the calculation result to all the nodes (when the end notification is output, the data storage buffer of the corresponding JOBID address). Value is cleared).

At each node, the return data JID comparison unit 18 receives the communication from the crossbar switch 20 through the data receiving unit and checks whether the dedicated instruction waiting buffer 13 holds a JID equal to the one in the communication. If a matching entry exists, the return data JID comparison unit 18 then refers to the instruction issuer flag stored in the dedicated instruction waiting buffer 13. If the flag shows that the node is not the issuer, both the returned information from the crossbar switch 20 and the entry stored in the dedicated instruction waiting buffer 13 are discarded. If the node is the issuer of the instruction, the information is returned to the corresponding CPU 11 and only the entry in the dedicated instruction waiting buffer 13 is discarded.

以上のように動作することで、マルチノードＪＯＢのＭＰＩ＿ＡＬＬ＿ＲＥＤＵＣＥ、ＭＰＩ＿ＲＥＤＵＣＥ命令を高速に実行する。 By operating as described above, the MPI_ALL_REDUCE and MPI_REDUCE instructions of the multi-node JOB are executed at high speed.
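The three phases described above (distribution, calculation, transfer) can be traced end to end with a short simulation. This is a rough sketch under assumptions: `run_job` is a hypothetical name, each participating node is represented by a non-empty list of local values, and the broadcast of the final result is modelled by returning the same value for every node.

```python
from functools import reduce
from typing import Callable, Dict, List


def run_job(data_by_node: Dict[int, List[float]], num_nodes: int,
            op: Callable[[float, float], float]) -> Dict[int, float]:
    """Simulate one Max/Sum job through the crossbar switch."""
    # (1) Distribution: the request is broadcast; nodes holding data take part.
    participants = sorted(data_by_node)
    # (2) Calculation: each participating CPU reduces its own portion locally.
    local = {n: reduce(op, data_by_node[n]) for n in participants}
    # (3) Transfer: the switch combines the local results into a single value
    #     and notifies every node of it.
    combined = reduce(op, (local[n] for n in participants))
    return {n: combined for n in range(num_nodes)}
```

Whether a node actually hands the broadcast value to its CPU is decided afterwards by the return data JID comparison unit, which this sketch leaves out.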

以上説明したように、本実施例においては、下記記載の作用効果を奏する。 As described above, the present embodiment has the following operational effects.

In executing MPI_ALL_REDUCE and MPI_REDUCE instructions of a multi-node JOB on a connection-type network, sending data from all nodes (N nodes) to one node normally requires N:1 communication (log N:1 even when inter-node transfers are made more efficient with a binary tree). Even in a network that re-establishes a connection for every transfer, a transfer that goes only as far as the crossbar switch needs no connection lock, so a node can send at any time.

Therefore, in the present invention, the data to be transferred from within each node is sent to the crossbar switch, and a calculation unit provided in the crossbar switch performs the operation there until a single result remains. This removes the need for connection locks, greatly reduces the time previously spent on them, and makes JOB execution in the multi-node system more efficient and faster.

In addition, because the transfer control unit of the node handles the communication with the CPU for MPI_ALL_REDUCE and MPI_REDUCE instructions, the function is realized, from the software (SW) point of view, as an extension of the conventional technology (with less software processing). Modifications are therefore easy to make, and existing software assets can be used effectively.

次に、本発明の他の実施例について説明する。本発明の第２の実施例として、その基本的構成は上記の通りであるが、クロスバースイッチ２０のデータ通知フラグレジスタ２２についてさらに工夫が施されている。なお、ノードは、図１の前記実施例と同一構成とされる。 Next, another embodiment of the present invention will be described. As a second embodiment of the present invention, the basic configuration is as described above, but the data notification flag register 22 of the crossbar switch 20 is further devised. The node has the same configuration as that of the embodiment shown in FIG.

FIG. 3 shows the configuration of the second embodiment of the present invention. In FIG. 3, the data notification flag register 22 has a Max/Sum data notification counter 28 for each JOBID. Compared with holding flags as in FIG. 1, this reduces the amount of hardware while providing almost the same function.

本実施例では、マスク設定時に、図１の前記実施例のように、演算に関与しないノードのビットを１にする代わりに、演算に関与する（対象となる）ノード数をカウンタに設定する。例えば１０個のノードに演算指示を行う場合、Ｍａｘ／Ｓｕｍデータ通知カウンタを１０とする。 In this embodiment, at the time of mask setting, instead of setting the bit of the node not involved in the operation to 1 as in the embodiment of FIG. 1, the number of nodes involved in the operation (target) is set in the counter. For example, when a calculation instruction is given to 10 nodes, the Max / Sum data notification counter is set to 10.

Although not limited to this, when a down counter is used as the Max/Sum data notification counter 28, the counter is decremented each time an operation for the JOBID is executed, and when its count reaches 0 the job is regarded as finished and the output is sent to each node. Alternatively, at the cost of more hardware than the down counter, the Max/Sum data notification counter 28 may be configured with an up counter that is separate from the register holding the set value, together with a circuit that determines completion.
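The down-counter variant can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that one decrement corresponds to one node's result being folded into the running value.

```python
from typing import Dict


class DataNotificationCounter:
    """Per-JOBID down counter replacing the N-bit flag register."""

    def __init__(self) -> None:
        self.remaining: Dict[int, int] = {}  # JOBID -> results still expected

    def set_count(self, job_id: int, participating_nodes: int) -> None:
        """Mask setting: load the number of nodes that take part in the job."""
        self.remaining[job_id] = participating_nodes

    def on_operation(self, job_id: int) -> bool:
        """Count down once per executed operation; True means the job is done."""
        self.remaining[job_id] -= 1
        return self.remaining[job_id] == 0
```

A counter of this kind needs only enough bits to hold the node count, rather than one bit per node, which is consistent with the hardware saving described for this embodiment.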

このように、本実施例では、データ通知フラグレジスタをカウンタという構成で実現しているので、ハードウェア量を削減することができる、という効果が得られる。 Thus, in this embodiment, since the data notification flag register is realized by a configuration called a counter, an effect that the amount of hardware can be reduced is obtained.

以上本発明を上記実施例に即して説明したが、本発明は上記実施例の構成にのみに限定されるものでなく、本発明の範囲内で当業者であればなし得るであろう各種変形、修正を含むことは勿論である。 Although the present invention has been described with reference to the above-described embodiments, the present invention is not limited to the configurations of the above-described embodiments, and various modifications that can be made by those skilled in the art within the scope of the present invention. Of course, including modifications.

FIG. 1 is a diagram showing the configuration of one embodiment of the present invention.
FIGS. 2(A) and 2(B) are diagrams for explaining the operation of one embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of the second embodiment of the present invention.
FIGS. 4(A) and 4(B) are diagrams for explaining the JOB execution operation of a conventional multi-node system.

Explanation of symbols

10 Node
11 CPU
12 Transfer control unit
13 Dedicated instruction waiting buffer
14 Mask information creation unit
15 Notification command creation unit
16 Data receiving unit
17 Data transmission unit
18 Return data JID comparison unit
19 Other instruction issuing mechanism
20 Crossbar switch
22 Data notification flag register
23 Notification reception storage buffer
24 Max/Sum calculation unit
25 Calculation result storage buffer
26 End notification unit
28 Max/Sum data notification counter

Claims

An information processing system comprising a plurality of nodes and a crossbar switch,
wherein the crossbar switch comprises:
means for receiving a request from a request source node among the plurality of nodes and distributing a calculation instruction corresponding to the request to the other nodes by broadcast communication;
calculation means for collecting the calculation results executed at the nodes to which the calculation instruction was distributed and performing a calculation on them; and
means for notifying at least the request source node of the result of the calculation by the calculation means.

The information processing system according to claim 1, wherein the crossbar switch notifies a plurality of nodes of a calculation result of the calculation means by broadcast communication.

The information processing system according to claim 2, wherein each node comprises means for checking whether the calculation result received from the crossbar switch corresponds to an instruction issued by the own node and, if it was not requested by the own node, discarding the calculation result received from the crossbar switch.

3. The information processing system according to claim 1, wherein, in the crossbar switch, the calculation unit performs control to perform calculation until the calculation result collected from the node becomes one. 4.

The information processing system according to claim 4, wherein, in the crossbar switch, a storage unit that stores the calculation results received from the nodes and supplies them to the calculation means, and a storage unit that stores the calculation result output from the calculation means, each store and manage data at an address corresponding to an ID associated with the request.

The information processing system according to claim 1, wherein, depending on the instruction type, the calculation result is returned to the CPUs of all nodes in addition to the CPU of the request source node.

The information processing system according to claim 1, wherein each of the nodes includes a CPU and a transfer control unit, and
the transfer control unit comprises:
a dedicated instruction waiting buffer that stores an MPI (Message Passing Interface) dedicated instruction issued from the CPU and related information;
a mask information creation unit that notifies the crossbar switch of mask information designating the nodes that are not used for the calculation and whose calculation results are not waited for;
a notification command creation unit that receives a notification dedicated instruction from the CPU and creates a notification command to be transmitted to the crossbar switch; and
a return data comparison unit that determines whether the calculation result notified from the crossbar switch corresponds to an instruction issued by the CPU of the own node, notifies the CPU of the calculation result when it does, and performs control to discard it when it does not.

The crossbar switch has a data storage buffer comprising a data notification flag register, a calculation unit, an end notification unit, a notification reception storage buffer for storing notification data from a node, and a calculation result storage buffer for storing calculation result data. Prepared,
In the crossbar switch, the data notification flag register is set with mask information received from the request source node,
In response to an operation execution instruction from the requesting node, broadcast execution is notified to multiple nodes by broadcast communication.
When a notification-dedicated instruction is sent from the CPU of each node in the plurality of nodes that have performed the calculation, the calculation result is notified to the calculation unit of the crossbar switch from each node,
In the crossbar switch, the calculation unit receives a notification from the notification instruction creation unit of each node, and executes a calculation between the nodes by setting the data notification flag register,
According to the setting of the data notification flag register in the calculation unit, when the calculation between all the nodes performing the calculation is completed, the completion notification unit broadcasts the calculation result to the plurality of nodes. The information processing system according to claim 7.

The data notification flag register is provided for each job ID (referred to as “JOBID”) related to a request, and includes a register having a bit number corresponding to the number of the plurality of nodes,
The arithmetic unit controls so as to wait for notification data from a node corresponding to the bit and perform an operation as long as the bit of the register corresponding to the job of the data notification flag register has a first value. The information processing system according to claim 8, wherein:

The calculation request source node is notified of the setting of mask information in the data notification flag register so as not to wait for the calculation result of the node not used in the job before the calculation at the distributed node starts.
The information processing system according to claim 9, wherein a second value is set as an invalid node bit as mask information in a corresponding register of the data notification flag register.

10. The bit corresponding to the node of the register corresponding to the job is set to a second value in the data notification flag register upon completion of arrival of notification data from the node. Information processing system.

The information processing system according to claim 8, wherein the notification reception storage buffer and the calculation result storage buffer manage the addresses within the buffers according to a job ID that identifies the job.

The information processing system according to claim 8, wherein the calculation unit executes the calculation between the nodes according to the setting of the data notification flag register, and each time passes the calculation result together with the job ID to the calculation result storage buffer so that the data is stored at the address for the specified job ID.

The information processing system according to claim 8, wherein the end notification unit enables output notification when all the bits of the register corresponding to the job ID in the data notification flag register have the second value, and, when the calculation unit has completed the calculation, performs control to notify all the nodes of the final result by broadcast communication.

The information processing system according to claim 7, wherein the dedicated instruction queuing buffer stores the ID of the job related to execution of the MPI dedicated instruction, return destination information, and an instruction issuer flag indicating whether the node is the issuer node of the MPI dedicated instruction,
and the return data comparison unit checks whether the dedicated instruction queuing buffer holds, among the IDs specified for each job, an ID equal to the job ID received from the crossbar switch, and refers to the instruction issuer flag stored in the dedicated instruction queuing buffer; if the node is not the instruction issuer, the return information notified from the crossbar switch and the information stored in the dedicated instruction queuing buffer are both discarded, and if the node is the instruction issuer, the calculation result from the crossbar switch is returned to the CPU and only the information in the dedicated instruction queuing buffer is discarded.
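
The return-side check described in the preceding claim can be sketched as follows. This is a hedged illustration: `QueueEntry`, `handle_returned_result`, and the use of a dictionary keyed by job ID are hypothetical names and structures chosen for the example, and the handling of a job ID with no matching entry (the result is simply dropped) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    """Hypothetical entry of the dedicated instruction queuing buffer."""
    return_destination: str  # where the CPU expects the result
    is_issuer: bool          # instruction issuer flag


def handle_returned_result(queue, job_id, result):
    """Decide what to do with a result broadcast by the crossbar switch.

    `queue` maps job ID -> QueueEntry. Returns (return_destination, result)
    when the result must go back to the CPU, otherwise None (discarded).
    """
    entry = queue.pop(job_id, None)  # the buffer entry is discarded in every case
    if entry is None:
        return None                  # assumption: no matching job ID, drop the result
    if entry.is_issuer:
        return (entry.return_destination, result)
    return None                      # not the issuer: discard the returned data too
```

In both the issuer and non-issuer case the queued entry is removed; only the issuer forwards the broadcast result to its CPU.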

The information processing system according to claim 8, wherein the data notification flag register includes a counter for each job ID, in which the number of nodes taking part in the calculation is set,
and each time the calculation unit performs a calculation for the job ID, the corresponding counter is counted down, and when the count value reaches 0 the calculation is regarded as finished and the end notification unit notifies each node of the calculation result.

The information processing system according to claim 8, wherein the data notification flag register includes a counter for each job ID,
and each time the calculation unit performs a calculation for the job ID, the corresponding counter is counted up, and when the count value reaches the number of nodes taking part in the calculation the calculation is regarded as finished and the end notification unit notifies each node of the calculation result.
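
The two counter-based variants of the data notification flag register can be compared with a short Python sketch. The class names are hypothetical, and the sketch assumes that one calculation is counted per arriving node notification, which the claims do not state explicitly.

```python
class CountDownTracker:
    """Hypothetical count-down variant: the counter starts at the node count."""

    def __init__(self):
        self.counters = {}  # job ID -> calculations still outstanding

    def start(self, job_id, num_nodes):
        self.counters[job_id] = num_nodes

    def on_calculation(self, job_id):
        # Returns True when the job is regarded as finished.
        self.counters[job_id] -= 1
        return self.counters[job_id] == 0


class CountUpTracker:
    """Hypothetical count-up variant: the counter starts at 0 and has a target."""

    def __init__(self):
        self.counters = {}  # job ID -> calculations performed so far
        self.targets = {}   # job ID -> number of nodes taking part

    def start(self, job_id, num_nodes):
        self.counters[job_id] = 0
        self.targets[job_id] = num_nodes

    def on_calculation(self, job_id):
        # Returns True when the count reaches the number of participating nodes.
        self.counters[job_id] += 1
        return self.counters[job_id] == self.targets[job_id]
```

Both variants signal completion on the same calculation; the count-down form needs only one stored value per job, while the count-up form keeps the target separately.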

A job execution method for a multi-node system having a plurality of nodes and a crossbar switch, each of the nodes having a CPU and a transfer control unit, the method comprising:
a step in which the transfer control unit of the request source node receives an MPI (Message Passing Interface) dedicated instruction issued from the CPU and stores information in a dedicated instruction queuing buffer;
a step in which the transfer control unit of the request source node notifies the crossbar switch of mask information designating invalid nodes;
a step in which the crossbar switch sets the mask information in a data notification flag register;
a step in which the crossbar switch notifies all the nodes of a calculation execution instruction by broadcast communication;
a step in which each node that has performed the calculation, when a notification-dedicated instruction is sent from its CPU, notifies the calculation result to a calculation unit of the crossbar switch; and
a step in which, in the crossbar switch, the calculation unit receives the notifications from the notification command generation units of the nodes, executes the calculation between the nodes according to the setting of the data notification flag register, and, when the calculation among all the nodes taking part in the calculation is finished, broadcasts the calculation result to all the nodes.
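
The sequence of steps in the method claim can be traced with a small software simulation. It is a sketch under stated assumptions, not the claimed hardware: `CrossbarSwitch`, `run_job`, the use of addition as the reduction, and the rule that only the issuer node keeps the broadcast result are all illustrative choices.

```python
import operator


class CrossbarSwitch:
    """Hypothetical software model of the switch-side reduction."""

    def __init__(self, num_nodes, op=operator.add):
        self.num_nodes = num_nodes
        self.op = op
        self.flags = {}    # job ID -> bit field of nodes still awaited
        self.results = {}  # job ID -> running calculation result

    def set_mask(self, job_id, invalid_nodes):
        bits = (1 << self.num_nodes) - 1
        for node in invalid_nodes:
            bits &= ~(1 << node)  # invalid nodes are never waited for
        self.flags[job_id] = bits

    def notify(self, job_id, node, value):
        """Fold one node's value in; return the final result on the last arrival."""
        if not self.flags.get(job_id, 0) & (1 << node):
            return None  # masked node, duplicate, or job already finished
        if job_id in self.results:
            self.results[job_id] = self.op(self.results[job_id], value)
        else:
            self.results[job_id] = value
        self.flags[job_id] &= ~(1 << node)
        if self.flags[job_id] == 0:
            del self.flags[job_id]
            return self.results.pop(job_id)  # ready to broadcast
        return None


def run_job(switch, job_id, issuer, local_values, invalid_nodes=()):
    """Run one job; return what each node hands to its CPU (None = discarded)."""
    switch.set_mask(job_id, invalid_nodes)     # mask sent by the request source
    final = None
    for node, value in local_values.items():   # each node notifies its result
        out = switch.notify(job_id, node, value)
        if out is not None:
            final = out
    # The result is broadcast to all nodes; in this sketch only the issuer keeps it.
    return {n: (final if n == issuer else None) for n in range(switch.num_nodes)}
```

For example, with four nodes, node 3 marked invalid, and local values 1, 2 and 3 on nodes 0 to 2, the issuer node receives 6 and a value reported by the masked node is ignored.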

The job execution method for a multi-node system according to claim 18, wherein each node refers to the instruction issuer flag information stored at the address of the job ID specified for each job in the dedicated instruction queuing buffer, determines whether it is the instruction issuer node, discards the return information from the crossbar switch if it is not the issuer, and returns the information to the CPU if it is the issuer.

A node connected to a crossbar switch, comprising a CPU and a transfer control unit,
wherein the transfer control unit includes:
a dedicated instruction queuing buffer that stores a dedicated transfer instruction issued from the CPU and related information;
a mask information creation unit that notifies the crossbar switch of mask information designating nodes that are not used for the calculation and whose calculation results are not waited for;
a notification command generation unit that receives a notification-dedicated instruction from the CPU and generates a notification command to be transmitted to the crossbar switch; and
a return data comparison unit that determines whether the calculation result notified from the crossbar switch corresponds to an instruction issued by the CPU of its own node, and performs discard control when it does not correspond to an issued instruction.

A crossbar switch connected to a plurality of nodes, comprising:
a data notification flag register;
a calculation unit;
an end notification unit; and
a data storage buffer including a notification reception buffer that stores notification data from the nodes and a calculation result storage buffer that stores calculation result data, wherein
mask information is received from a request source node and set in the data notification flag register,
in response to a request from the request source node, a calculation execution instruction is notified to the plurality of nodes by broadcast communication,
when a notification-dedicated instruction is sent from the CPU of a node among the plurality of nodes that have performed the calculation, the calculation result is notified to the calculation unit of the crossbar switch,
the calculation unit receives the notifications from the notification command generation units of the nodes and executes the calculation between the nodes according to the setting of the data notification flag register, and
when the calculation among all the nodes handled by the calculation unit is completed, the end notification unit broadcasts the calculation result to all the nodes.

The crossbar switch according to claim 21, wherein the data notification flag register is provided for each job ID and includes a register having bits corresponding to the number of nodes,
and the calculation unit waits for data from a node and continues to perform the calculation for as long as the corresponding bit of the register has a first value.

The crossbar switch according to claim 21, wherein, before the calculation at each node starts, the calculation request source node notifies a mask setting to the data notification flag register so that the calculation results of nodes not used in the job are not waited for, and, in the corresponding register of the data notification flag register, the bit corresponding to each node not used in the job is set to a second value, as an invalid node bit, as the mask information.

The crossbar switch according to claim 22, wherein, upon completion of the arrival of calculation data from a node, the bit corresponding to that node, in the register of the data notification flag register corresponding to the job, is set to the second value.

The crossbar switch according to claim 21, wherein the notification reception storage buffer and the calculation result storage buffer manage the addresses within each buffer at which data is stored according to a job ID that identifies a job.

The crossbar switch according to claim 25, wherein the calculation unit performs the calculation between the nodes according to the setting of the data notification flag register,
and the calculation unit passes the calculation result together with the job ID to the calculation result storage buffer, and the data is stored at the address of the designated job ID.

The crossbar switch according to claim 21, wherein the end notification unit enables output notification when all the bits of the register corresponding to the job ID in the data notification flag register have the second value, and, upon completion of the calculation by the calculation unit, performs control to notify the nodes of the final result by broadcast communication.

The crossbar switch according to claim 21, wherein the data notification flag register includes a counter for each job ID, in which the number of nodes taking part in the calculation is set,
and each time the calculation unit performs a calculation for the job ID, the corresponding counter is counted down, and when the count value reaches 0 the end notification unit notifies each node of the calculation result.

The crossbar switch according to claim 21, wherein the data notification flag register includes a counter for each job ID,
and each time the calculation unit performs a calculation for the job ID, the corresponding counter is counted up, and when the count value reaches the number of nodes taking part in the calculation the calculation is regarded as finished and the end notification unit notifies each node of the calculation result.