JP2022522320A

JP2022522320A - Reconfigurable Computation Pod Using Optical Network

Info

Publication number: JP2022522320A
Application number: JP2021522036A
Authority: JP
Inventors: パティル，ニシャント; ジョウ，シアン; スウィング，アンドリュー
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2019-03-06
Filing date: 2019-12-18
Publication date: 2022-04-18
Anticipated expiration: 2039-12-18
Also published as: US11537443B2; JP7242847B2; CN117873727A; KR102583771B1; US11042416B2; KR20210063382A; BR112021007538A2; CN112889032B; US20200285524A1; EP3853732A1; CN112889032A; US20230161638A1; WO2020180387A1; US20210286656A1; KR102625118B1; KR20230141921A; JP2023078228A

Abstract

Methods, systems, and apparatus for generating clusters of building blocks of compute nodes using an optical network. In one aspect, a method includes receiving request data specifying requested compute nodes for a computing workload. The request data specifies a target n-dimensional configuration of the compute nodes. A subset of building blocks that, when combined, match the target configuration specified by the request data is selected from a superpod that includes a set of building blocks, each of which includes an m-dimensional configuration of compute nodes. The set of building blocks is connected to an optical network that includes one or more optical circuit switches. A workload cluster of compute nodes that includes the subset of building blocks is generated. Generating the workload cluster includes configuring, for each dimension of the workload cluster, respective routing data for the one or more optical circuit switches.

Description

Background
Some computational workloads, such as machine learning training, require a large number of processing nodes to complete efficiently. The processing nodes can communicate with each other over an interconnect network. In machine learning training, for example, the processing nodes communicate with each other to converge on an optimal deep learning model. The interconnect network is critical to the speed and efficiency with which the processing units achieve convergence.

Because machine learning and other workloads vary in size and complexity, a rigid supercomputer structure that includes multiple processing nodes can limit the availability, scalability, and performance of the supercomputer. For example, if some processing nodes of a supercomputer that has a fixed interconnect network connecting a specific configuration of processing nodes fail, the supercomputer may be unable to replace the failed nodes, which reduces availability and performance. In addition, some specific configurations can perform better than others independent of any failed nodes.

Summary
This specification describes technology for reconfiguring a superpod of compute nodes using an optical network to generate workload clusters.

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include receiving request data specifying requested compute nodes for a computing workload. The request data specifies a target n-dimensional configuration of the compute nodes, where n is greater than or equal to two. A subset of building blocks that, when combined, match the target n-dimensional configuration specified by the request data is selected from a superpod that includes a set of building blocks, each of which includes an m-dimensional configuration of compute nodes, where m is greater than or equal to two. The set of building blocks is connected to an optical network that includes one or more optical circuit switches for each of the n dimensions. A workload cluster of compute nodes that includes the subset of building blocks is generated. The workload cluster is a cluster of compute nodes dedicated to computing or executing a particular computing workload. Generating the workload cluster includes configuring, for each dimension of the workload cluster, respective routing data for each of the one or more optical circuit switches for that dimension. The routing data for each dimension specifies how data of the computing workload is routed between compute nodes along that dimension of the workload cluster. The compute nodes of the workload cluster then execute the computing workload.

These and other implementations can each optionally include one or more of the following features. In some aspects, the request data specifies different types of compute nodes. Selecting the subset of building blocks can include selecting, for each type of compute node specified by the request data, a building block that includes one or more compute nodes of the specified type.

In some aspects, the respective routing data for each dimension of the superpod includes an optical circuit switch routing table for one of the one or more optical circuit switches. In some aspects, the optical network includes, for each of the n dimensions, one or more optical circuit switches that route data between compute nodes along that dimension. Each building block can include multiple segments of compute nodes along each dimension of the building block. The optical network can include, for each segment of each dimension, an optical circuit switch that routes data between the corresponding segments of compute nodes of each building block in the workload cluster.

In some aspects, each building block includes one of a three-dimensional torus of compute nodes or a mesh of compute nodes. In some aspects, the superpod includes multiple workload clusters. Each workload cluster can include a different subset of the building blocks and can execute a different workload than each other workload cluster.

Some aspects include receiving data indicating that a given building block of the workload cluster has failed and replacing the given building block with an available building block. Replacing the given building block with an available building block can include updating routing data of one or more optical circuit switches of the optical network to stop routing data between the given building block and one or more other building blocks of the workload cluster, and updating routing data of the one or more optical circuit switches to route data between the available building block and the one or more other building blocks of the workload cluster.

In some aspects, selecting the subset of building blocks that, when combined, match the target n-dimensional configuration specified by the request data includes determining that the n-dimensional configuration specified by the request data requires a first quantity of building blocks that exceeds a second quantity of healthy, available building blocks in the superpod. In response to that determination, one or more second computing workloads that have a lower priority than the computing workload and that are being executed by other building blocks of the superpod are identified, and one or more building blocks of the one or more second computing workloads are reassigned to the workload cluster for the computing workload. Generating the workload cluster of compute nodes that includes the subset of building blocks can include including the one or more building blocks of the one or more second computing workloads in the subset of building blocks.
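The preemption step described above could be sketched roughly as follows. This is a minimal illustration, not the claimed method: the `Block` class, the `select_with_preemption` function, and the numeric priority scheme (a larger number means higher priority) are all hypothetical names and conventions introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    block_id: int
    healthy: bool = True
    workload: Optional[str] = None  # None means the block is available


def select_with_preemption(pool, needed, new_priority, priorities):
    """Pick `needed` healthy blocks, borrowing from lower-priority workloads if required.

    `priorities` maps a workload name to a number (assumption: larger = more important).
    Returns a list of blocks, or None if the request cannot be satisfied.
    """
    # First use blocks that are both healthy and free.
    chosen = [b for b in pool if b.healthy and b.workload is None][:needed]
    if len(chosen) < needed:
        # Not enough free blocks: consider healthy blocks running lower-priority
        # workloads, taking the lowest-priority ones first.
        victims = sorted(
            (b for b in pool
             if b.healthy and b.workload is not None
             and priorities[b.workload] < new_priority),
            key=lambda b: priorities[b.workload],
        )
        chosen += victims[:needed - len(chosen)]
    return chosen if len(chosen) == needed else None


pool = [
    Block(0),
    Block(1, healthy=False),
    Block(2, workload="batch"),
    Block(3, workload="serving"),
]
priorities = {"batch": 1, "serving": 9}
chosen = select_with_preemption(pool, needed=2, new_priority=5, priorities=priorities)
print([b.block_id for b in chosen])  # [0, 2]
```

In this sketch the failed block 1 is never selected, and block 3 is left alone because its workload outranks the new one. A real scheduler would also have to stop or migrate the preempted workload, which is outside the scope of this example.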

In some aspects, generating the workload cluster of compute nodes that includes the subset of building blocks includes reconfiguring, for each dimension of the workload cluster, the respective routing data for each of the one or more optical circuit switches for that dimension, such that each of the one or more building blocks of the one or more second computing workloads communicates with the other building blocks of the workload cluster rather than with the building blocks of the one or more second computing workloads.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. Using an optical network to dynamically configure clusters of compute nodes for workloads makes it easy to replace failed or offline compute nodes with other compute nodes, which results in higher availability of the compute nodes. The flexibility to rearrange compute nodes results in higher performance, allows the appropriate number of compute nodes to be allocated to each workload more efficiently, and allows the arrangement of compute nodes for each workload to be optimized (or improved). With superpods that include multiple types of compute nodes, workload clusters can be generated that include not only the appropriate number and arrangement of compute nodes but also the appropriate types of compute nodes for each workload, without being limited to compute nodes that are physically close to each other (e.g., directly connected to each other and/or adjacent to each other in the same rack) in a data center or other location. Instead, the optical network enables workload clusters of various shapes in which the compute nodes operate as if they were adjacent to each other, even though they can be in arbitrary physical locations relative to each other.

Using an optical network to configure the pods also provides fault isolation and better security for the workloads. For example, some conventional supercomputers route traffic between the various computers that make up the supercomputer, so if one of the computers fails, that communication path is lost. With an optical network, data can be rerouted quickly and/or a failed compute node can be replaced with an available compute node. In addition, the physical isolation between workloads provided by optical circuit switching (OCS) switches, e.g., the physical separation of different light paths, provides better security between the various workloads running in the same superpod than managing the separation in software, which can be vulnerable.

Using an optical network to connect the building blocks can also reduce the latency of transmitting data between building blocks relative to packet-switched networks. In packet switching, for example, latency is higher because each switch has to receive a packet, buffer it, and send it out again on another port. Connecting the building blocks with OCS switches provides a true end-to-end light path with no packet switching or buffering in between.

Various features and advantages of the foregoing subject matter are described below with reference to the figures. Additional features and advantages are apparent from the subject matter described herein and from the claims.

FIG. 1 is a block diagram of an environment in which an example processing system generates workload clusters of compute nodes and executes computing workloads using the workload clusters.

FIG. 2 illustrates an example logical superpod and example workload clusters generated from a portion of the building blocks in the superpod.

FIG. 3 illustrates an example building block and example workload clusters generated using the building block.

FIG. 4 illustrates an example optical link from a compute node to an optical circuit switching (OCS) switch.

FIG. 5 illustrates a logical compute tray for forming a building block.

FIG. 6 illustrates a sub-block of an example building block with one dimension omitted.

FIG. 7 illustrates an example building block.

FIG. 8 illustrates an OCS fabric topology for a superpod.

FIG. 9 illustrates components of an example superpod.

FIG. 10 is a flow diagram of an example process for generating a workload cluster and executing a computing workload using the workload cluster.

FIG. 11 is a flow diagram of an example process for reconfiguring an optical network to replace a failed building block.

Detailed Description
Like reference numbers and designations in the various drawings indicate like elements.

In general, the systems and techniques described herein can configure an optical network fabric to generate workload clusters of compute nodes from a superpod. A superpod is a group of multiple building blocks of compute nodes that are connected to each other via an optical network. For example, a superpod can include a set of interconnected building blocks. Each building block can include multiple compute nodes in an m-dimensional configuration, e.g., a two-dimensional or three-dimensional configuration.

A user can specify a target configuration of compute nodes for a particular workload. For example, the user can provide a machine learning workload and specify a target configuration of compute nodes to perform the machine learning computations. The target configuration can define the number of compute nodes along each of n dimensions, where n is, for example, two or more. That is, the target configuration can define the size and shape of the workload cluster. For example, some machine learning models and computations perform better on non-square topologies.

Cross-sectional bandwidth can also limit compute throughput, e.g., when compute nodes wait on data transfers and leave compute cycles idle. Depending on how the work is allocated across the compute nodes and how much data needs to be transferred across the network in the various dimensions, the shape of the workload cluster can affect the performance of the compute nodes in the workload cluster.

For workloads that have data traffic from every compute node to every other compute node, a cube-shaped workload cluster minimizes the number of hops between compute nodes. If a workload has a lot of local communication, in which data is passed to an adjacent set of compute nodes along a particular dimension and many of these neighbor-to-neighbor transfers are chained together, a configuration with more compute nodes along that dimension than along the other dimensions is advantageous. Enabling users to specify the configuration of the compute nodes in a workload cluster therefore lets them choose a configuration that provides better performance for their workloads.
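To make the hop-count argument concrete, the following sketch compares the mean number of hops between compute nodes for three cluster shapes with the same node count. It rests on assumptions that are not stated in the text above: the cluster is a torus with wraparound links in every dimension, traffic is uniform between all pairs of nodes, and the mean includes the zero-hop case of a node paired with itself. The function names are hypothetical.

```python
def mean_ring_hops(n):
    """Mean shortest-path hops from one node to a uniformly chosen node on an n-node ring."""
    return sum(min(d, n - d) for d in range(n)) / n


def mean_torus_hops(shape):
    """Mean hops in a torus: the per-dimension ring distances add up."""
    return sum(mean_ring_hops(n) for n in shape)


# Three 512-node shapes; the cube has the lowest mean hop count.
for shape in [(8, 8, 8), (4, 8, 16), (2, 8, 32)]:
    print(shape, mean_torus_hops(shape))
# (8, 8, 8) 6.0
# (4, 8, 16) 7.0
# (2, 8, 32) 10.5
```

Under these assumptions the elongated shapes pay for their long dimension in all-to-all traffic, which is why they are only attractive when most communication runs along that dimension.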

If a superpod includes different types of compute nodes, the request can also specify the number of compute nodes of each type to include in the workload cluster. This allows users to specify an arrangement of compute nodes that performs better for a particular workload.

A workload scheduler can select building blocks for a workload cluster based on, for example, the availability of the building blocks, the health of the building blocks (e.g., working or failed), and/or the priority of the workloads in the superpod (e.g., the priority of workloads that are or will be executed by the compute nodes of the superpod). The workload scheduler can provide data identifying the selected building blocks and the target configuration of the building blocks to an optical circuit switching (OCS) manager. The OCS manager can then generate the workload cluster by configuring one or more OCS switches of the optical network to connect the building blocks to each other. The workload scheduler can then execute the computing workload on the compute nodes of the workload cluster.

If one of the building blocks of a workload cluster fails, the failed building block can be replaced quickly with another building block simply by reconfiguring the OCS switches. For example, the workload scheduler can select an available building block in the superpod to replace the failed building block. The workload scheduler can instruct the OCS manager to replace the failed building block with the selected building block. The OCS manager can then reconfigure the OCS switches so that the selected building block is connected to the other building blocks of the workload cluster and the failed building block is no longer connected to them.
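The replacement step above amounts to computing which block-to-block connections must be torn down and which must be set up. The sketch below illustrates this for a single dimension under a simplifying assumption that is not from the text: the building blocks along that dimension form a ring, and each OCS connection is modeled as an ordered pair of block identifiers. The names `ring_links` and `replace_block` are hypothetical, and real OCS routing tables map ports rather than block names.

```python
def ring_links(blocks):
    """Connections needed so that the listed blocks form a ring along one dimension."""
    n = len(blocks)
    return {(blocks[i], blocks[(i + 1) % n]) for i in range(n)}


def replace_block(blocks, failed, spare):
    """Swap `failed` for `spare` and report which connections change.

    Returns (new block order, connections to remove, connections to add).
    """
    new_blocks = [spare if b == failed else b for b in blocks]
    old, new = ring_links(blocks), ring_links(new_blocks)
    return new_blocks, old - new, new - old


new_blocks, to_remove, to_add = replace_block(["A", "B", "C", "D"], failed="C", spare="H")
print(new_blocks)         # ['A', 'B', 'H', 'D']
print(sorted(to_remove))  # [('B', 'C'), ('C', 'D')]
print(sorted(to_add))     # [('B', 'H'), ('H', 'D')]
```

Only the connections that touch the failed block change; the link between A and B and the wraparound link between D and A are left as they are, which is consistent with the idea that replacement requires nothing more than a switch reconfiguration.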

FIG. 1 is a block diagram of an environment 100 in which an example processing system 130 generates workload clusters of compute nodes and executes computing workloads using the workload clusters. The processing system 130 can receive computing workloads 112 from user devices 110 over a data communication network 120, e.g., a local area network (LAN), a wide area network (WAN), the Internet, a mobile network, or a combination thereof. Example workloads 112 include software applications, machine learning workloads (e.g., training and/or using machine learning models), video encoding and decoding, and digital signal processing workloads.

A user can specify a requested cluster 114 of compute nodes for a workload 112. For example, the user can specify a target shape and a target size for the requested cluster of compute nodes. That is, the user can specify the number of compute nodes and the shape of their arrangement across multiple dimensions. For example, if the compute nodes are arranged along three dimensions x, y, and z, the user can specify the number of compute nodes in each dimension. The user can also specify one or more types of compute nodes to include in the cluster. As described below, the processing system 130 can include different types of compute nodes.
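One way to picture the requested cluster 114 is as a small record holding the target shape and, optionally, the wanted node types. The sketch below is only an illustration: the class name `ClusterRequest`, its fields, and the type labels are hypothetical and do not come from the specification. The check for at least two dimensions mirrors the statement that n is two or more.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class ClusterRequest:
    shape: tuple                                    # compute nodes along each dimension, e.g. (8, 8, 16)
    node_types: dict = field(default_factory=dict)  # optional, e.g. {"TPU": 1024}

    def __post_init__(self):
        if len(self.shape) < 2:
            raise ValueError("target configuration needs at least two dimensions")
        if any(n < 1 for n in self.shape):
            raise ValueError("each dimension needs at least one compute node")

    @property
    def total_nodes(self):
        """Cluster size implied by the requested shape."""
        return prod(self.shape)


request = ClusterRequest(shape=(8, 8, 16), node_types={"TPU": 1024})
print(request.total_nodes)  # 1024
```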

As described below, the processing system 130 can use building blocks to generate a workload cluster that matches the target shape and target size of the requested cluster. Each building block can include multiple compute nodes arranged in m dimensions, e.g., three dimensions. The user can therefore specify the target shape and target size as a number of building blocks in each of the dimensions. For example, the processing system 130 can provide a user interface to the user device 110 that enables the user to select up to a maximum number of building blocks along each dimension.
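The relationship between a target shape in compute nodes and a target shape in building blocks is simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes that each target dimension is a whole multiple of the corresponding building block dimension; the text does not say how other sizes would be handled.

```python
from math import prod


def blocks_per_dimension(target_shape, block_shape):
    """Return how many building blocks are needed along each dimension."""
    if len(target_shape) != len(block_shape):
        raise ValueError("target and building block must have the same number of dimensions")
    counts = []
    for target, block in zip(target_shape, block_shape):
        if target % block != 0:
            # Assumption of this sketch: partial building blocks are not allowed.
            raise ValueError(f"{target} nodes is not a whole number of {block}-node blocks")
        counts.append(target // block)
    return tuple(counts)


counts = blocks_per_dimension((8, 8, 16), (4, 4, 4))
total_blocks = prod(counts)
print(counts, total_blocks)  # (2, 2, 4) 16
```

For example, an 8×8×16 cluster built from 4×4×4 building blocks is a 2×2×4 arrangement of 16 building blocks.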

The user device 110 can provide the workload 112 and data specifying the requested cluster 114 to the processing system 130. For example, the user device 110 can send request data that includes the workload 112 and the data specifying the requested cluster 114 to the processing system 130 over the network 120.

The processing system 130 includes a cell scheduler 140 and one or more cells 150. A cell 150 is a group of one or more superpods. For example, the illustrated cell 150 includes four superpods 152-158. Each superpod 152-158 includes a set of building blocks 160, also referred to herein as a pool of building blocks. In this example, each superpod 152-158 includes 64 building blocks 160. However, the superpods 152-158 can include other quantities of building blocks 160, e.g., 20, 50, 100, or another appropriate quantity. The superpods 152-158 can also include different quantities of building blocks 160 from one another. For example, the superpod 152 can include 64 building blocks while another superpod, e.g., the superpod 154, includes 100 building blocks.

As described in more detail below, each building block 160 can include multiple compute nodes arranged in two or more dimensions. For example, a building block 160 can include 64 compute nodes arranged in three dimensions with four compute nodes along each dimension. This arrangement of compute nodes is referred to in this document as a 4×4×4 building block, with four compute nodes along the x dimension, four along the y dimension, and four along the z dimension. Other quantities of dimensions, e.g., two dimensions, and other quantities of compute nodes along each dimension are also possible, e.g., 3×1, 2×2×2, 6×2, and 2×3×4.

A building block can also include a single compute node. As described below, however, workload clusters are generated by configuring the optical links between building blocks to connect the building blocks to each other. Smaller building blocks, e.g., building blocks with a single compute node, therefore provide more flexibility in generating workload clusters, but they require more OCS switch configurations and more optical network components (e.g., cables and switches). The number of compute nodes in a building block can be selected based on a tradeoff between the desired flexibility of the workload clusters and the cost of connecting the building blocks together to generate the workload clusters, including the number of OCS switches required.
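The tradeoff described above can be illustrated with a back-of-the-envelope calculation. The sketch below relies on assumptions that are not stated in this section: building blocks are cubes with `side` compute nodes per edge, and every compute node on a face of the cube exposes one optical link through that face, so a building block has 6 × side² external optical links. The function name and the 4096-node pool size are likewise chosen only for illustration.

```python
def block_tradeoff(total_nodes, side):
    """Return (number of blocks, optical links per block, total optical links).

    Assumes cubic building blocks and one optical link per face node per face.
    """
    nodes_per_block = side ** 3
    if total_nodes % nodes_per_block != 0:
        raise ValueError("total_nodes must be a whole number of building blocks")
    num_blocks = total_nodes // nodes_per_block
    links_per_block = 6 * side ** 2
    return num_blocks, links_per_block, num_blocks * links_per_block


# Same 4096-node pool, split into building blocks of different sizes.
for side in (1, 2, 4, 8):
    print(side, block_tradeoff(4096, side))
# 1 (4096, 6, 24576)
# 2 (512, 24, 12288)
# 4 (64, 96, 6144)
# 8 (8, 384, 3072)
```

Under these assumptions, halving the building block edge doubles the number of optical links that the OCS fabric has to terminate, while multiplying by eight the number of independently assignable building blocks.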

Each compute node of a building block 160 can include an application-specific integrated circuit (ASIC), e.g., a tensor processing unit (TPU) for machine learning workloads, a graphics processing unit (GPU), or another type of processing unit. For example, each compute node can be a single processor chip that includes a processing unit.

In some implementations, all building blocks 160 in a superpod have the same type of compute nodes. For example, the superpod 152 can include 64 building blocks, each having 64 TPUs in a 4×4×4 configuration, for executing machine learning workloads. A superpod can also include different types of compute nodes. For example, the superpod 154 can include 60 building blocks that have TPUs and 4 building blocks that have special-purpose processing units that perform tasks other than machine learning workloads. In this way, the workload cluster for a workload can include different types of compute nodes. A superpod can include multiple building blocks of each type of compute node for redundancy and/or to allow multiple workloads to run in the superpod.
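Selecting building blocks by compute node type from a mixed superpod, such as the 60 + 4 example above, could look roughly like the following. This is a sketch under assumptions: the pool is modeled as a list of (block id, node type) pairs, the type labels `"TPU"` and `"OTHER"` are placeholders, and the function name is hypothetical. Availability and health checks are left out to keep the example focused on types.

```python
def select_by_type(pool, wanted):
    """Pick building blocks per node type.

    pool:   list of (block_id, node_type) pairs
    wanted: dict mapping node_type to the number of building blocks requested
    Returns a dict of node_type -> list of block ids, or None if a type runs short.
    """
    chosen = {}
    for node_type, count in wanted.items():
        matches = [block_id for block_id, t in pool if t == node_type]
        if len(matches) < count:
            return None
        chosen[node_type] = matches[:count]
    return chosen


# A mixed superpod: 60 TPU building blocks and 4 with other processing units.
pool = [(i, "TPU") for i in range(60)] + [(i, "OTHER") for i in range(60, 64)]
selection = select_by_type(pool, {"TPU": 8, "OTHER": 1})
print(selection)  # {'TPU': [0, 1, 2, 3, 4, 5, 6, 7], 'OTHER': [60]}
```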

In some implementations, all building blocks 160 in a superpod have the same configuration, e.g., the same size and shape. For example, each building block 160 of the superpod 152 can have a 4×4×4 configuration. A superpod can also include building blocks with different configurations. For example, the superpod 154 can include 32 building blocks in a 4×4×4 configuration and 32 building blocks in a 16×8×16 configuration. The different building block configurations can include the same or different compute nodes. For example, the building blocks that have TPUs can have a different configuration than the building blocks that have GPUs.

A superpod can include building blocks at different levels of a hierarchy. For example, the superpod 152 can include base-level building blocks that have a 4x4x4 configuration. The superpod 152 can also include intermediate-level building blocks that have more compute nodes. For example, an intermediate-level building block can have an 8x8x8 configuration formed from, e.g., eight base-level building blocks. In this way, larger workload clusters can be generated from intermediate-level building blocks with fewer links than would be needed to build the same cluster by connecting base-level building blocks. Including base-level building blocks in the superpod preserves the flexibility to form smaller workload clusters that do not need the number of compute nodes in an intermediate-level building block.

The superpods 152-158 in a cell 150 can include building blocks with the same or different types of compute nodes. For example, a cell 150 can include one or more superpods of TPU building blocks and one or more superpods of GPU building blocks. The size and shape of the building blocks can be the same or different across the superpods 152-158 of the cell 150.

Each cell 150 also includes shared data storage 162 and shared auxiliary computing components 164. Each superpod 152-158 in the cell 150 can use the shared data storage 162, e.g., to store data generated by the workloads running in the superpods 152-158. The shared data storage 162 can include hard drives, solid state drives, flash memory, and/or other appropriate data storage devices. The shared auxiliary computing components 164 can include CPUs (e.g., general-purpose CPU machines), GPUs, and/or other accelerators (e.g., video decoding or image decoding accelerators) that are shared within the cell 150. The auxiliary computing components 164 can also include storage appliances, memory appliances, and/or other computing components that can be shared by the compute nodes over a network.

The cell scheduler 140 can select a cell 150 and/or a superpod 152-158 of a cell 150 for each workload received from a user device 110. The cell scheduler 140 can select a superpod based on the target configuration specified for the workload, the availability of the building blocks 160 in the superpods 152-158, and the health of the building blocks in the superpods 152-158. For example, the cell scheduler 140 can select a superpod that has at least enough available and healthy building blocks to generate a workload cluster with the target configuration. If the request data specifies a type of compute node, the cell scheduler 140 can select a superpod that has at least enough available and healthy building blocks containing compute nodes of that type.

As described below, each superpod 152-158 can include a workload scheduler and an OCS manager. When the cell scheduler 140 selects a superpod of a cell 150, it can provide data specifying the workload and the requested cluster to the workload scheduler of that superpod. As described in more detail below, the workload scheduler can select, from the building blocks of the superpod, a set of building blocks to connect into a workload cluster, based on the availability and health of the building blocks and, optionally, the priority of the workloads in the superpod. For example, if the workload scheduler receives a request for a workload cluster that needs more building blocks than the superpod has available and healthy, it can reassign building blocks that are running lower-priority workloads to the requested workload cluster. The workload scheduler can provide data identifying the selected building blocks to the OCS manager. The OCS manager can then generate the workload cluster by configuring one or more OCS switches to connect the building blocks to each other. The workload scheduler can then execute the workload on the compute nodes of the workload cluster.
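The block selection and reassignment step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `choose_blocks`, the dictionary layout, and the rule "preempt the lowest-priority workloads first" are hypothetical and are not taken from the specification, which only says that blocks running lower-priority workloads can be reassigned. The sketch also reassigns individual blocks, whereas a real scheduler would presumably stop an entire lower-priority workload.

```python
def choose_blocks(blocks, needed, priority):
    """Pick `needed` building blocks for a workload of the given priority.

    blocks: dict mapping block id -> {"healthy": bool, "priority": int or None}
            (priority None means the block is not running any workload).
    Returns a list of block ids, or None if the request cannot be satisfied.
    """
    # Healthy blocks that are not assigned to any workload.
    free = [b for b, s in blocks.items() if s["healthy"] and s["priority"] is None]
    if len(free) >= needed:
        return free[:needed]

    # Otherwise consider healthy blocks running strictly lower-priority workloads,
    # lowest priority first (an assumed policy, not one stated in the source).
    preemptible = sorted(
        (b for b, s in blocks.items()
         if s["healthy"] and s["priority"] is not None and s["priority"] < priority),
        key=lambda b: blocks[b]["priority"],
    )
    chosen = free + preemptible[: needed - len(free)]
    return chosen if len(chosen) == needed else None


blocks = {
    0: {"healthy": True, "priority": None},   # free and healthy
    1: {"healthy": True, "priority": 1},      # running a low-priority workload
    2: {"healthy": True, "priority": 5},      # running a high-priority workload
    3: {"healthy": False, "priority": None},  # failed block
}
print(choose_blocks(blocks, 2, priority=3))
```

The returned list corresponds to the data the workload scheduler would hand to the OCS manager.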

In some implementations, the cell scheduler 140 balances load across the cells 150 and the superpods 152-158, e.g., when selecting a superpod 152-158 for a workload. For example, when choosing between two or more superpods that each have building blocks capable of handling the workload, the cell scheduler 140 can select the superpod with the most capacity, e.g., the most available and healthy building blocks, or a superpod of the cell with the most overall capacity. An available and healthy building block is one that is not executing another workload, is not part of an active workload cluster, and is not in a failed state. The cell scheduler can maintain an index of the building blocks. For each building block, the index can include data indicating whether the building block is healthy (e.g., not in a failed state) and/or available (e.g., not executing another workload and not part of an active workload cluster).
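A minimal sketch of such an index and of capacity-based superpod selection is shown below. The names `BlockStatus`, `usable_count`, and `select_superpod`, and the superpod labels, are assumptions made for illustration; the specification does not define a data layout for the index.

```python
from dataclasses import dataclass


@dataclass
class BlockStatus:
    healthy: bool    # False if the building block is in a failed state
    available: bool  # False if the block belongs to an active workload cluster


def usable_count(index):
    """Number of building blocks that are both healthy and available."""
    return sum(1 for status in index.values() if status.healthy and status.available)


def select_superpod(superpods, blocks_needed):
    """Return the superpod with the most usable blocks, or None if none has enough."""
    candidates = {
        name: usable_count(index)
        for name, index in superpods.items()
        if usable_count(index) >= blocks_needed
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


superpods = {
    "superpod_152": {
        0: BlockStatus(True, True),
        1: BlockStatus(True, False),
        2: BlockStatus(False, True),
    },
    "superpod_154": {
        0: BlockStatus(True, True),
        1: BlockStatus(True, True),
        2: BlockStatus(True, True),
    },
}
print(select_superpod(superpods, blocks_needed=2))
```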

In some implementations, the cell scheduler 140 can determine the target configuration for a workload. For example, the cell scheduler 140 can determine a target configuration of building blocks based on the estimated computational demand of the workload and the throughput of one or more types of available compute nodes. In this example, the cell scheduler 140 can provide the determined target configuration to the workload scheduler of the superpod.

FIG. 2 shows an example logical superpod 210 and example workload clusters 220, 230, and 240 generated from some of the building blocks in the superpod 210. In this example, the superpod 210 includes 64 building blocks, each having a 4x4x4 configuration. Although many of the examples in this document are described in terms of 4x4x4 building blocks, the same techniques can be applied to building blocks with other configurations.

In the superpod 210, the hatched building blocks are assigned to a workload, as described below. The white building blocks are available and healthy. The black building blocks are unhealthy blocks that cannot be used to generate a workload cluster, e.g., due to a failure.

The workload cluster 220 is an 8x8x4 pod that includes four of the 4x4x4 building blocks of the superpod 210. That is, the workload cluster 220 has eight compute nodes along the x-dimension, eight compute nodes along the y-dimension, and four compute nodes along the z-dimension. Because each building block has four compute nodes along each dimension, the workload cluster 220 includes two building blocks along the x-dimension, two building blocks along the y-dimension, and one building block along the z-dimension.

The four building blocks of the workload cluster 220 are drawn with diagonal hatching to show their positions in the superpod 210. As shown, the building blocks of the workload cluster 220 are not adjacent to one another. As described in more detail below, the optical network allows a workload cluster to be generated from any combination of building blocks in the superpod 210, regardless of their relative positions in the superpod 210.

The workload cluster 230 is an 8x8x8 pod that includes eight of the building blocks of the superpod 210. Specifically, the workload cluster 230 includes two building blocks along each dimension, which gives it eight compute nodes along each dimension. The building blocks of the workload cluster 230 are drawn with vertical hatching to show their positions in the superpod 210.

The workload cluster 240 is a 16x8x16 pod that includes 32 of the building blocks of the superpod 210. Specifically, the workload cluster 240 includes four building blocks along the x-dimension, two building blocks along the y-dimension, and four building blocks along the z-dimension. This gives the workload cluster 16 compute nodes along the x-dimension, eight compute nodes along the y-dimension, and 16 compute nodes along the z-dimension. The building blocks of the workload cluster 240 are drawn with cross-hatching to show their positions in the superpod 210.
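The arithmetic behind the three example clusters can be checked with a few lines of code: dividing the target configuration by the building block size in each dimension gives the number of blocks per dimension, and the product gives the total. The helper name `blocks_per_dimension` is hypothetical, and the sketch assumes rectangular targets that are whole multiples of the block size.

```python
from math import prod


def blocks_per_dimension(target, block=(4, 4, 4)):
    """Number of building blocks needed along each dimension of a target configuration."""
    if len(target) != len(block):
        raise ValueError("target and block must have the same number of dimensions")
    if any(t % b for t, b in zip(target, block)):
        raise ValueError("each target dimension must be a multiple of the block size")
    return tuple(t // b for t, b in zip(target, block))


# Workload clusters 220, 230 and 240 from FIG. 2.
for target in [(8, 8, 4), (8, 8, 8), (16, 8, 16)]:
    per_dim = blocks_per_dimension(target)
    print(target, "->", per_dim, "=", prod(per_dim), "building blocks")
```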

The workload clusters 220, 230, and 240 are just a few examples of the clusters of the superpod 210 that can be generated for workloads. Many other workload cluster configurations are possible. Although the example workload clusters 220, 230, and 240 are rectangular, other shapes are also possible.

The shapes of workload clusters, including the workload clusters 220, 230, and 240, are logical shapes rather than physical shapes. The optical network is configured so that the building blocks communicate along each dimension as if the workload cluster were physically connected in the logical configuration. The physical building blocks and their compute nodes, however, can be arranged in the data center in many ways. The building blocks of the workload clusters 220, 230, and 240 can be selected from any of the available and healthy building blocks, with no constraint on the physical relationship between the building blocks in the superpod 210 other than that all of the building blocks in the superpod 210 are connected to the optical network. For example, as described above and shown in FIG. 2, the workload clusters 220, 230, and 240 include building blocks that are not physically adjacent.

In addition, the logical configuration of a workload cluster is not limited by the physical arrangement of the building blocks in the superpod. For example, the building blocks may be physically arranged in eight rows and eight columns, with only one building block along the z-dimension. A workload cluster can nevertheless be given a logical configuration with multiple building blocks along the z-dimension by configuring the optical network accordingly.

FIG. 3 shows an example building block 310 and example workload clusters 320, 330, and 340 generated using the building block 310. The building block 310 is a 4x4x4 building block with four compute nodes along each dimension. In this example, each dimension of the building block 310 includes 16 segments of four compute nodes each. For example, there are 16 compute nodes on the top of the building block 310. For each of these 16 compute nodes, there is a segment along the y-dimension that includes that compute node and three other compute nodes, ending with the corresponding last compute node on the bottom of the building block 310. For example, one segment along the y-dimension includes the compute nodes 301-304.

The compute nodes within the building block 310 can be connected to each other by internal links 318 made of conductive material, e.g., copper cables. The compute nodes in each segment of each dimension can be connected using the internal links 318. For example, one internal link 318 connects the compute node 301 to the compute node 302, another connects the compute node 302 to the compute node 303, and another connects the compute node 303 to the compute node 304. The compute nodes in the other segments can be connected in the same way to provide internal data communication between the compute nodes of the building block 310.

The building block 310 also includes external links 311-316 that connect the building block 310 to the optical network. The optical network connects the building block 310 to other building blocks. In this example, the building block 310 includes 16 external input links 311 for the x-dimension, i.e., one external input link 311 for each of the 16 segments along the x-dimension. Similarly, the building block 310 includes an external output link 312 for each segment along the x-dimension, an external input link 313 for each segment along the y-dimension, an external output link 314 for each segment along the y-dimension, an external input link 315 for each segment along the z-dimension, and an external output link 316 for each segment along the z-dimension. Because some building blocks can have configurations with more than three dimensions, e.g., a torus, the building block 310 can include similar external links for each of its dimensions.
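The link counts above follow directly from the block shape: a segment along one dimension is identified by the fixed coordinates in the other dimensions, so a 4x4x4 block has 4 x 4 = 16 segments per dimension, each with one external input link and one external output link. The sketch below enumerates them; the function name `segments` and the dictionary layout are illustrative assumptions.

```python
from itertools import product


def segments(shape, dim):
    """List the segments along `dim`, each identified by its fixed coordinates
    in the remaining dimensions."""
    others = [range(n) for i, n in enumerate(shape) if i != dim]
    return list(product(*others))


shape = (4, 4, 4)
links = {}
for dim, name in enumerate("xyz"):
    segs = segments(shape, dim)
    # One external input link and one external output link per segment.
    links[name] = {"in": len(segs), "out": len(segs)}

total = sum(v["in"] + v["out"] for v in links.values())
print(links)
print("total external links:", total)
```

For a 4x4x4 block this gives 16 input and 16 output links per dimension, 96 external links in total.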

Each external link 311-316 can be an optical fiber link that connects a compute node on its corresponding segment to the optical network. For example, each external link 311-316 can connect its compute node to an OCS switch of the optical network. As described below, the optical network can include one or more OCS switches for each dimension in which the building block 310 has segments. That is, the external links 311 and 312 for the x-dimension can be connected to different OCS switches than the external links 313 and 314 for the y-dimension. As described in more detail below, the OCS switches can be configured to connect the building block to other building blocks to generate workload clusters.

The building block 310 has a 4x4x4 mesh configuration. Other configurations are possible for a 4x4x4 (or other size) building block. For example, the building block 310 can have a three-dimensional torus configuration with wraparound torus links, like the workload cluster 320. The workload cluster 320 can be generated from a single mesh building block 310 by configuring the optical network to provide the wraparound torus links 321-323.

The torus links 321-323 provide wraparound data communication between one end of each segment and the other end of the same segment. For example, the torus links 321 connect the compute node at each end of each segment along the x-dimension to the compute node at the other end of that segment. The torus links 321 can include a link that connects the compute node 325 to the compute node 326. Similarly, the torus links 322 can include a link that connects the compute node 325 to the compute node 327.
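The difference between the mesh and torus configurations can be illustrated by listing the neighbors of a node. In a mesh, a node at the end of a segment has no neighbor beyond the boundary; with wraparound links, its neighbor is the node at the other end of the same segment. The function below is a generic sketch using zero-based coordinates, not a representation taken from the specification.

```python
def neighbors(coord, shape, torus):
    """Return the coordinates directly linked to `coord` in a mesh or torus."""
    result = []
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[dim] + step
            if torus:
                pos %= size            # wraparound link to the other end of the segment
            elif not 0 <= pos < size:
                continue               # mesh: no link past the boundary
            linked = list(coord)
            linked[dim] = pos
            result.append(tuple(linked))
    return result


corner = (0, 0, 0)
print(neighbors(corner, (4, 4, 4), torus=False))  # three neighbors in a mesh
print(neighbors(corner, (4, 4, 4), torus=True))   # six neighbors in a torus
```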

The torus links 321-323 can be conductive cables, e.g., copper cables, or optical links. For example, the optical links of the torus links 321-323 can connect their corresponding compute nodes to one or more OCS switches. The OCS switches can be configured to route data from one end of each segment to the other end of that segment. The building block 310 can include an OCS switch for each dimension. For example, the torus links 321 can be connected to a first OCS switch that routes data between one end of each segment along the x-dimension and the other end of that segment. Similarly, the torus links 322 can be connected to a second OCS switch that routes data between one end of each segment along the y-dimension and the other end of that segment. The torus links 323 can be connected to a third OCS switch that routes data between one end of each segment along the z-dimension and the other end of that segment.

The workload cluster 330 includes two building blocks 338 and 339 that form a 4x8x4 pod. Each of the building blocks 338 and 339 can be the same as the building block 310 or the workload cluster 320. The two building blocks are connected along the y-dimension using external links 337. For example, one or more OCS switches can be configured to route data between the y-dimension segments of the building block 338 and the y-dimension segments of the building block 339.

In addition, one or more OCS switches can be configured to provide wraparound links 331-333 between one end of each segment and the other end of that segment along all three dimensions. In this example, the wraparound links 333 connect one end of the y-dimension segments of the building block 338 to one end of the y-dimension segments of the building block 339, providing full wraparound communication for the y-dimension segments formed by combining the two building blocks 338 and 339.

The workload cluster 340 includes eight building blocks (one of which is not shown) that form an 8x8x8 cluster. Each building block 348 can be the same as the building block 310. Building blocks that are connected along the x-dimension are connected using external links 345A-345C. Similarly, building blocks that are connected along the y-dimension are connected using external links 344A-344C, and building blocks that are connected along the z-dimension are connected using external links 346A-346C. For example, one or more OCS switches can be configured to route data between the x-dimension segments, one or more OCS switches can be configured to route data between the y-dimension segments, and one or more OCS switches can be configured to route data between the z-dimension segments. Along each dimension there are additional external links that connect the building block not shown in FIG. 3 to its adjacent building blocks. In addition, one or more OCS switches can be configured to provide wraparound links 341-343 between one end of each segment and the other end of that segment along all three dimensions.
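One way to picture the per-dimension routing data is as a list of cross-connects, one list per dimension, in which the output link of each segment of a building block is patched to the input link of the same segment on the next block along that dimension, wrapping around at the end. The sketch below builds such lists. The function name `ocs_cross_connects`, the tuple representation, and the assumption that segment k of one block always connects to segment k of the next are all hypothetical; the specification does not give a concrete format for routing data.

```python
from itertools import product


def ocs_cross_connects(grid, segments_per_dim=16):
    """Build hypothetical routing data for a cluster of building blocks.

    grid: number of building blocks along (x, y, z).
    Returns {dimension: [((src_block, segment), (dst_block, segment)), ...]},
    where each pair patches an external output link to an external input link.
    """
    config = {}
    for dim, name in enumerate("xyz"):
        pairs = []
        for block in product(*(range(n) for n in grid)):
            nxt = list(block)
            # Next block along this dimension; the modulo provides the wraparound.
            nxt[dim] = (block[dim] + 1) % grid[dim]
            for seg in range(segments_per_dim):
                pairs.append(((block, seg), (tuple(nxt), seg)))
        config[name] = pairs
    return config


# A 4x8x4 pod like workload cluster 330: two 4x4x4 blocks along the y-dimension.
cfg = ocs_cross_connects((1, 2, 1))
print(len(cfg["y"]), cfg["y"][0])
print(cfg["x"][0])  # along x there is one block, so its segments wrap onto themselves
```

With a single block along a dimension, the cross-connects loop each segment back onto itself, which corresponds to the torus case of the workload cluster 320.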

FIG. 4 shows an example optical link 400 from a compute node to an OCS switch. The compute nodes of a superpod can be installed in trays of data center racks. Each compute node can include six high-speed electrical links. Two of the electrical links can be on a circuit board of the compute node, and four can be routed to external electrical connectors, e.g., Octal Small Form Factor Pluggable (OSFP) connectors, that connect to a port 410, e.g., an OSFP port. In this example, the port 410 is connected to an optical module 420 by electrical links 412. The optical module 420 can convert the electrical links to optical links to extend the length of the external links, e.g., to more than one kilometer (km), so that data communication can be provided between compute nodes in a large data center where needed. The type of optical module can vary based on the required length between the building blocks and the OCS switches and on the desired speed and bandwidth of the links.

The optical module 420 is connected to a circulator 430 by fiber optic cables 422 and 424. The fiber optic cables 422 can include one or more fiber optic cables for transmitting data from the optical module 420 to the circulator 430. The fiber optic cables 424 can include one or more fiber optic cables for receiving data from the circulator 430. For example, the fiber optic cables 422 and 424 can include bidirectional optical fibers or pairs of unidirectional TX/RX optical fibers. The circulator 430 can reduce the number of fiber optic cables (e.g., from two pairs to a single pair of fiber optic cables 432) by converting unidirectional optical fibers to bidirectional optical fibers. This matches a single OCS port 445 of the OCS switch 440, which typically accommodates a pair of light paths (two fibers) that are switched together. In some implementations, the circulator 430 can be integrated into the optical module 420 or omitted from the optical link 400.

FIGS. 5-7 show how a 4x4x4 building block is formed using multiple compute trays. Similar techniques can be used to form building blocks of other sizes and shapes.

FIG. 5 shows a logical compute tray 500 for forming a 4x4x4 building block. The base hardware block of a 4x4x4 building block is a single compute tray 500 that has a 2x2x1 topology. In this example, the compute tray 500 has two compute nodes along the x-dimension, two compute nodes along the y-dimension, and one compute node along the z-dimension. For example, the compute nodes 501 and 502 form one x-dimension segment, and the compute nodes 503 and 504 form another x-dimension segment. Similarly, the compute nodes 501 and 503 form one y-dimension segment, and the compute nodes 502 and 504 form another y-dimension segment.

Each compute node 501-504 is connected to two other compute nodes using internal links 510, e.g., copper cables or traces on a printed circuit board. Each compute node is also connected to four external ports. The compute node 501 is connected to external ports 521. Similarly, the compute node 502 is connected to external ports 522, the compute node 503 is connected to external ports 523, and the compute node 504 is connected to external ports 524. As described above, the external ports 521-524 can be OSFP ports or other ports that connect the compute nodes to OCS switches. These ports can accommodate either copper cables or fiber optic modules attached to fiber optic cables.

The external ports 521-524 of each compute node 501-504 include one x-dimension port, one y-dimension port, and two z-dimension ports. Only one external port is needed for each of the x-dimension and the y-dimension because each compute node 501-504 is already connected to another compute node along the x-dimension and along the y-dimension by the internal links 510. Having two z-dimension external ports allows each compute node 501-504 to connect to two compute nodes along the z-dimension.

FIG. 6 shows an example sub-block 600 of a building block, with one dimension (the z-dimension) omitted. Specifically, the sub-block 600 is a 4x4x1 block formed from compute trays arranged in a 2x2 configuration, e.g., compute trays like the compute tray 500 of FIG. 5. The sub-block 600 includes four compute trays 620A-620D in a 2x2 configuration. Each compute tray 620A-620D can be the same as the compute tray 500 of FIG. 5, with four compute nodes 622 in a 2x2x1 configuration.

The compute nodes 622 of the compute trays 620A-620D may be connected via internal links 631-634, e.g., copper cables. For example, two compute nodes 622 of the compute tray 620A are connected along the y-dimension to two compute nodes 622 of the compute tray 620B via the internal links 632.

In addition, two compute nodes 622 of each compute tray 620A-620D are connected to external links 640 along the x-dimension. Similarly, two compute nodes of each compute tray 620A-620D are connected to external links 641 along the y-dimension. Specifically, the compute node at the end of each x-dimension segment is connected to an external link 640, and the compute node at the end of each y-dimension segment is connected to an external link 641. These external links 640 and 641 may be fiber optic cables that connect the compute nodes, and therefore the building block that contains them, to OCS switches, e.g., via the optical link 400 of FIG. 4.

A 4×4×4 building block may be formed by connecting four of the sub-blocks 600 to one another along the z-dimension. For example, the compute nodes 622 of each compute tray 620A-620D can be connected via internal links to one or two compute nodes of the corresponding compute tray on another sub-block 600 arranged along the z-dimension. Like the compute nodes at the ends of the x-dimension and y-dimension segments, the compute nodes at the ends of each z-dimension segment can include an external link 640 connected to an OCS switch.

FIG. 7 is a diagram of an exemplary building block 700. The building block 700 includes four sub-blocks 710A-710D connected along the z-dimension. Each sub-block 710A-710D may be the same as or similar to the sub-block 600 of FIG. 6. FIG. 7 shows some of the connections between the sub-blocks 710A-710D along the z-dimension.

Specifically, the building block 700 includes internal links 730-733 along the z-dimension between corresponding compute nodes 716 of the compute trays 715 of the sub-blocks 710A-710D. For example, the internal links 730 connect the segment of compute nodes 0 along the z-dimension. Similarly, the internal links 731 connect the segment of compute nodes 1 along the z-dimension, the internal links 732 connect the segment of compute nodes 8 along the z-dimension, and the internal links 733 connect the segment of compute nodes 9 along the z-dimension. Although not shown, similar internal links connect the segments of compute nodes 2-7 and A-F.

The building block 700 also includes external links 720 at the ends of each segment along the z-dimension. Although the figure shows the external links 720 only for the segments of compute nodes 0, 1, 8, and 9, each of the other segments, i.e., those of compute nodes 2-7 and A-F, also includes external links 720. Like the external links at the ends of the x-dimension and y-dimension segments, these external links can connect the segments to OCS switches.

FIG. 8 is a diagram of an OCS fabric topology 800 for a superpod. In this example, the OCS fabric topology includes a separate OCS switch for each segment along each dimension of the 4×4×4 building blocks of a superpod that includes 64 building blocks 805, i.e., building blocks 0-63. A 4×4×4 building block 805 includes 16 segments along the x-dimension, 16 segments along the y-dimension, and 16 segments along the z-dimension. In this example, the OCS fabric topology includes 16 x-dimension OCS switches, 16 y-dimension OCS switches, and 16 z-dimension OCS switches, for a total of 48 OCS switches that can be configured to form various workload clusters.
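
The segment and switch counts stated above can be checked with a short sketch. It is illustrative only and not part of the described system: the function name `segments` and the coordinate representation are assumptions, and a segment is taken to be the line of compute nodes obtained by fixing every coordinate except the one along the chosen dimension.

```python
from itertools import product


def segments(shape, axis):
    """Return the segments of a block along one axis (hypothetical helper).

    A segment is the line of compute nodes obtained by fixing every
    coordinate except the one along `axis`.
    """
    other_axes = [range(n) for i, n in enumerate(shape) if i != axis]
    result = []
    for fixed in product(*other_axes):
        line = []
        for pos in range(shape[axis]):
            coord = list(fixed)
            coord.insert(axis, pos)  # put the varying coordinate back in place
            line.append(tuple(coord))
        result.append(line)
    return result


BLOCK_SHAPE = (4, 4, 4)  # x, y, z extent of one building block

# Number of segments along x, y, and z respectively.
per_dimension = [len(segments(BLOCK_SHAPE, axis)) for axis in range(3)]

# One OCS switch per segment per dimension, as in the example of FIG. 8.
total_switches = sum(per_dimension)
```

Under these assumptions, `per_dimension` evaluates to 16 for each dimension and `total_switches` to 48, matching the figures in the text.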

For the x-dimension, the OCS fabric topology 800 includes 16 OCS switches, including the OCS switch 810. Each building block 805 includes, for each segment along the x-dimension, an external input link 811 and an external output link 812 connected to the OCS switch 810 for that segment. These external links 811 and 812 may be the same as or similar to the optical link 400 of FIG. 4.

For the y-dimension, the OCS fabric topology 800 includes 16 OCS switches, including the OCS switch 820. Each building block 805 includes, for each segment along the y-dimension, an external input link 821 and an external output link 822 connected to the OCS switch 820 for that segment. These external links 821 and 822 may be the same as or similar to the optical link 400 of FIG. 4.

For the z-dimension, the OCS fabric topology 800 includes 16 OCS switches, including the OCS switch 830. Each building block 805 includes, for each segment along the z-dimension, an external input link and an external output link connected to the OCS switch 830 for that segment. These external links may be the same as or similar to the optical link 400 of FIG. 4.

In other examples, multiple segments can share the same OCS switch, e.g., depending on the radix of the OCS switches and/or the number of building blocks in the superpod. For example, if one OCS switch has enough ports for all x-dimension segments of all building blocks in the superpod, all of the x-dimension segments can be connected to that OCS switch. In another example, two segments of each dimension can share an OCS switch if the OCS switch has enough ports. In either case, connecting the segments of all building blocks of the superpod to the same OCS switch allows data communication between the compute nodes of those segments using a single routing table. In addition, using a separate OCS switch for each segment or each dimension can simplify troubleshooting and diagnostics. For example, if there is a problem with data communication across a particular segment or dimension, it is easier to identify the potentially faulty OCS switch than it would be if multiple OCS switches were used for that segment or dimension.

FIG. 9 is a diagram of components of an exemplary superpod 900. For example, the superpod 900 may be one of the superpods of the processing system 130 of FIG. 1. The exemplary superpod 900 includes 64 4×4×4 building blocks 960 that can be used to form workload clusters for executing computational workloads, e.g., machine learning workloads. As described above, each 4×4×4 building block 960 includes 64 compute nodes, with four compute nodes arranged along each of the three dimensions. For example, the building blocks 960 may be the same as or similar to the building block 310, the workload cluster 320, or the building block 700 described above.

The exemplary superpod 900 includes an optical network 970. The optical network 970 includes 48 OCS switches 930, 940, and 950 that are connected to the building blocks via the 96 external links 931, 932, and 933 of each building block 960. Each external link may be a fiber optic link that is the same as or similar to the optical link 400 of FIG. 4.

Like the OCS fabric topology 800 of FIG. 8, the optical network 970 includes an OCS switch for each segment of each dimension of each building block. For the x-dimension, the optical network 970 includes 16 OCS switches 950, one for each segment along the x-dimension. The optical network 970 also includes, for each building block 960, an input external link and an output external link for each segment of the building block 960 along the x-dimension. These external links connect the compute nodes on the segment to the OCS switch 950 for that segment. Because each building block 960 includes 16 segments along the x-dimension, the optical network 970 includes, for each building block 960, 32 external links 933 (i.e., 16 input links and 16 output links) that connect the x-dimension segments of the building block to the corresponding OCS switches 950.

For the y-dimension, the optical network 970 includes 16 OCS switches 930, one for each segment along the y-dimension. The optical network 970 also includes, for each building block 960, an input external link and an output external link for each segment of the building block 960 along the y-dimension. These external links connect the compute nodes on the segment to the OCS switch 930 for that segment. Because each building block 960 includes 16 segments along the y-dimension, the optical network 970 includes, for each building block 960, 32 external links 931 (i.e., 16 input links and 16 output links) that connect the y-dimension segments of the building block to the corresponding OCS switches 930.

For the z-dimension, the optical network 970 includes 16 OCS switches 940, one for each segment along the z-dimension. The optical network 970 also includes, for each building block 960, an input external link and an output external link for each segment of the building block 960 along the z-dimension. These external links connect the compute nodes on the segment to the OCS switch 940 for that segment. Because each building block 960 includes 16 segments along the z-dimension, the optical network 970 includes, for each building block 960, 32 external links 932 (i.e., 16 input links and 16 output links) that connect the z-dimension segments of the building block to the corresponding OCS switches 940.
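
The link counts above follow from simple arithmetic, reproduced in the sketch below for illustration. The per-switch and superpod-wide totals at the end are derived here and are not stated in the text; they assume exactly one OCS switch per segment and one switch port per external link.

```python
SEGMENTS_PER_DIMENSION = 16   # segments per dimension in a 4x4x4 building block
DIMENSIONS = ("x", "y", "z")
LINKS_PER_SEGMENT = 2         # one input link and one output link
BUILDING_BLOCKS = 64          # building blocks in the exemplary superpod

# Figures stated in the text.
links_per_dimension = SEGMENTS_PER_DIMENSION * LINKS_PER_SEGMENT  # 32
links_per_block = links_per_dimension * len(DIMENSIONS)           # 96

# Derived figures (assumptions, not from the source): each switch serves the
# same segment of every building block, with one port per external link.
ports_per_switch = BUILDING_BLOCKS * LINKS_PER_SEGMENT
total_external_links = BUILDING_BLOCKS * links_per_block
```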

The workload scheduler 910 can receive request data that includes a workload and data specifying a requested cluster of building blocks 960 for executing the workload. The request data may also include a priority for the workload. The priority can be expressed as a level, e.g., high, medium, or low, or numerically, e.g., in the range of 1-100 or another appropriate range. For example, the workload scheduler 910 can receive the request data from a user device or a cell scheduler, e.g., the user device 110 or the cell scheduler 140 of FIG. 1. As described above, the request data can specify an n-dimensional target configuration of compute nodes, e.g., a target configuration of building blocks that include the compute nodes.

The workload scheduler 910 can select a set of building blocks 960 for generating a workload cluster that matches the target configuration specified by the request data. For example, the workload scheduler 910 can identify the set of building blocks in the superpod 900 that are available and healthy. As described above, an available and healthy building block is one that is not executing another workload, is not part of an active workload cluster, and has not failed.

For example, the workload scheduler 910 can store and update state data, e.g., in the form of a database, that indicates the state of each building block 960 in the superpod. The availability state of a building block 960 can indicate whether the building block 960 is assigned to a workload cluster. The health state of a building block 960 can indicate whether the building block is working or has failed. The workload scheduler 910 can identify the building blocks 960 whose availability state indicates that they are not assigned to a workload and whose health state indicates that they are working. When a building block 960 is assigned to a workload, e.g., used to generate a workload cluster for executing the workload, or when its health state changes from working to failed or vice versa, the workload scheduler can update the state data for the building block 960 accordingly.
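
One possible in-memory form of this state data is sketched below for illustration. The names `Health`, `BlockState`, and `usable_blocks`, and the choice of a dictionary rather than a database, are assumptions made for the example and do not come from the described system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Health(Enum):
    WORKING = "working"
    FAILED = "failed"


@dataclass
class BlockState:
    """State record for one building block (hypothetical structure)."""
    block_id: int
    assigned_workload: Optional[str] = None  # None means not assigned
    health: Health = Health.WORKING

    @property
    def available(self) -> bool:
        return self.assigned_workload is None


def usable_blocks(states):
    """Return identifiers of blocks that are both available and healthy."""
    return [
        s.block_id
        for s in states.values()
        if s.available and s.health is Health.WORKING
    ]


# Example: 64 building blocks, one assigned to a workload, one failed.
states = {i: BlockState(i) for i in range(64)}
states[3].assigned_workload = "job-a"
states[7].health = Health.FAILED
candidates = usable_blocks(states)
```

In this example, `candidates` contains every identifier from 0 to 63 except 3 and 7.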

From the identified building blocks 960, the workload scheduler 910 can select a quantity of building blocks 960 that matches the quantity defined by the target configuration. If the request data specifies one or more types of compute nodes, the workload scheduler 910 can select, from the identified building blocks 960, building blocks that include the requested types of compute nodes. For example, if the request data specifies a 2×2 arrangement of building blocks with two TPU building blocks and two GPU building blocks, the workload scheduler 910 can select two available and healthy TPU building blocks and two available and healthy GPU building blocks.

The workload scheduler 910 can also select building blocks 960 based on the priority of each workload currently running in the superpod and the priority of the workload included in the request data. If the superpod 900 does not have enough available and healthy building blocks to generate the workload cluster for the requested workload, the workload scheduler 910 can determine whether any workload with a lower priority than the requested workload is being executed in the superpod 900. If so, the workload scheduler 910 can reassign building blocks from the workload cluster(s) of one or more lower priority workloads to the workload cluster for the requested workload. For example, the workload scheduler 910 can free up building blocks for a higher priority workload by terminating a lower priority workload, delaying a lower priority workload, or reducing the size of the workload cluster for a lower priority workload.
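
The priority-based selection described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the scheduler's actual policy: the function name `blocks_to_reclaim` is hypothetical, a larger number is assumed to mean a higher priority, lower priority workloads are considered in ascending priority order, and only as many blocks as are still needed are taken from a preempted workload (which corresponds to shrinking or terminating its cluster).

```python
def blocks_to_reclaim(needed, free_blocks, running, new_priority):
    """Pick blocks for a new workload, preempting lower-priority work if needed.

    running: list of (priority, workload_name, block_ids) tuples.
    Returns (chosen_block_ids, preempted_workload_names), or None when the
    request cannot be satisfied even after preemption.
    """
    chosen = list(free_blocks[:needed])
    preempted = []
    # Consider the lowest-priority running workloads first.
    for priority, name, blocks in sorted(running, key=lambda w: w[0]):
        if len(chosen) >= needed:
            break
        if priority >= new_priority:
            break  # remaining workloads are not lower priority
        preempted.append(name)
        chosen.extend(blocks[: needed - len(chosen)])
    if len(chosen) < needed:
        return None
    return chosen, preempted


running = [(50, "b", [4, 5]), (10, "a", [2, 3]), (90, "c", [6, 7])]
result = blocks_to_reclaim(3, [0], running, new_priority=60)
```

Here the new workload needs three building blocks but only block 0 is free, so `result` is `([0, 2, 3], ["a"])`: the lowest priority workload gives up its two blocks, and the workloads with priorities 50 and 90 are left untouched.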

The workload scheduler 910 can reassign a building block from one workload cluster to another simply by reconfiguring the optical network, e.g., by configuring the OCS switches as described below. The building block is then connected to the building blocks of the higher priority workload rather than to those of the lower priority workload. Similarly, if a building block of a higher priority workload fails, the workload scheduler 910 can reassign a building block from the workload cluster of a lower priority workload to the workload cluster of the higher priority workload by reconfiguring the optical network.

The workload scheduler 910 can generate per-job configuration data 912 and provide it to the OCS manager 920 of the superpod 900. The per-job configuration data 912 can specify the building blocks 960 selected for the workload and the arrangement of those building blocks. For example, a 2×2 arrangement includes four spots in which building blocks are placed. The per-job configuration data can specify which of the selected building blocks 960 is placed in each of the four spots.

The per-job configuration data 912 can identify the selected building blocks 960 using a logical identifier for each building block. For example, each building block 960 can have a unique logical identifier. In a particular example, the 64 building blocks 960 can be numbered 0-63, and these numbers can serve as the unique logical identifiers.
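
For illustration, per-job configuration data of this kind could be represented as a mapping from each spot in the arrangement to a logical building block identifier. The names `JobConfig` and `make_job_config` and the field layout are assumptions made for this sketch; the source does not define a concrete format.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JobConfig:
    """Hypothetical per-job configuration record."""
    shape: tuple     # arrangement in units of building blocks, e.g. (2, 2)
    placement: dict  # spot coordinates -> logical building block identifier


def make_job_config(shape, block_ids):
    """Assign the selected building blocks to the spots of the arrangement."""
    spots = list(product(*(range(n) for n in shape)))
    if len(spots) != len(block_ids):
        raise ValueError("number of blocks must match number of spots")
    return JobConfig(tuple(shape), dict(zip(spots, block_ids)))


# A 2x2 arrangement built from logical building blocks 5, 9, 12, and 40.
config = make_job_config((2, 2), [5, 9, 12, 40])
```

With this input, spot `(0, 0)` holds block 5, `(0, 1)` holds block 9, `(1, 0)` holds block 12, and `(1, 1)` holds block 40.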

The OCS manager 920 uses the per-job configuration data 912 to configure the OCS switches 930, 940, and/or 950 and thereby generate a workload cluster that matches the arrangement specified by the per-job configuration data. Each OCS switch 930, 940, and 950 includes a routing table that is used to route data between the physical ports of the OCS switch. For example, assume that the output external link of an x-dimension segment of a first building block is to be connected to the input external link of the corresponding x-dimension segment of a second building block. In this case, the routing table of the OCS switch 950 for that x-dimension segment indicates that data is to be routed between the physical ports of the OCS switch to which these two links are connected.

The OCS manager 920 can store port data that maps each port of each OCS switch 930, 940, and 950 to a logical port of a building block. For each x-dimension segment of a building block, the port data can specify which physical port of the OCS switch 950 the external input link is connected to and which physical port of the OCS switch 950 the external output link is connected to. The port data can include similar data for each dimension of each building block 960 of the superpod 900.

The OCS manager 920 can use the port data to configure the routing tables of the OCS switches 930, 940, and/or 950 and thereby generate the workload cluster for the workload. For example, assume that a first building block is to be connected to a second building block in a 2×1 arrangement in which the first building block is to the left of the second building block along the x-dimension. The OCS manager 920 updates the routing tables of the x-dimension OCS switches 950 to route data between the x-dimension segments of the first building block and the x-dimension segments of the second building block. Because every x-dimension segment of the building blocks needs to be connected, the OCS manager 920 can update the routing table of each OCS switch 950.

The OCS manager 920 can update the routing table of the OCS switch 950 for each x-dimension segment. Specifically, the OCS manager 920 can update the routing table to map the physical ports of the OCS switch 950 to which the segment of the first building block is connected to the physical ports of the OCS switch to which the segment of the second building block is connected. Because each x-dimension segment includes an input link and an output link, the OCS manager 920 can update the routing table so that the input link of the first building block is connected to the output link of the second building block and the output link of the first building block is connected to the input link of the second building block.
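
The routing table update described above can be sketched as a cross-connect of two port pairs on each x-dimension switch. The sketch is illustrative only: the dictionary layouts for the routing tables and the port data, the function names, and the port numbering in the example (building block b wired to ports 2b and 2b+1 of every switch) are all assumptions, not details taken from the source.

```python
def connect_blocks(routing_table, port_data, first, second):
    """Cross-connect two building blocks on one segment's OCS switch.

    port_data maps a building block identifier to the switch's physical
    ports: {"in": port wired to the block's external input link,
            "out": port wired to the block's external output link}.
    routing_table maps a source physical port to a destination physical port.
    """
    a, b = port_data[first], port_data[second]
    routing_table[a["out"]] = b["in"]  # first block's output feeds second's input
    routing_table[b["out"]] = a["in"]  # second block's output feeds first's input
    return routing_table


def connect_along_dimension(switch_tables, switch_port_data, first, second):
    """Apply the cross-connect on every switch of one dimension."""
    for segment, table in switch_tables.items():
        connect_blocks(table, switch_port_data[segment], first, second)


# Example with 16 x-dimension switches and a hypothetical port numbering.
tables = {segment: {} for segment in range(16)}
ports = {
    segment: {b: {"in": 2 * b, "out": 2 * b + 1} for b in range(64)}
    for segment in range(16)
}
connect_along_dimension(tables, ports, first=0, second=1)
```

After the call, every switch's table is `{1: 2, 3: 0}`: port 1 (output of block 0) is routed to port 2 (input of block 1), and port 3 (output of block 1) is routed to port 0 (input of block 0).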

The OCS manager 920 can update the routing tables in several ways. In one example, the OCS manager 920 obtains the current routing table from each OCS switch, updates the appropriate routing tables, and sends the updated routing tables to the appropriate OCS switches. In another example, the OCS manager 920 sends update data specifying the updates to the OCS switches, and the OCS switches update their routing tables according to the update data.

Once the OCS switches have been configured with the updated routing tables, the workload cluster has been generated. The workload scheduler 910 can then cause the compute nodes in the workload cluster to execute the workload. For example, the workload scheduler 910 can provide the workload to the compute nodes in the workload cluster for execution.

After the workload has finished executing, the workload scheduler 910 can update the state of each building block that was used to generate the workload cluster back to available. The workload scheduler 910 can also instruct the OCS manager 920 to remove the connections between the building blocks that were used to generate the workload cluster. The OCS manager 920 can then update the routing tables to remove the mappings between the physical ports of the OCS switches that were used to route data between those building blocks.
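
Removing the mappings for a finished workload might look like the following sketch for a single switch. As before, the routing table and port data layouts and the name `release_blocks` are assumptions chosen for illustration.

```python
def release_blocks(routing_table, port_data, block_ids):
    """Remove every routing entry that touches a port of the given blocks.

    routing_table maps a source physical port to a destination physical port.
    port_data maps a building block identifier to {"in": port, "out": port}.
    """
    released_ports = set()
    for block_id in block_ids:
        released_ports.update(port_data[block_id].values())
    # Collect the keys first so the dictionary is not modified while iterating.
    stale = [
        src
        for src, dst in routing_table.items()
        if src in released_ports or dst in released_ports
    ]
    for src in stale:
        del routing_table[src]
    return routing_table


# Blocks 0 and 1 form one cluster, blocks 2 and 3 another (hypothetical ports).
port_data = {b: {"in": 2 * b, "out": 2 * b + 1} for b in range(4)}
table = {1: 2, 3: 0, 5: 6, 7: 4}
release_blocks(table, port_data, [0, 1])
```

After the call, `table` is `{5: 6, 7: 4}`: the mappings for blocks 0 and 1 are gone, and the cluster formed by blocks 2 and 3 is unaffected.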

By using the OCS switches to configure the optical fabric topology and generate workload clusters in this way, the superpod can host multiple workloads dynamically and securely. The workload scheduler 910 can generate a workload cluster on the fly when a new workload is received and release the workload cluster as soon as the workload has been processed. The routing between segments provided by the OCS switches offers better security between different workloads running on the same superpod than conventional supercomputers do. For example, the OCS switches physically separate the workloads from one another, with an air gap between them. Conventional supercomputers isolate workloads in software, which is more susceptible to information leaks.

FIG. 10 is a flow diagram of an exemplary process 1000 for generating a workload cluster and executing a computational workload using the workload cluster. The operations of the process 1000 may be performed by a system that includes one or more data processing apparatus. For example, the operations of the process 1000 may be performed by the processing system 130 of FIG. 1.

The system receives request data specifying a requested cluster of compute nodes (1010). The request data may be received from a user device. The request data can include a computational workload and data specifying an n-dimensional target configuration of compute nodes. For example, the request data can specify an n-dimensional target arrangement of building blocks that include the compute nodes.

In some implementations, the request data can also specify the types of compute nodes for the building blocks. A superpod can include building blocks with different types of compute nodes. For example, a superpod can include 90 building blocks that each include a 4×4×4 configuration of TPUs and 10 special-purpose building blocks that each include a 2×1 configuration of special-purpose compute nodes. The request data can specify the number of building blocks of each type of compute node and the arrangement of these building blocks.

The system selects, from a superpod that includes a set of building blocks, a subset of the building blocks for generating the requested cluster (1020). As described above, the superpod can include a set of building blocks that each include a three-dimensional configuration of compute nodes, e.g., a 4×4×4 configuration. The system can select a quantity of building blocks that matches the quantity defined by the target configuration. As described above, the system can select building blocks that are healthy and available for the requested cluster.

The subset of building blocks may be a proper subset of the building blocks. A proper subset is a subset that does not include all members of the set. For example, fewer than all of the building blocks may be needed to generate a workload cluster that matches the target configuration of compute nodes.

The system generates a workload cluster that includes the compute nodes of the selected subset of building blocks (1030). The workload cluster can include the building blocks in a configuration that matches the target configuration specified by the request data. For example, if the request data specifies a 4x8x4 configuration of compute nodes, the workload cluster can include two building blocks arranged like the workload cluster 330 of FIG. 3.
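The arithmetic behind the 4x8x4 example can be sketched as follows. The function name `block_grid` and the requirement that every target dimension be an exact multiple of the building block dimension are assumptions made for this illustration.

```python
from math import prod


def block_grid(target, block_shape=(4, 4, 4)):
    """Return how many building blocks are needed along each dimension."""
    if len(target) != len(block_shape):
        raise ValueError("target and block shape must have the same rank")
    grid = []
    for t, b in zip(target, block_shape):
        if t % b != 0:
            raise ValueError(f"{t} is not a multiple of block size {b}")
        grid.append(t // b)
    return tuple(grid)


grid = block_grid((4, 8, 4))
print(grid, prod(grid))  # (1, 2, 1) 2
```

A 4x8x4 target built from 4x4x4 building blocks therefore needs two blocks placed side by side along the second dimension.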

To generate the workload cluster, the system can configure routing data for each dimension of the workload cluster. For example, as described above, a superpod can include an optical network that includes one or more OCS switches for each dimension of the building blocks. The routing data for a dimension can include a routing table for each of the one or more OCS switches. As described above with reference to FIG. 9, the routing tables of the OCS switches can be configured to route data between the appropriate segments of compute nodes along each dimension.
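The per-dimension routing data can be sketched in simplified form. The sketch below works at the granularity of whole building blocks rather than individual segments and OCS ports, and it assumes a wrap-around (torus-like) connection along each dimension; both simplifications, and the name `routing_data`, are assumptions for illustration only.

```python
from itertools import product


def routing_data(grid):
    """Map each dimension to {source block coord: destination block coord}."""
    tables = {}
    for dim in range(len(grid)):
        table = {}
        for coord in product(*(range(n) for n in grid)):
            nxt = list(coord)
            # Next block along this dimension, wrapping at the edge (assumed).
            nxt[dim] = (coord[dim] + 1) % grid[dim]
            table[coord] = tuple(nxt)
        tables[dim] = table
    return tables


tables = routing_data((1, 2, 1))  # two blocks along the second dimension
print(tables[1])
# {(0, 0, 0): (0, 1, 0), (0, 1, 0): (0, 0, 0)}
```

Along a dimension that holds only one building block, each block maps to itself in this sketch, which corresponds to wrapping a segment back onto the same block.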

The system causes the compute nodes of the workload cluster to execute the compute workload (1040). For example, the system can provide the compute workload to the compute nodes of the workload cluster. While the compute workload is being executed, the configured OCS switches can route data between the building blocks of the workload cluster. The configured OCS switches can route data between the compute nodes of the building blocks as if the compute nodes were physically connected in the target configuration, even though they are not physically connected in that configuration.

For example, the compute nodes of each segment of a dimension can communicate data, through an OCS switch, to the other compute nodes of that segment that are in different building blocks, as if the compute nodes of the segment were physically connected in a single physical segment. A workload cluster in this configuration differs from a packet-switched network because it can provide a true end-to-end optical path with no packet switching or buffering along the way. With packet switching, latency increases because a switch has to receive each packet, buffer it, and transmit it again on another port.

After execution of the compute workload has finished, the system can release the building blocks for other workloads, e.g., by updating the state of the building blocks to an available state and updating the data routing so that data is no longer routed between the building blocks of the workload cluster.
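Releasing building blocks can be sketched as two updates: a state change and the removal of the OCS mappings that touch the released blocks. The dictionaries `state`, `ports`, and `table`, and the state strings, are hypothetical structures chosen for this illustration.

```python
def release_blocks(block_ids, state, ports, routing_table):
    """Mark blocks available and drop OCS mappings that touch their ports."""
    released_ports = set()
    for block_id in block_ids:
        state[block_id] = "available"
        released_ports |= ports[block_id]
    # Keep only mappings whose two ports both belong to other blocks.
    return {
        src: dst
        for src, dst in routing_table.items()
        if src not in released_ports and dst not in released_ports
    }


state = {"A": "in_use", "B": "in_use", "C": "in_use"}
ports = {"A": {1, 2}, "B": {3, 4}, "C": {5, 6}}  # OCS ports wired to each block
table = {2: 3, 4: 1, 6: 5}                       # OCS port-to-port mappings
table = release_blocks(["A", "B"], state, ports, table)
print(state)  # {'A': 'available', 'B': 'available', 'C': 'in_use'}
print(table)  # {6: 5}
```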

FIG. 11 is a flow diagram that illustrates an example process 1100 for reconfiguring an optical network to replace a failed building block. The operations of the process 1100 can be performed by a system that includes one or more data processing apparatus. For example, the operations of the process 1100 can be performed by the processing system 130 of FIG. 1.

The system causes the compute nodes of a workload cluster to execute a compute workload (1110). For example, the system can generate the workload cluster and cause the compute nodes to execute the compute workload in accordance with the process 1000 of FIG. 10.

The system receives data indicating that a building block of the workload cluster has failed (1120). For example, if one or more compute nodes of a building block fail, another component, e.g., a monitoring component, can determine that the building block has failed and send data indicating the failure to the system.

The system identifies an available building block (1130). For example, the system can identify an available and healthy building block in the same superpod as the other building blocks of the workload cluster. The system can identify the available and healthy building block based on, e.g., state data for the building blocks that is stored by the system.

The system replaces the failed building block with the identified available building block (1140). The system can update the data routing of one or more OCS switches of the optical network that connects the building blocks so that the identified available building block takes the place of the failed building block. For example, the system can update the routing tables of one or more OCS switches to remove the connections between the failed building block and the other building blocks of the workload cluster. The system can also update the routing tables of one or more OCS switches to connect the identified building block to the other building blocks of the workload cluster.

The system can logically place the identified building block in the logical spot of the failed building block. As described above, the routing table of an OCS switch can map the physical port of the OCS switch that is connected to a segment of one building block to the physical port of the OCS switch that is connected to the corresponding segment of another building block. In this case, the system can make the replacement by updating the mapping so that it refers to the corresponding segments of the identified available building block rather than to those of the failed building block.

For example, assume that the input external link for a particular x-dimension segment of the failed building block is connected to a first port of an OCS switch and that the input external link for the corresponding x-dimension segment of the identified available building block is connected to a second port of the OCS switch. Assume further that the routing table maps the first port to a third port of the OCS switch, which is connected to the corresponding x-dimension segment of another building block. To make the replacement, the system can update the mapping in the routing table so that the second port, rather than the first port, is mapped to the third port. The system can do the same for each segment of the failed building block.
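The port remapping in this example can be sketched in a few lines of Python. Representing the routing table as a dictionary from one port number to another, and the function name `replace_port`, are assumptions for illustration; the specification does not define a concrete table format.

```python
def replace_port(routing_table, failed_port, spare_port):
    """Return a new table in which `spare_port` takes over `failed_port`."""

    def swap(port):
        return spare_port if port == failed_port else port

    # Rewrite both sides of every mapping so that the failed port is
    # replaced whether it appears as a source or as a destination.
    return {swap(src): swap(dst) for src, dst in routing_table.items()}


table = {1: 3}  # first port (failed block) mapped to third port
table = replace_port(table, failed_port=1, spare_port=2)
print(table)    # {2: 3}
```

Applying the same function once per segment of the failed building block would cover every mapping that referred to it.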

Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, a system on a chip, or combinations of the foregoing. The apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing infrastructures, and grid computing infrastructures.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a USB flash drive). Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer that has a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback, and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, e.g., by sending web pages to a web browser on the user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server; or that includes a middleware component, e.g., an application server; or that includes a front-end component, e.g., a client computer that has a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs that run on the respective computers and have a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to, and receiving user input from, a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.

Claims

A method performed by one or more data processing apparatus, the method comprising:
receiving request data that specifies compute nodes requested for executing a compute workload, the request data specifying an n-dimensional target configuration of the compute nodes, where n is greater than or equal to two;
selecting, from a superpod that includes a set of building blocks that each include compute nodes in an m-dimensional configuration, where m is greater than or equal to two, a subset of the building blocks that, when combined, match the n-dimensional target configuration specified by the request data, wherein the set of building blocks is connected to an optical network that includes one or more optical circuit switches for each of the n dimensions;
generating a workload cluster of compute nodes that includes the subset of the building blocks,
the generating comprising:
configuring, for each dimension of the workload cluster, respective routing data for the one or more optical circuit switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the compute workload is routed between the compute nodes along the dimension of the workload cluster; and
causing the compute nodes of the workload cluster to execute the compute workload.

The method of claim 1, wherein the request data specifies different types of compute nodes, and
wherein selecting the subset of the building blocks comprises selecting, for each type of compute node specified by the request data, a building block that includes one or more compute nodes of the specified type.

The method of claim 1, wherein the routing data for each dimension of the superpod comprises an optical circuit switch routing table for one of the one or more optical circuit switches.

The method of claim 1, wherein the optical network comprises, for each of the n dimensions, one or more optical circuit switches of the optical network that route data between computing nodes along the dimension.

The method of claim 4, wherein each building block includes multiple segments of compute nodes along each dimension of the building block, and
wherein the optical network includes, for each segment of each dimension, an optical circuit switch of the optical network that routes data between the corresponding segments of compute nodes of each building block in the workload cluster.

The method of claim 1, wherein each building block comprises one of a three-dimensional torus of compute nodes or a mesh of compute nodes.

The method of claim 1, wherein the superpod includes multiple workload clusters, and
wherein each workload cluster includes a different subset of the building blocks and executes a different workload than each other workload cluster.

The method of claim 1, further comprising:
receiving data indicating that a particular building block of the workload cluster has failed; and
replacing the particular building block with an available building block.

The method of claim 8, wherein replacing the particular building block with an available building block comprises:
updating the data routing of one or more optical circuit switches of the optical network to stop routing data between the particular building block and one or more other building blocks of the workload cluster; and
updating the data routing of the one or more optical circuit switches of the optical network to route data between the available building block and the one or more other building blocks of the workload cluster.

The method of claim 1, wherein selecting the subset of the building blocks that, when combined, match the n-dimensional target configuration specified by the request data comprises:
determining that the n-dimensional configuration specified by the request data requires a first quantity of building blocks that exceeds a second quantity of available and healthy building blocks in the superpod; and
in response to determining that the n-dimensional configuration specified by the request data requires the first quantity of building blocks that exceeds the second quantity of available and healthy building blocks in the superpod:
identifying one or more second compute workloads that have a lower priority than the compute workload and that are being executed by other building blocks of the superpod; and
reassigning one or more building blocks of the one or more second compute workloads to the workload cluster for the compute workload,
wherein generating the workload cluster of compute nodes that includes the subset of the building blocks comprises including the one or more building blocks of the one or more second compute workloads in the subset of the building blocks.

The method of claim 10, wherein generating the workload cluster of compute nodes that includes the subset of the building blocks comprises reconfiguring, for each dimension of the workload cluster, the respective routing data for the one or more optical circuit switches for the dimension such that each of the one or more building blocks of the one or more second compute workloads communicates with other building blocks of the workload cluster rather than with building blocks of the one or more second compute workloads.

A system comprising:
a data processing apparatus; and
a computer storage medium encoded with a computer program,
the program comprising data processing apparatus instructions that, when executed by the data processing apparatus, cause the data processing apparatus to perform operations comprising:
receiving request data that specifies compute nodes requested for executing a compute workload, the request data specifying an n-dimensional target configuration of the compute nodes, where n is greater than or equal to two;
selecting, from a superpod that includes a set of building blocks that each include compute nodes in an m-dimensional configuration, where m is greater than or equal to two, a subset of the building blocks that, when combined, match the n-dimensional target configuration specified by the request data, wherein the set of building blocks is connected to an optical network that includes one or more optical circuit switches for each of the n dimensions;
generating a workload cluster of compute nodes that includes the subset of the building blocks,
the generating comprising:
configuring, for each dimension of the workload cluster, respective routing data for the one or more optical circuit switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the compute workload is routed between the compute nodes along the dimension of the workload cluster; and
causing the compute nodes of the workload cluster to execute the compute workload.

The system of claim 12, wherein the request data specifies different types of compute nodes, and
wherein selecting the subset of the building blocks comprises selecting, for each type of compute node specified by the request data, a building block that includes one or more compute nodes of the specified type.

The system of claim 12, wherein the routing data for each dimension of the superpod comprises an optical circuit switch routing table for one of the one or more optical circuit switches.

The system of claim 12, wherein the optical network comprises, for each of the n dimensions, one or more optical circuit switches of the optical network that route data between compute nodes along the dimension.

The system of claim 15, wherein each building block includes multiple segments of compute nodes along each dimension of the building block, and
wherein the optical network includes, for each segment of each dimension, an optical circuit switch of the optical network that routes data between the corresponding segments of compute nodes of each building block in the workload cluster.

The system of claim 12, wherein each building block comprises one of a three-dimensional torus of compute nodes or a mesh of compute nodes.

The system of claim 12, wherein the superpod includes multiple workload clusters, and
wherein each workload cluster includes a different subset of the building blocks and executes a different workload than each other workload cluster.

The system of claim 12, wherein the operations further comprise:
receiving data indicating that a particular building block of the workload cluster has failed; and
replacing the particular building block with an available building block.

A non-transitory computer storage medium encoded with a computer program, the program comprising data processing apparatus instructions that, when executed by a data processing apparatus, cause the data processing apparatus to perform operations comprising:
receiving request data that specifies compute nodes requested for executing a compute workload, the request data specifying an n-dimensional target configuration of the compute nodes, where n is greater than or equal to two;
selecting, from a superpod that includes a set of building blocks that each include compute nodes in an m-dimensional configuration, where m is greater than or equal to two, a subset of the building blocks that, when combined, match the n-dimensional target configuration specified by the request data, wherein the set of building blocks is connected to an optical network that includes one or more optical circuit switches for each of the n dimensions;
generating a workload cluster of compute nodes that includes the subset of the building blocks,
the generating comprising:
configuring, for each dimension of the workload cluster, respective routing data for the one or more optical circuit switches for the dimension, the respective routing data for each dimension of the workload cluster specifying how data of the compute workload is routed between the compute nodes along the dimension of the workload cluster; and
causing the compute nodes of the workload cluster to execute the compute workload.