WO2016122714A1 - Job scheduling in an InfiniBand network based HPC cluster - Google Patents
Job scheduling in an InfiniBand network based HPC cluster
- Publication number
- WO2016122714A1 (PCT/US2015/042690)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- compute nodes
- leaf switch
- switch
- job
- selecting
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/48—Routing tree calculation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/35—Switches specially adapted for specific applications
- H04L49/356—Switches specially adapted for specific applications for storage area networks
- H04L49/358—Infiniband Switches
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/02—Topology update or discovery
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Small-Scale Networks (AREA)
Abstract
One example of job scheduling in a fat tree blocking InfiniBand network includes discovering compute nodes coupled to each leaf switch in the fat tree blocking InfiniBand network based HPC cluster, retrieving routing tables indicating the routes between the compute nodes, and selecting compute nodes for a job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them.
Description
JOB SCHEDULING IN AN INFINIBAND NETWORK BASED HPC CLUSTER
Cross-Reference to Related Applications
[0001] This PCT Patent Application claims the benefit of Indian Patent Application 460/CHE/2015, filed January 30, 2015, which is incorporated by reference herein.
Background
[0002] InfiniBand is an interconnect network that provides high bandwidth and low latency for High Performance Computing (HPC) clusters. As HPC cluster sizes increase, the number of nodes in an HPC cluster often exceeds the number of ports available in any single InfiniBand switch. For such clusters, an InfiniBand network can include multiple interconnected switches, each switch having a relatively small number of ports, while the multiple interconnected switches can support a relatively large number of nodes. One InfiniBand network topology used in HPC clusters is a fat tree topology wherein each node is coupled to a lower level leaf switch, and the leaf switches are interlinked using upper level spine switches. In one such topology, referred to as 1:1 non-blocking, the number of nodes coupled to a leaf switch equals the number of uplinks of the leaf switch to spine switches. There are situations when applications running on an HPC cluster do not take full advantage of the 1:1 bandwidth provided by the network. In such cases, a fat tree blocking topology may be used to reduce the cost of the network by reducing the number of switches used without overly sacrificing performance.
Brief Description of the Drawings
[0003] Figure 1 is a block diagram illustrating one example of a system including a fat tree blocking InfiniBand network.
[0004] Figure 2 is a block diagram illustrating one example of a processing system.
[0005] Figure 3 is a block diagram illustrating one example of allocating compute nodes in a fat tree 2:1 blocking InfiniBand network.
[0006] Figure 4 is a block diagram illustrating one example of allocating compute nodes in a fat tree 4:1 blocking InfiniBand network.
[0007] Figures 5 and 6 are flow diagrams illustrating one example of a method for allocating compute nodes in a fat tree blocking InfiniBand network.
Detailed Description
[0008] In the following detailed description, reference is made to the
accompanying drawings which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosure may be practiced. It is to be understood that other examples may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. It is to be understood that features of the various examples described herein may be combined, in part or whole, with each other, unless specifically noted otherwise.
[0009] InfiniBand networks may use static routing, which is assigned by a subnet manager, such that each compute node may communicate with every other compute node in the network. The static routing is based on a local identifier of each leaf switch, spine switch, and compute node in the network. When static routing is used in a 1 :1 non-blocking InfiniBand network, the compute nodes do not experience bandwidth congestion since the number of
outgoing ports (and uplinks between each leaf switch and spine switch) is equal to the number of connected compute nodes.
[0010] In many cases, not all compute nodes coupled to each leaf switch are used to run jobs on an InfiniBand network. In a 2:1 blocking InfiniBand network, however, even if half the number of compute nodes coupled to each leaf switch are used for a job, 1:1 bandwidth cannot be guaranteed between the compute nodes due to the static routing. This is due to the possibility of some compute nodes using shared uplinks for sending data to their respective destination compute nodes in spite of having an adequate number of uplinks available for delivering full bandwidth across the leaf switches. The same situation may arise for any blocking topology N:1 where "N" is an integer greater than 1 (e.g., 2:1, 4:1, 8:1) for jobs run between multiple groups of compute nodes where the number of compute nodes used for each leaf switch is less than or equal to the number of uplinks for the leaf switch. Accordingly, examples of this disclosure maximize the available bandwidth for a job in a fat tree blocking InfiniBand network by selecting compute nodes for each job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them.
[0011] Figure 1 is a block diagram illustrating one example of a system 100 including a fat tree blocking InfiniBand network. System 100 includes spine switches 102_1-102_X, where "X" is any suitable number of spine switches, leaf switches 104_1-104_Y, where "Y" is any suitable number of leaf switches, a head node 106, and compute nodes 110_1-110_N, 112_1-112_N, ... 114_1-114_N, where "N" is a number of compute nodes up to the maximum number of compute nodes that can be supported by each leaf switch. System 100 may also include a login node (not shown). In one example, the login node is communicatively coupled to leaf switch 104_1.
[0012] Head node 106 is communicatively coupled to leaf switch 104_1 through a downlink 116_1. Each compute node 110_1-110_N is communicatively coupled to leaf switch 104_2 through a respective downlink as indicated at 116_2. Each compute node 112_1-112_N is communicatively coupled to leaf switch 104_3 through a respective downlink as indicated at 116_3. Each compute node 114_1-114_N is communicatively coupled to leaf switch 104_Y through a respective downlink as indicated at 116_Y. While in the example of system 100 there are an equal number "N" of compute nodes communicatively coupled to each leaf switch 104_2-104_Y, in other examples different numbers of compute nodes may be communicatively coupled to each leaf switch. Each leaf switch 104_1-104_Y is communicatively coupled to at least one spine switch 102_1-102_X through respective uplinks as indicated at 118. In other examples, system 100 may include additional levels of spine switches to interconnect spine switches 102_1-102_X, depending on the number of compute nodes and the size (i.e., the number of ports) of each leaf switch and spine switch in the InfiniBand network.
[0013] For example, in a fat tree 2:1 blocking InfiniBand network, if "N" equals 16 such that 16 compute nodes are coupled to each leaf switch 104_2-104_Y, each leaf switch 104_2-104_Y includes eight outgoing ports with four uplinks connected to each of two of spine switches 102_1-102_X. Thus, the number of uplinks equals one half the number of downlinks in a fat tree 2:1 blocking InfiniBand network. In a fat tree 4:1 blocking InfiniBand network, if "N" equals 16 such that 16 compute nodes are coupled to each leaf switch 104_2-104_Y, each leaf switch 104_2-104_Y includes four outgoing ports with all four uplinks coupled to one of spine switches 102_1-102_X. Thus, the number of uplinks equals one fourth the number of downlinks in a fat tree 4:1 blocking InfiniBand network.
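As a quick sanity check on that arithmetic, the following is a minimal sketch (illustration only; the function name is ours, not part of the disclosure) that derives the uplink count for an N:1 blocking leaf switch:

```python
def uplinks_per_leaf_switch(downlinks: int, blocking_ratio: int) -> int:
    """Number of uplinks for an N:1 fat tree blocking leaf switch:
    the uplink count is 1/N of the downlink (compute node) count."""
    if downlinks % blocking_ratio != 0:
        raise ValueError("downlinks must be a multiple of the blocking ratio")
    return downlinks // blocking_ratio

# The example above: 16 compute nodes per leaf switch.
assert uplinks_per_leaf_switch(16, 2) == 8  # 2:1 blocking -> eight uplinks
assert uplinks_per_leaf_switch(16, 4) == 4  # 4:1 blocking -> four uplinks
```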
[0014] Each spine switch 102_1-102_X includes a respective routing table 103_1-103_X and each leaf switch 104_1-104_Y includes a respective routing table 105_1-105_Y. Head node 106 and each compute node 110_1-110_N, 112_1-112_N, ... 114_1-114_N also include a respective routing table (not shown). Each routing table directs communications from a source compute node to a destination compute node through the leaf switches and spine switches based on a local identifier of each spine switch, leaf switch, and compute node.
[0015] Head node 106 includes a job scheduler 120 and a subnet manager 122. Head node 106 may also include network deployment utilities, network management utilities (e.g., parallel shell, health monitoring, and reporting), parallel compilers, and/or a message passing interface (not shown). In other examples, subnet manager 122 may be located on a spine switch, a leaf switch,
or on a compute node rather than on head node 106. Subnet manager 122 configures the routing tables for each spine switch, leaf switch, and compute node to route communications between the compute nodes via the spine switches and leaf switches. Subnet manager 122 configures the routing tables based on the local identifier of each leaf switch, spine switch, and compute node by assigning the outgoing ports for each leaf switch and spine switch such that each source compute node may communicate with each destination compute node. Therefore, the routing table of each leaf switch and spine switch includes entries of outgoing ports to reach each destination local identifier.
[0016] Job scheduler 120 selects the compute nodes for each job to be run on system 100. For jobs using less than all the available compute nodes, job scheduler 120 selects compute nodes for each job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them. If a job is submitted for less than the available compute nodes in system 100, job scheduler 120 selects compute nodes for the job from the available compute nodes by selecting less than all the available compute nodes coupled to each leaf switch. In one example, if possible, job scheduler 120 selects a number of compute nodes coupled to each leaf switch less than or equal to the number of uplinks of the leaf switch.
[0017] For example, for a 64 compute node 2:1 blocking InfiniBand network consisting of four leaf switches, there are 16 compute nodes and eight uplinks per leaf switch. If a job is submitted for 32 nodes, job scheduler 120 may select eight compute nodes per leaf switch and select those compute nodes that have unshared routes between adjacent leaf switches. The number of compute nodes selected for each leaf switch need not be evenly divided. For example, if a job is submitted for 27 compute nodes for a 64 compute node 2:1 blocking InfiniBand network, job scheduler 120 may select seven compute nodes coupled to the first leaf switch, seven compute nodes coupled to the second leaf switch, seven compute nodes coupled to the third leaf switch, and six compute nodes coupled to the fourth leaf switch. Alternatively, job scheduler 120 may select eight compute nodes coupled to the first leaf switch, eight compute nodes coupled to the second leaf switch, eight compute nodes coupled to the third leaf switch, and three compute nodes coupled to the fourth leaf switch.
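The per-leaf-switch counts in this example can be read as a simple capped fill. The sketch below is one plausible way to compute such a split; it is an assumption for illustration (the patent allows any split, and this particular greedy fill reproduces the 8/8/8/3 alternative rather than the 7/7/7/6 one).

```python
def split_job_across_leaf_switches(job_size: int, num_leaf_switches: int,
                                   uplinks_per_leaf: int) -> list[int]:
    """Distribute a job's compute nodes across leaf switches, preferring
    at most `uplinks_per_leaf` nodes per leaf switch so that 1:1 bandwidth
    can be preserved between the per-switch sub-groups.

    Hypothetical helper; overfilling beyond a switch's downlink count is
    not checked here.
    """
    counts = [0] * num_leaf_switches
    remaining = job_size
    # Greedily fill each leaf switch up to the uplink cap.
    for i in range(num_leaf_switches):
        counts[i] = min(uplinks_per_leaf, remaining)
        remaining -= counts[i]
    # If the job needs more nodes than the caps allow, spread the rest.
    i = 0
    while remaining > 0:
        counts[i] += 1
        remaining -= 1
        i = (i + 1) % num_leaf_switches
    return counts

# A 27-node job on a 64-node, four-leaf-switch 2:1 network (eight uplinks each):
print(split_job_across_leaf_switches(27, 4, 8))  # [8, 8, 8, 3]
```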
[0018] Once job scheduler 120 has determined the number of compute nodes coupled to each leaf switch to select, job scheduler 120 selects the particular compute nodes for each leaf switch. The particular compute nodes selected for the first leaf switch, which is designated as acting as the source, may be randomly selected or selected in another suitable manner. The second leaf switch, for which particular compute nodes will be selected, is designated as acting as the destination. Once the particular compute nodes for the second leaf switch acting as destination are selected, the second leaf switch is designated as acting as the source and the third leaf switch is designated as acting as the destination, and the process repeats until all the compute nodes for each leaf switch have been selected for the job.
[0019] The particular compute nodes to be selected for any pair of leaf switches, with one leaf switch acting as the source (i.e., source leaf switch) and the other leaf switch acting as a destination (i.e., destination leaf switch), are selected by job scheduler 120 as indicated in the following Table 1.
Table 1:
For each source leaf switch {
    for a selected destination leaf switch {
        find route to each compute node in destination leaf switch; and
        arrange the destination compute nodes into multiple groups such that compute nodes in each group are reachable using distinct routes.
    }
}
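A minimal Python rendering of the Table 1 grouping step might look like the following sketch. The routing-table representation (a mapping from each destination compute node to the outgoing port, i.e. uplink, the source leaf switch uses to reach it) is an assumption for illustration; in system 100 this information would come from the routing tables configured by subnet manager 122.

```python
from collections import defaultdict

def group_destination_nodes(route_table: dict[str, int]) -> list[list[str]]:
    """Arrange the destination leaf switch's compute nodes into groups so
    that the nodes within a group are reachable from the source leaf switch
    over distinct uplinks (unshared routes), per Table 1.

    route_table: destination compute node -> uplink (outgoing port) used
    by the source leaf switch to reach that node.
    """
    # Bucket destination nodes by the uplink that serves them.
    by_uplink = defaultdict(list)
    for node, uplink in route_table.items():
        by_uplink[uplink].append(node)

    # Group k takes the k-th node of every uplink bucket, so each group
    # uses each uplink at most once (i.e., distinct routes within a group).
    groups, k = [], 0
    while True:
        group = [nodes[k] for nodes in by_uplink.values() if k < len(nodes)]
        if not group:
            return groups
        groups.append(group)
        k += 1
```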
[0020] When job scheduler 120 selects the compute nodes coupled to the destination leaf switch for jobs, the job scheduler first selects compute nodes in one of the groups prior to selecting compute nodes from any other group. If the number of compute nodes for a job is less than or equal to the number of uplinks used, the available bandwidth between subgroups of compute nodes
selected across the leaf switches is maximized. Particular examples of the compute node selection process for a job for 2:1 and 4:1 blocking InfiniBand networks will be described in further detail below with reference to Figures 3 and 4.
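The group-first selection rule of the preceding paragraph could then be expressed as shown below; again, this is a sketch rather than the patent's implementation, reusing the output of the hypothetical `group_destination_nodes` helper above.

```python
def select_destination_nodes(groups: list[list[str]], count: int) -> list[str]:
    """Select `count` compute nodes on the destination leaf switch,
    exhausting one group of unshared-route nodes before drawing from
    the next group."""
    selected = []
    for group in groups:
        for node in group:
            if len(selected) == count:
                return selected
            selected.append(node)
    return selected  # fewer nodes were available than requested
```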
[0021] Figure 2 is a block diagram illustrating one example of a system 200. System 200 may include at least one computing device and may provide head node 106 previously described and illustrated with reference to Figure 1 .
System 200 includes a processor 202 and a machine-readable storage medium 206. Processor 202 is communicatively coupled to machine-readable storage medium 206 through a communication path 204. Although the following description refers to a single processor and a single machine-readable storage medium, the description may also apply to a system with multiple processors and multiple machine-readable storage mediums. In such examples, the instructions may be distributed (e.g., stored) across multiple machine-readable storage mediums and the instructions may be distributed (e.g., executed by) across multiple processors.
[0022] Processor 202 includes one or more Central Processing Units (CPUs), microprocessors, and/or other suitable hardware devices for retrieval and execution of instructions stored in machine-readable storage medium 206. Processor 202 may fetch, decode, and execute instructions 208 to discover compute nodes, instructions 210 to retrieve routing tables, and instructions 212 to select compute nodes for a job in a fat tree blocking InfiniBand network. As an alternative or in addition to retrieving and executing instructions, processor 202 may include one or more electronic circuits comprising a number of electronic components for performing the functionality of one or more of the instructions in machine-readable storage medium 206. With respect to the executable instruction representations (e.g., boxes) described and illustrated herein, it should be understood that part or all of the executable instructions and/or electronic circuits included within one box may, in alternate examples, be included in a different box illustrated in the figures or in a different box not shown.
[0023] Machine-readable storage medium 206 is a non-transitory storage medium and may be any suitable electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 206 may be, for example, Random Access Memory (RAM), an
Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Machine-readable storage medium 206 may be disposed within system 200, as illustrated in Figure 2. In this case, the executable instructions may be installed on system 200. Alternatively, machine- readable storage medium 206 may be a portable, external, or remote storage medium that allows system 200 to download the instructions from the
portable/external/remote storage medium. In this case, the executable
instructions may be part of an installation package.
[0024] Machine-readable storage medium 206 stores instructions to be executed by a processor (e.g., processor 202) including instructions 208 to discover compute nodes, instructions 210 to retrieve routing tables, and instructions 212 to select compute nodes for a job. Processor 202 may execute instructions 208 to discover compute nodes coupled to each leaf switch in a fat tree blocking InfiniBand network, such as compute nodes 110_1-110_N, 112_1-112_N, ... 114_1-114_N previously described and illustrated with reference to Figure 1. Processor 202 may execute instructions 210 to retrieve routing tables indicating the routes between the compute nodes in the InfiniBand network. Processor 202 may execute instructions 212 to select compute nodes for a job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them as previously described with reference to Figure 1.
[0025] Figure 3 is a block diagram illustrating one example of allocating compute nodes in a fat tree 2:1 blocking InfiniBand network 300. The fat tree 2:1 blocking InfiniBand network 300 includes a spine switch 302, a leaf switch 304 acting as a source leaf switch, and a leaf switch 306 acting as a destination leaf switch. Leaf switch 304 is communicatively coupled to spine switch 302 through eight uplinks as indicated at 308. Leaf switch 306 is communicatively coupled to spine switch 302 through eight uplinks as indicated at 310. While one spine
switch and two leaf switches are illustrated in Figure 3, the allocation of compute nodes described below is applicable to any suitable fat tree 2:1 blocking
InfiniBand network including any suitable number of spine switches and leaf switches.
[0026] In the example illustrated in Figure 3, 16 compute nodes (i.e., n1-n16) may be coupled to leaf switch 304 and 16 compute nodes (i.e., n17-n32) may be coupled to leaf switch 306. For each job submitted, the job scheduler will select compute nodes coupled to each leaf switch as previously described with reference to Figure 1. For example, for a job submitted for 16 compute nodes, the job scheduler will select eight compute nodes coupled to leaf switch 304 and eight compute nodes coupled to leaf switch 306. The job scheduler may select the eight compute nodes coupled to leaf switch 304 acting as the initial source leaf switch randomly or by any other suitable process. For example, the job scheduler may select compute nodes n1-n8, n9-n16, the even compute nodes, the odd compute nodes, or any other combination of eight compute nodes.
[0027] Once the job scheduler has selected the compute nodes from leaf switch 304 for the job, the job scheduler uses the routing tables to determine which compute nodes of leaf switch 306 are reachable using unshared routes from leaf switch 304 (i.e., unshared uplinks 308 and 310). In this example, based on the routing tables and as illustrated in Figure 3, the first uplink is used to communicate with compute nodes n17 and n18, the second uplink is used to communicate with compute nodes n19 and n20, the third uplink is used to communicate with compute nodes n21 and n22, the fourth uplink is used to communicate with compute nodes n23 and n24, the fifth uplink is used to communicate with compute nodes n25 and n26, the sixth uplink is used to communicate with compute nodes n27 and n29, the seventh uplink is used to communicate with compute nodes n28 and n30, and the eighth uplink is used to communicate with compute nodes n31 and n32.
[0028] Accordingly, based on this information, the job scheduler groups compute nodes n17, n20, n22, n24, n25, n27, n28, and n32 into a first group as indicated at 312, and groups compute nodes n18, n19, n21, n23, n26, n29, n30, and n31 into a second group as indicated at 314. Each compute node in the first group
312 is reachable from leaf switch 304 using an unshared route and each compute node in the second group 314 is reachable from leaf switch 304 using an unshared route, while the first group 312 has compute nodes that share routes with compute nodes of the second group 314.
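Transcribing the uplink-to-node mapping of paragraph [0027] into the hypothetical route-table representation used in the earlier sketch gives a quick way to reproduce such a grouping. Note that the partition the sketch happens to produce differs from groups 312 and 314 in Figure 3; both are valid, since any partition that takes exactly one node per uplink per group preserves the unshared-route property.

```python
# Destination node -> uplink used from leaf switch 304 (per paragraph [0027]).
figure3_routes = {
    "n17": 1, "n18": 1, "n19": 2, "n20": 2,
    "n21": 3, "n22": 3, "n23": 4, "n24": 4,
    "n25": 5, "n26": 5, "n27": 6, "n29": 6,
    "n28": 7, "n30": 7, "n31": 8, "n32": 8,
}

groups = group_destination_nodes(figure3_routes)
# Two groups of eight, one node per uplink in each group, e.g.:
#   ['n17', 'n19', 'n21', 'n23', 'n25', 'n27', 'n28', 'n31']
#   ['n18', 'n20', 'n22', 'n24', 'n26', 'n29', 'n30', 'n32']
```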
[0029] Once the job scheduler has grouped the compute nodes of leaf switch 306 acting as the destination leaf switch into groups of compute nodes having unshared routes to leaf switch 304 acting as the source leaf switch, the job scheduler selects compute nodes for the job by first selecting compute nodes within a group (i.e., group 312 or 314) prior to selecting any compute nodes from another group. For example, where eight compute nodes coupled to leaf switch 306 are to be selected, the job scheduler will select all of the compute nodes of group 312 or all of the compute nodes of group 314. In another example, where nine compute nodes coupled to leaf switch 306 are to be selected, the job scheduler will select all of the compute nodes of one group and one compute node of the other group (e.g., all compute nodes of group 312 and compute node n18).
[0030] In examples where compute nodes for a job are to be selected from three or more leaf switches, leaf switch 306 becomes the source leaf switch and a third leaf switch acts as the destination leaf switch and the process is repeated for selecting compute nodes for the third leaf switch. The job scheduler selects compute nodes for the third leaf switch that have unshared routes to leaf switch 306. The process repeats to select compute nodes coupled to each leaf switch to be used for the job. In this way, the bandwidth through the fat tree blocking InfiniBand network is maximized for the job.
[0031] Figure 4 is a block diagram illustrating one example of allocating compute nodes in a fat tree 4:1 blocking InfiniBand network 400. The fat tree 4:1 blocking InfiniBand network 400 includes a spine switch 402, a leaf switch 404 acting as a source leaf switch, and a leaf switch 406 acting as a destination leaf switch. Leaf switch 404 is communicatively coupled to spine switch 402 through four uplinks as indicated at 408. Leaf switch 406 is communicatively coupled to spine switch 402 through four uplinks as indicated at 410. While one spine switch and two leaf switches are illustrated in Figure 4, the allocation of compute
nodes described below is applicable to any suitable fat tree 4:1 blocking
InfiniBand network including any suitable number of spine switches and leaf switches.
[0032] In the example illustrated in Figure 4, 16 compute nodes (i.e., n1-n16) may be coupled to leaf switch 404 and 16 compute nodes (i.e., n17-n32) may be coupled to leaf switch 406. For each job submitted, the job scheduler will select compute nodes coupled to each leaf switch as previously described with reference to Figure 1. For example, for a job submitted for eight compute nodes, the job scheduler will select four compute nodes coupled to leaf switch 404 and four compute nodes coupled to leaf switch 406. The job scheduler may select the four compute nodes coupled to leaf switch 404 acting as the initial source leaf switch randomly or by any other suitable process. For example, the job scheduler may select compute nodes n1-n4, n5-n8, n9-n12, n13-n16, four even compute nodes, four odd compute nodes, or any other combination of four compute nodes.
[0033] Once the job scheduler has selected the compute nodes from leaf switch 404 for the job, the job scheduler uses the routing tables to determine which compute nodes of leaf switch 406 are reachable using unshared routes from leaf switch 404 (i.e., unshared uplinks 408 and 410). In this example, based on the routing tables and as illustrated in Figure 4, the first uplink is used to communicate with compute nodes n17, n18, n20, and n26, the second uplink is used to communicate with compute nodes n21, n19, n25, and n27, the third uplink is used to communicate with compute nodes n28, n30, n31, and n22, and the fourth uplink is used to communicate with compute nodes n23, n24, n32, and n29.
[0034] Accordingly, based on this information, the job scheduler groups compute nodes n17, n21, n28, and n23 into a first group as indicated at 412, compute nodes n18, n19, n30, and n24 into a second group as indicated at 414, compute nodes n20, n25, n31, and n32 into a third group as indicated at 416, and compute nodes n26, n27, n22, and n29 into a fourth group as indicated at 418. Each compute node in the first group 412 is reachable from leaf switch 404 using an unshared route, each compute node in the second group 414 is
reachable from leaf switch 404 using an unshared route, each compute node in the third group 416 is reachable from leaf switch 404 using an unshared route, and each compute node in the fourth group 418 is reachable from leaf switch 404 using an unshared route. The compute nodes of each of the first group 412, the second group 414, the third group 416, and the fourth group 418 share routes with compute nodes of the other groups.
[0035] Once the job scheduler has grouped the compute nodes of leaf switch 406 acting as the destination leaf switch into groups of compute nodes having unshared routes to leaf switch 404 acting as the source leaf switch, the job scheduler selects compute nodes for the job by first selecting compute nodes within a group (i.e., group 412, 414, 416, or 418) prior to selecting any compute nodes from another group. For example, where four compute nodes coupled to leaf switch 406 are to be selected, the job scheduler will select all the compute nodes of any one of groups 412, 414, 416, and 418. In another example, where five or more compute nodes coupled to leaf switch 406 are to be selected, the job scheduler will select all the compute nodes of a first group and then select compute nodes from a second group. If more compute nodes are to be selected after all the compute nodes of the first and second groups have been selected, the job scheduler will select compute nodes from a third group and so on.
[0036] In examples where compute nodes for a job are to be selected from three or more leaf switches, leaf switch 406 becomes the source leaf switch and a third leaf switch acts as the destination leaf switch and the process is repeated for selecting compute nodes for the third leaf switch. The job scheduler selects compute nodes for the third leaf switch that have unshared routes to leaf switch 406. The process repeats to select compute nodes coupled to each leaf switch to be used for the job. In this way, the bandwidth through the fat tree blocking InfiniBand network is maximized for the job.
[0037] Figures 5 and 6 are flow diagrams illustrating one example of a method 500 for allocating compute nodes in a fat tree blocking InfiniBand network. At 502, compute nodes connected to each leaf switch in a fat tree blocking
InfiniBand network are discovered. At 504, routing tables indicating the routes
between the compute nodes through each leaf switch and spine switch in the fat tree blocking InfiniBand network are retrieved. In one example, the routing tables are static routing tables, which are based on a local identifier of each leaf switch, spine switch, and compute node. At 506, compute nodes for a job are selected from a first leaf switch acting as source.
[0038] At 508, the compute nodes connected to a second leaf switch acting as destination are grouped into a plurality of groups, each group including compute nodes having unshared routes between the first leaf switch and the second leaf switch. In one example, the compute nodes are grouped such that each group includes a number of compute nodes equal to a number of uplinks between the second leaf switch and a spine switch. At 510, compute nodes for the job are selected from the second leaf switch by selecting all the compute nodes of a group prior to selecting any compute nodes of another group. Selecting all the compute nodes of one group prior to selecting any compute nodes of another group maximizes the bandwidth through the fat tree blocking InfiniBand network for the job.
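Pulled together, steps 502 through 510 could be sketched as the orchestration below. It reuses the hypothetical helpers from the earlier sketches; the data representations (`leaf_switches` as switch id to attached nodes, `routes[(src, dst)]` as destination node to uplink) are assumptions for illustration only, not the patent's data structures.

```python
def schedule_job(job_size, leaf_switches, routes, uplinks_per_leaf):
    """Sketch of method 500: pick nodes on the first (source) leaf switch,
    then for each following leaf switch group its nodes by unshared routes
    from the previous switch (508) and select whole groups first (510).

    leaf_switches: dict of leaf switch id -> list of attached compute nodes.
    routes: dict of (source switch id, destination switch id) ->
            {destination compute node: uplink used by the source switch}.
    """
    switch_ids = list(leaf_switches)
    per_switch = split_job_across_leaf_switches(
        job_size, len(switch_ids), uplinks_per_leaf)
    # 506: compute nodes on the first leaf switch may be chosen arbitrarily.
    selection = {switch_ids[0]: leaf_switches[switch_ids[0]][:per_switch[0]]}
    # 508/510 (and 512/514): previous switch acts as source, next as destination.
    for src, dst, want in zip(switch_ids, switch_ids[1:], per_switch[1:]):
        groups = group_destination_nodes(routes[(src, dst)])
        selection[dst] = select_destination_nodes(groups, want)
    return selection
```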
[0039] As illustrated in Figure 6, method 500 may further include at 512, grouping the compute nodes connected to a third leaf switch acting as destination into a plurality of groups, each group including compute nodes having unshared routes between the second leaf switch acting as source and the third leaf switch. At 514, compute nodes for the job are selected from the third leaf switch by selecting all the compute nodes of a group for the third leaf switch prior to selecting any compute nodes of another group for the third leaf switch. Steps 512 and 514 may be repeated to select compute nodes for the job from other leaf switches until all compute nodes for the job are selected. In each case, the leaf switch acting as destination becomes the leaf switch acting as source and the next leaf switch for which compute nodes are to be selected becomes the leaf switch acting as destination and so on.
[0040] Although specific examples have been illustrated and described herein, a variety of alternate and/or equivalent implementations may be substituted for the specific examples shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or
variations of the specific examples discussed herein. Therefore, it is intended that this disclosure be limited only by the claims and the equivalents thereof.
Claims
1. A system comprising:
at least one spine switch communicatively coupled to a plurality of leaf switches via uplinks in a fat tree blocking InfiniBand network;
a plurality of compute nodes communicatively coupled to each leaf switch;
a subnet manager to configure routing tables for the at least one spine switch and each leaf switch to route communications between the plurality of compute nodes via the at least one spine switch and the plurality of leaf switches; and
a job scheduler to select compute nodes for each job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them.
2. The system of claim 1, wherein the fat tree blocking InfiniBand network comprises an N:1 fat tree blocking InfiniBand network, where "N" is an integer greater than 1.
3. The system of claim 1, wherein the job scheduler is incorporated into a node of a cluster in the fat tree blocking InfiniBand network.
4. The system of claim 1, wherein the job scheduler selects less than all of the compute nodes communicatively coupled to two leaf switches for a job.
5. The system of claim 1, wherein the subnet manager configures static routing tables, the static routing tables based on a local identifier for the at least one spine switch, each leaf switch, and each compute node.
6. A machine-readable storage medium encoded with instructions, the instructions executable by a processor of a system to cause the system to:
discover compute nodes coupled to each leaf switch in a fat tree blocking InfiniBand network;
retrieve routing tables indicating the routes between the compute nodes; and
select compute nodes for a job by selecting compute nodes having unshared routes between them prior to selecting compute nodes having shared routes between them.
7. The machine-readable storage medium of claim 6, wherein the
instructions are executable by the processor to further cause the system to: group the compute nodes connected to each leaf switch into a plurality of groups, each group for each leaf switch including compute nodes having unshared routes to compute nodes of a group for another leaf switch while compute nodes of one group for a leaf switch share routes with compute nodes of another group for the leaf switch, and
wherein the compute nodes are selected for the job by selecting all the compute nodes of one group for a leaf switch prior to selecting any compute nodes of another group for the leaf switch.
8. The machine-readable storage medium of claim 7, wherein the number of compute nodes of each group is equal to a number of uplinks of each leaf switch.
9. The machine-readable storage medium of claim 6, wherein the compute nodes are selected to maximize the bandwidth for the job through the fat tree blocking InfiniBand network.
10. The machine-readable storage medium of claim 6, wherein the routing tables are retrieved indicating static routes between the compute nodes through each leaf switch and spine switch in the fat tree blocking InfiniBand network.
11. A method comprising:
discovering compute nodes connected to each leaf switch in a fat tree blocking InfiniBand network;
retrieving routing tables indicating the routes between the compute nodes through each leaf switch and spine switch in the fat tree blocking InfiniBand network;
selecting compute nodes for a job from a first leaf switch acting as source;
grouping the compute nodes connected to a second leaf switch acting as destination into a plurality of groups, each group including compute nodes having unshared routes between the first leaf switch and the second leaf switch; and
selecting compute nodes for the job from the second leaf switch by selecting all the compute nodes of a group prior to selecting any compute nodes of another group.
12. The method of claim 11, further comprising:
grouping the compute nodes connected to a third leaf switch acting as destination into a plurality of groups, each group including compute nodes having unshared routes between the second leaf switch acting as source and the third leaf switch; and
selecting compute nodes for the job from the third leaf switch by selecting all the compute nodes of a group for the third leaf switch prior to selecting any compute nodes of another group for the third leaf switch.
13. The method of claim 11, wherein grouping the compute nodes comprises grouping the compute nodes such that each group includes a number of compute nodes equal to a number of uplinks between the second leaf switch and a spine switch.
14. The method of claim 11, wherein retrieving the routing tables comprises retrieving static routing tables, the static routing tables based on a local identifier of each leaf switch, spine switch, and compute node.
15. The method of claim 11, wherein selecting compute nodes for the job comprises selecting all the compute nodes of one group prior to selecting any compute nodes of another group to maximize the bandwidth through the fat tree blocking InfiniBand network for the job.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
IN460/CHE/2015 | 2015-01-30 | |
IN460CH2015 | 2015-01-30 | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016122714A1 true WO2016122714A1 (en) | 2016-08-04 |
Family
ID=56544115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2015/042690 WO2016122714A1 (en) | 2015-01-30 | 2015-07-29 | Job scheduling in an infiniband network based hpc cluster |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016122714A1 (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7558248B2 (en) * | 2001-11-19 | 2009-07-07 | International Business Machines Corporation | Fanning route generation technique for multi-path networks |
US20100281163A1 (en) * | 2008-02-05 | 2010-11-04 | Huawei Technologies Co., Ltd. | Method and apparatus for maintaining routing information |
US20140023084A1 (en) * | 2011-02-14 | 2014-01-23 | Mellanox Technologies Ltd. | Reducing power consumption in a fat-tree network |
US20150030034A1 (en) * | 2013-07-29 | 2015-01-29 | Oracle International Corporation | System and method for supporting multi-homed fat-tree routing in a middleware machine environment |
Non-Patent Citations (1)
Title |
---|
JESUS ESCUDERO-SAHUQUILLO ET AL.: "A new proposal to deal with congestion in InfiniBand-based fat-trees", Journal of Parallel and Distributed Computing, vol. 74, no. 1, January 2014 (2014-01-01), pages 1802-1819 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106790529A (en) * | 2016-12-20 | 2017-05-31 | 北京并行科技股份有限公司 | The dispatching method of computing resource, control centre and scheduling system |
CN106790529B (en) * | 2016-12-20 | 2019-07-02 | 北京并行科技股份有限公司 | Dispatching method, control centre and the scheduling system of computing resource |
CN114978980A (en) * | 2022-04-08 | 2022-08-30 | 新奥特(北京)视频技术有限公司 | IP signal cross point scheduling device and method |
CN114978980B (en) * | 2022-04-08 | 2024-01-19 | 新奥特(北京)视频技术有限公司 | IP signal cross point scheduling device and method |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15880601; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 15880601; Country of ref document: EP; Kind code of ref document: A1