WO2008057828A2 - Computer system and method using efficient module and backplane tiling - Google Patents


Publication number
WO2008057828A2
Authority
WO
WIPO (PCT)
Prior art keywords
node
module
inter
connections
nodes
Prior art date
Application number
PCT/US2007/082851
Other languages
French (fr)
Other versions
WO2008057828A3 (en
Inventor
Judson S. Leonard
Matthew H. Reilly
Lawrence C. Stewart
Washington Taylor
Original Assignee
Sicortex, Inc.
Priority claimed from US11/594,423 external-priority patent/US7751344B2/en
Priority claimed from US11/594,416 external-priority patent/US7660270B2/en
Application filed by Sicortex, Inc. filed Critical Sicortex, Inc.
Publication of WO2008057828A2 publication Critical patent/WO2008057828A2/en
Publication of WO2008057828A3 publication Critical patent/WO2008057828A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks non hierarchical topologies
    • G06F15/17381 Two dimensional, e.g. mesh, torus

Abstract

Computer systems and methods using efficient module and backplane tiling to interconnect computer nodes via a Kautz-like digraph. A multinode computing system includes a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k. The order equals $(k+1)k^{n-1}$. The interconnections from a node x to a node y in the topology satisfy the relationship $y = (-x \cdot k - j) \bmod O$, where $1 \le j \le k$, and the computing nodes are arranged onto a plurality of modules. Each module has an equal plurality of computing nodes on it. A majority of the inter-node connections are contained on the plurality of modules and a minority of the inter-node connections are inter-module connections. Inter-module connections are routed among modules in parallel on an inter-module connection plane.

Description

Computer System and Method Using Efficient Module and Backplane Tiling to Interconnect Computer Nodes via a Kautz-like Digraph
Cross-reference to Related Applications
[0001] This application is related to the following U.S. patent applications, the contents of which are incorporated herein in their entirety by reference:
U.S. Pat. Appl. No. 11/335,421, filed January 19, 2006, entitled SYSTEM AND METHOD OF MULTI-CORE CACHE COHERENCY;
U.S. Pat. Appl. No. 11/594,426, filed on November 8, 2006, entitled SYSTEM AND METHOD FOR PREVENTING DEADLOCK IN RICHLY-CONNECTED MULTI-PROCESSOR COMPUTER SYSTEM USING DYNAMIC ASSIGNMENT OF VIRTUAL CHANNELS;
U.S. Pat. Appl. No. 11/594,421, filed on November 8, 2006, entitled LARGE SCALE MULTI-PROCESSOR SYSTEM WITH A LINK-LEVEL INTERCONNECT PROVIDING IN-ORDER PACKET DELIVERY;
U.S. Pat. Appl. No. 11/594,442, filed on November 8, 2006, entitled MESOCHRONOUS CLOCK SYSTEM AND METHOD TO MINIMIZE LATENCY AND BUFFER REQUIREMENTS FOR DATA TRANSFER IN A LARGE MULTI-PROCESSOR COMPUTING SYSTEM;
U.S. Pat. Appl. No. 11/594,427, filed on November 8, 2006, entitled REMOTE DMA SYSTEMS AND METHODS FOR SUPPORTING SYNCHRONIZATION OF DISTRIBUTED PROCESSES IN A MULTIPROCESSOR SYSTEM USING COLLECTIVE OPERATIONS;
U.S. Pat. Appl. No. 11/594,420, filed on November 8, 2006, entitled SYSTEM AND METHOD FOR ARBITRATION FOR VIRTUAL CHANNELS TO PREVENT LIVELOCK IN A RICHLY-CONNECTED MULTI-PROCESSOR COMPUTER SYSTEM;
U.S. Pat. Appl. No. 11/594,441, filed on November 8, 2006, entitled LARGE SCALE COMPUTING SYSTEM WITH MULTI-LANE MESOCHRONOUS DATA TRANSFERS AMONG COMPUTER NODES;
U.S. Pat. Appl. No. 11/594,405, filed on November 8, 2006, entitled SYSTEM AND METHOD FOR COMMUNICATING ON A RICHLY CONNECTED MULTI-PROCESSOR COMPUTER SYSTEM USING A POOL OF BUFFERS FOR DYNAMIC ASSOCIATION WITH A VIRTUAL CHANNEL;
U.S. Pat. Appl. No. 11/594,443, filed on November 8, 2006, entitled RDMA SYSTEMS AND METHODS FOR SENDING COMMANDS FROM A SOURCE NODE TO A TARGET NODE FOR LOCAL EXECUTION OF COMMANDS AT THE TARGET NODE;
U.S. Pat. Appl. No. 11/594,447, filed on November 8, 2006, entitled SYSTEMS AND METHODS FOR REMOTE DIRECT MEMORY ACCESS TO PROCESSOR CACHES FOR RDMA READS AND WRITES; and
U.S. Pat. Appl. No. 11/594,446, filed on November 8, 2006, entitled SYSTEM AND METHOD FOR REMOTE DIRECT MEMORY ACCESS WITHOUT PAGE LOCKING BY THE OPERATING SYSTEM.
Background of the Invention
1. Field of the Invention
[0002] The present invention relates to massively parallel computing systems and, more specifically, to computing systems in which computing nodes are interconnected via a Kautz-like topology and with an efficient tiling.
2. Discussion of Related Art
[0003] Massively parallel computing systems have been proposed for scientific computing and other compute-intensive applications. The computing system typically includes many nodes, and each node may contain several processors. Various forms of interconnect topologies have been proposed to connect the nodes, including Hypercube topologies, butterfly and omega networks, tori of various dimensions, fat trees, and random networks.
[0004] One problem that has been observed with certain architectures is the issue of scalability. That is, due to inherent limitations, certain architectures are not easily scalable in any practical way. For example, one cannot simply add processing power by including another module of computing nodes into the system, or, more commonly, the expense and/or performance of the network becomes unacceptable as it grows larger. Moreover, different sized systems might need totally different module designs. For example, hypercube topologies had nodes in which the number of ports or links was dependent on the overall size of the system. Thus a node made for one size system could not be used, as a general matter, on a system with a different size.
[0005] Another problem that has been observed is that of routing the connections among nodes. Large systems typically cannot be fully connected because of inherent difficulty in routing. Thus switching architectures have been proposed, but these introduce latency from the various "hops" among nodes that may be necessary for two arbitrary nodes to communicate with one another. Reducing this latency is desirable but has proven difficult.
Summary
[0006] The invention provides computer systems and methods using efficient module and backplane tiling to interconnect computer nodes via a Kautz-like digraph. The invention also provides computer systems and methods using a Kautz-like digraph to interconnect computer nodes and having a control back channel between nodes.
[0007] Under one aspect of the invention, a multinode computing system includes a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k. The order equals $(k+1)k^{n-1}$. The interconnections from a node x to a node y in the topology satisfy the relationship $y = (-x \cdot k - j) \bmod O$, where $1 \le j \le k$, and the computing nodes are arranged onto a plurality of modules. Each module has an equal plurality of computing nodes on it.
[0008] Under another aspect of the invention, a majority of the inter-node connections are contained on the plurality of modules and a minority of the inter-node connections are inter-module connections.
[0009] Under another aspect of the invention, the amount of inter-node connections contained on the plurality of modules is a substantially optimal amount.
[0010] Under another aspect of the invention, a subset of the inter-node connections are inter-module connections and the subset are routed among modules in parallel on an inter-module connection plane.
[0011] Under another aspect of the invention, a multinode computing system includes a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k. The order equals $(k+1)k^{n-1}$. The data interconnections from a node x to a node y in the topology satisfy the relationship $y = (-x \cdot k - j) \bmod O$, where $1 \le j \le k$; and each x,y pair includes a unidirectional control link from node y to node x to convey flow control and error information from a receiving node y to a transmitting node x.
[0012] Under another aspect of the invention, a receiving node y transmits control packets on a control link to transmitting node x to identify the last correctly received data packet, and to identify whether an error has been detected in transmission.
[0013] Under another aspect of the invention, a transmitting node x stores transmitted packets and keeps them available for replay in response to control messages on the control link.
[0014] Under another aspect of the invention, a receiving node y transmits buffer status information to a transmitting node x to identify buffer availability of downstream computing nodes.
Brief Description of the Drawings
[0015] In the Drawing,
Figures 1A-C depict Kautz topologies of various order, degree and diameter;
Figure 1D depicts a module tiling of an embodiment of the invention to illustrate module interconnectivity;
Figure 2 depicts a module containing a plurality of nodes according to certain embodiments of the invention;
Figure 3 depicts a module tiling with inferior inter-module connectivity;
Figure 4 illustrates parallel routing of inter-module signals according to certain embodiments of the invention; and
Figure 5 depicts data and control links for an inter-node link or connection according to certain embodiments of the invention.
Detailed Discussion
[0016] Preferred embodiments of the invention provide massively parallel computer systems in which processor nodes are interconnected in a Kautz-like topology. Preferred embodiments provide a computing system having O nodes (i.e., order O) equally divided on M modules, each module having N nodes, N = O/M. By appropriately selecting the size N of the module and appropriately selecting the specific set of nodes to be included on a module, the inter-node routing problem may be significantly reduced. Specifically, the inter-node routing may be arranged so that a high percentage of the inter-node connections or links may remain on a module (i.e., intra-module) and avoid inter-module connections, thus reducing the amount of inter-node connections that must involve a backplane, cables, or the like. Moreover, the inter-node connections that must be inter-module (and thus require a backplane or cables, or the like) may be arranged in a parallel fashion. These features facilitate the creation of larger systems and yield inter-node connections with shorter paths and latencies. That is, preferred embodiments provide efficient and effective logical routing (i.e., the number of hops between nodes) and also provide efficient and effective physical routing (i.e., allowing high-speed interconnect to be used on large systems).
[0017] Certain embodiments of the invention use a Kautz topology for data links and data flow to interconnect the nodes, but they are not purely directed graphs because they include a back channel control link from receiver to sender. This link is used for flow control and status, among other things.
[0018] Kautz interconnection topologies are unidirectional, directed graphs (digraphs). Kautz digraphs are characterized by a degree k and a diameter n. The degree of the digraph is the maximum number of arcs (or links or edges) input to or output from any node. The diameter is the maximum number of arcs that must be traversed from any node to any other node in the topology.
[0019] The order O of a graph is the number of nodes it contains. The order of a Kautz digraph is $(k+1)k^{n-1}$. The diameter of a Kautz digraph increases logarithmically with the order of the graph.
[0020] Figure 1A depicts a very simple Kautz topology for descriptive convenience. The system 100 has degree three; that is, each node has three ingress links 110 and three egress links 112. The topology has diameter one, meaning that any node can communicate with any other node in a maximum of one hop. The topology is order 4, meaning that there are 4 nodes.
[0021] Figure 1B shows a system that is order 12 and diameter two. By inspection, one can verify that any node can communicate with any other node in a maximum of two hops. Figure 1C shows a system that is degree three and diameter three, having order 36. One sees that the complexity of the system grows quickly. It would be counter-productive to depict and describe preferred systems such as those having hundreds of nodes or more.
[0022] The table below shows how the order O of a system changes as the diameter n grows for a system of fixed degree k.
[Table (image not reproduced): order O of the system as a function of diameter n for fixed degree k]
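The entries of such a table follow directly from the order formula $O = (k+1)k^{n-1}$. The following Python sketch tabulates them for a fixed degree. The function names `kautz_order` and `order_table` and the chosen range of diameters are illustrative assumptions and do not appear in the original disclosure.

```python
def kautz_order(k: int, n: int) -> int:
    """Number of nodes in a Kautz digraph of degree k and diameter n."""
    return (k + 1) * k ** (n - 1)


def order_table(k: int, max_diameter: int) -> list[tuple[int, int]]:
    """Rows of (diameter, order) for diameters 1..max_diameter at fixed degree k."""
    return [(n, kautz_order(k, n)) for n in range(1, max_diameter + 1)]


if __name__ == "__main__":
    # Degree three is the case used in the figures and preferred embodiments.
    for n, order in order_table(3, 6):
        print(f"diameter {n}: order {order}")
```

For degree three, the diameters 4, 5, and 6 give the 108-, 324-, and 972-node systems mentioned later in the description.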
[0023] With nodes numbered from zero to O-1, the digraph can be constructed by running a link from any node x to any other node y that satisfies the following equation:
$$y = (-x \cdot k - j) \bmod O, \quad \text{where } 1 \le j \le k \tag{1}$$
Thus, any (x,y) pair satisfying (1) specifies a direct egress link from node x. For example, with reference to figure 1C, node 1 has egress links to the set of nodes 30, 31 and 32. Iterating through this procedure for all nodes in the system will yield the interconnections, links, arcs or edges needed to satisfy the Kautz topology. (As stated above, communication between two arbitrarily selected nodes may require multiple hops through the topology but the number of hops is bounded by the diameter of the topology.)
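The construction of equation 1 can be sketched in a few lines of Python. The sketch below builds the egress links for every node and measures the diameter by breadth-first search. The function names are illustrative assumptions, and the brute-force diameter check is only practical for small systems such as those of Figures 1A-C.

```python
from collections import deque


def kautz_order(k, n):
    """Order of a Kautz digraph of degree k and diameter n."""
    return (k + 1) * k ** (n - 1)


def egress_links(x, k, order):
    """Targets of node x per equation (1), listed for j = 1..k."""
    return [(-x * k - j) % order for j in range(1, k + 1)]


def build_topology(k, n):
    """Map each node number to the list of nodes its egress links reach."""
    order = kautz_order(k, n)
    return {x: egress_links(x, k, order) for x in range(order)}


def diameter(adj):
    """Longest shortest path, found by a breadth-first search from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("not every node is reachable")
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    print(sorted(egress_links(1, 3, 36)))   # node 1 of the order-36 system
    print(diameter(build_topology(3, 3)))   # hop bound for that system
```

For the order-36, degree-three system, node 1 yields the targets 30, 31 and 32 given in the text, and the search confirms the hop bound of three.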
[0024] Under certain embodiments of the invention, the system is arranged into multiple modules. The modules are created to have a particular size (i.e., number of nodes on the module) and a particular set of nodes on the module. It has been observed by the inventors that careful selection of the module size and careful attention to the selection of the set of nodes to include on a given module can significantly reduce wiring problems in systems built with the Kautz topology.
[0025] More specifically, under preferred embodiments of the invention, the Kautz topology is uniformly tiled. To do this, the Kautz graph is one-to-one mapped to satisfy the following equation.
$$t : V_G \to I \times V_T \tag{2}$$
In the above, $V_G$ is the set of vertices of a Kautz graph; $V_T$ is the set of vertices of a tile (i.e., a smaller graph, implemented as a module of nodes); and I is an index set. Moreover, if (x,y) is an edge within tile T, then $(t^{-1}(i,x),\ t^{-1}(i,y))$ is an edge of Kautz graph G.
[0026] The tiles or modules are arranged to maximize the number of edges of the tile T. That is, the tiles or modules are arranged so that a maximum number of edges, arcs, or links in the Kautz topology are contained on the tiles. All the remaining edges by necessity are inter-tile (or inter-module). By doing this, node interconnections will be maximized to remain intra-module.
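How well a candidate assignment of nodes to modules keeps links on-module can be measured by simple counting. The Python sketch below assumes the node numbering of equation 1; `module_of` is a hypothetical callback that maps a node number to a module number, and the contiguous assignment shown is a naive baseline, not the tiling of the invention.

```python
def egress_links(x, k, order):
    """Targets of node x per equation (1), listed for j = 1..k."""
    return [(-x * k - j) % order for j in range(1, k + 1)]


def intra_module_fraction(k, order, module_of):
    """Fraction of all links whose source and target share a module."""
    total = 0
    intra = 0
    for x in range(order):
        for y in egress_links(x, k, order):
            total += 1
            if module_of(x) == module_of(y):
                intra += 1
    return intra / total


def contiguous(x):
    """Naive baseline: nodes 0-8 on module 0, 9-17 on module 1, and so on."""
    return x // 9


if __name__ == "__main__":
    print(intra_module_fraction(3, 36, contiguous))
```

For the order-36, degree-three topology, placing consecutively numbered nodes on the same module leaves no link on-module at all, which illustrates why the choice of node set for each module matters.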
[0027] Conventionally, the vertices of a Kautz graph of degree k and diameter n are labeled as follows, with each integer $s_i$ being base k+1. Adjacent integers must differ.
$$s_1 s_2 \cdots s_n \in \mathbb{Z}_{k+1}^{\,n}, \qquad s_i \neq s_{i+1} \tag{3}$$
[0028] A de Bruijn graph is closely related to a Kautz graph. A de Bruijn graph has vertices that may be labeled by strings of n integers base k, as follows:
$$c_1 c_2 \cdots c_n \in \mathbb{Z}_k^{\,n} \tag{4}$$
[0029] The vertices of a degree k, diameter n Kautz graph can be mapped to the vertices of a degree k, diameter n-1 de Bruijn graph as follows:
$$\tau : s_1 \cdots s_n \to c_1 \cdots c_{n-1}, \qquad c_i = (s_{i+1} - s_i) \pmod{k+1} - 1 \tag{5}$$
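The mapping of equation 5 can be checked by enumeration on a small graph. The Python sketch below lists every Kautz label, applies the mapping, and lets one confirm that the result is a de Bruijn label over k symbols. The function names are illustrative assumptions, and labels are represented as tuples of integers.

```python
from collections import Counter
from itertools import product


def kautz_labels(k, n):
    """All length-n strings over {0..k} in which adjacent symbols differ."""
    return [s for s in product(range(k + 1), repeat=n)
            if all(a != b for a, b in zip(s, s[1:]))]


def to_de_bruijn(s, k):
    """Equation (5): c_i = (s_{i+1} - s_i) mod (k+1) - 1, each c_i in 0..k-1."""
    return tuple((b - a) % (k + 1) - 1 for a, b in zip(s, s[1:]))


if __name__ == "__main__":
    k, n = 3, 3
    labels = kautz_labels(k, n)
    images = Counter(to_de_bruijn(s, k) for s in labels)
    print(len(labels), len(images), set(images.values()))
```

Because adjacent Kautz symbols differ, each difference is nonzero modulo k+1, so subtracting one always lands in the range 0 to k-1. Each de Bruijn label is the image of k+1 Kautz labels, one for each choice of the leading symbol.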
[0030] Consequently, the edges, links or arcs in a Kautz graph may be expressed as follows:
$$(s_0 c_1 c_2 \cdots c_{n-1},\ [s_0 + c_1 + 1]\, c_2 c_3 \cdots c_n) \tag{6}$$
where $[s_0 + c_1 + 1]$ is taken modulo k+1.
[0031] To make the tiling scalable to arbitrary diameter graphs, the tile M must be equivalent to a subgraph of a de Bruijn graph of diameter m and degree k containing all the nodes of the de Bruijn graph but only a subset of the edges, subject to the condition that the edges on the tile cannot form any directed loops. In order to minimize inter-module wiring, the subgraph with the maximal number of intra-module edges (without directed loops) should be chosen, subject to the condition that the tile can be extended to form a complete tiling of the system.
[0032] To generate a complete tiling, it is possible to use a map $\Pi : G \to M$ from the nodes of the complete graph G to the nodes of the tile M which respects the edge structure of the de Bruijn graph of diameter m on which the tile is based. This map may in particular be chosen to satisfy the following conditions:
$$\Pi(P(u)) = P(\Pi(u)),\ \forall u \in G \qquad \Pi(C(u)) = C(\Pi(u)),\ \forall u \in G$$
where C(u) denotes the set of nodes which are reached from edges beginning at node u and P(u) denotes the set of nodes from which node u can be reached by following a single edge.
[0033] Under certain embodiments of the invention, each module has $k^m$ nodes, and each node on the module can be assigned a label $d_1 \ldots d_m \in \mathbb{Z}_k^m$ such that inter-node connections that are intra-module correspond to a subset of the edges $(d_1 \ldots d_m,\ d_2 \ldots d_{m+1})$ of a de Bruijn graph of diameter m and degree k, subject to the condition that there are no directed closed loops formed from the inter-node connections on a module.
[0034] Under certain embodiments of the invention, maps $\Pi$ satisfying the conditions stated above for P(u) and C(u) may be defined by expressing the $d_i$'s as a "discrete differential" function of the node labels $s_0 \ldots s_n$ of the Kautz graph through
$$d_i = f(c_{i+n-m},\ c_i) \tag{7}$$
wherein f(x,y) is a function which for fixed X acts as a permutation on $\mathbb{Z}_k$ through $y \to f(X,y)$ and which for fixed Y acts as a permutation on $\mathbb{Z}_k$ through $x \to f(x,Y)$, and where the $c_i$'s encode the Kautz coordinates $s_i$ through
$$c_i = s_i - s_{i-1} - 1 \bmod (k+1) \tag{8}$$
[0035] Under certain embodiments, f(x,y) equals x + y mod k, or f(x,y) equals x - y mod k.
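The requirement that f act as a permutation in each argument separately can be verified by exhaustive check for small k. The Python sketch below tests the two choices of f named in the text; the helper name `is_permutation_in_each_argument` is an illustrative assumption, and the product function is included only as a counterexample.

```python
def is_permutation_in_each_argument(f, k):
    """True if y -> f(X, y) and x -> f(x, Y) are both permutations of 0..k-1."""
    symbols = set(range(k))
    fixed_x_ok = all({f(x, y) for y in range(k)} == symbols for x in range(k))
    fixed_y_ok = all({f(x, y) for x in range(k)} == symbols for y in range(k))
    return fixed_x_ok and fixed_y_ok


K = 3  # degree used in the preferred embodiments


def f_sum(x, y):
    return (x + y) % K


def f_diff(x, y):
    return (x - y) % K


def f_product(x, y):
    # Counterexample: with x fixed at 0 every y maps to 0.
    return (x * y) % K


if __name__ == "__main__":
    for f in (f_sum, f_diff, f_product):
        print(f.__name__, is_permutation_in_each_argument(f, K))
```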
[0036] Given a map $\Pi$ with the conditions defined above, the tiling may then be defined as follows. Choose a vertex $x_0 = d_1 \ldots d_{N-n}$ of the tile (or module) T. Associated with this vertex of T is a set of vertices in the larger Kautz graph, each of which has the same value of $\Pi(u) = x_0$. Define the index set by the remaining indices on this set of vertices (i.e., $s_0 c_1 \ldots c_n$). This defines $t^{-1}(i, x_0)$ for all i. If there are any edges in T containing $x_0$, the definition is extended. For example, consider if T contains the edge $(x_0, x_1)$. For each i in I, there is a unique vertex in the Kautz graph which is reached by an edge from $t^{-1}(i, x_0)$ and which has $d_1 \ldots d_{N-n} = x_1$. Define this vertex to be $t^{-1}(i, x_1)$. Continue in the same way for further edges containing either $x_0$ or $x_1$. Each time a new edge is included, the map $t^{-1}$ is defined for the new value of x. In this fashion the complete tiling may be completed.
[0037] Tilings constructed in the fashion of the previous discussion automatically have the parallel routing property. The benefits of parallel routing are described below.
[0038] Figure 1D, for example, shows a module or tile for a very simple Kautz topology of order 36 and degree three. Each module has nine nodes, as depicted.
[0039] The table shows how the nodes and modules connect. Notice how the linear labels are distributed among modules. For example, linearly labeled nodes 0-9 are not all assigned to module 0. As mentioned above, the interconnection among nodes is defined by equation 1, and the assignment among modules is a result of the tiling method employed. This example of figure 1D is particularly simple in comparison to the larger systems of preferred embodiments. The size of preferred embodiments is prohibitively large to depict by figures or tables and instead is explained by the mathematics above. This example is utilized to illustrate the complexity of module assignment and the interconnections among nodes.
[Table (image not reproduced): node and module connections by linear label for the example of Figure 1D]
[0040] Under preferred embodiments, module size is an integral power of the degree (k). Certain embodiments maximize this size as described above, i.e., largest subgraph without directed loops, but others may be smaller for practical considerations in building modules. These are substantially optimal in terms of maximizing edges to be intra-module.
[0041] Certain embodiments use a module size of 27 nodes where each node is of degree 3. Each module has a particular set of nodes thereon (as described above) and may be used to build Kautz topologies of 108, 324, 972 or more nodes, or de Bruijn topologies with multiples of 27 nodes.
[0042] Figure 2 depicts a module arrangement having 27 nodes, numbered 0 through 26 in the upper right corner of nodes. These node numbers are, in certain embodiments, the numbering schema of equations 7 and 8. That is, the node numbers shown are adjacent in the number space provided by the discrete differential numbering scheme outlined above, though they need not be adjacent in the numbering of nodes of the Kautz topology as expressed in equation 1. The node identifier is expressed in the upper right corner of the node in decimal form, and in the middle of the node it is expressed in ternary form.
[0043] As illustrated, each node identifies the egress links 202 and ingress links 204. Focusing on egress links for the time being (with the explanation extending to ingress links too), node 7 has egress links going to nodes 21, 22, and 23 (upper right notation, i.e., node identifier) on other modules in the system. The figure depicts just the numbering scheme and not the node identification within the Kautz topology. As mentioned above, the actual interconnectivity is defined by equation 1. Thus, some connections depicted in figure 2 carry the same node number even though, in the larger system, they go to different nodes. For example, the figure shows nodes 17, 26 and 8, each with output links to another node (off module) identified by number 26. However, the node 26 driven by nodes 17 and 26 (upper right of figure 2) is on a different module than the node 26 driven by node 8. The actual nodes involved are governed by the above equations.
[0044] Figure 4 depicts a simplified diagram, drawn in perspective, to illustrate the parallel routing that results from the tiling approach discussed above. A first module 402 has an output pin 404 in communication with backplane trace 408 on backplane 406. (A backplane layer is illustrated, but other structures such as midplanes or the like may be used.) The trace 408 is parallel and horizontal to pin 410 on module 412. That is, the backplane trace has no vertical runs. Under preferred embodiments of the invention, every backplane run will be parallel in a similar manner. Though many layers may be needed for the backplane when there are a significant number of modules, the backplane traces will not need vertical runs to connect the relevant pins and links, and instead runs will be horizontal and parallel. (Alternatively, if things were rotated, the runs could all be vertical and parallel.) This routing greatly facilitates the ability to keep high signal integrity, which in turn greatly improves the ability to run the inter-node and inter-module connections at very high speed. It also enables larger systems to be built while maintaining satisfactory signal integrity (i.e., designs do not need to decrease bus speed to enable large systems). Using the example of figure 2, the trace 408 may correspond to the connection from the node with discrete differential number 5 (lower part of figure) to another node on a different module (412) with discrete differential identifier 17. Notice in the upper right of figure 2 that every node 17 receives an input from another node 5 (discrete differential number). In certain embodiments, such as a 972 node system with modules like that shown in figure 2, each module will have 39 pins (e.g., 404 and 410), and every backplane trace will run horizontal and parallel to other traces. Only one backplane layer 406 is shown in figure 4 for clarity, but a system of 972 nodes may require about 20 such layers. Such a backplane, however, will be faster and have better signal integrity than one that did not have parallel routes and which needed vertical runs, vias and the like to provide connectivity among modules.
[0045] Referring back to figure 2, each node on the system may communicate with any other node on the system by appropriately routing messages onto the communication fabric via an egress link 202. Some of these egress links will be inter-module, such as the ones depicted in connection with node 7. Others will be intra-module, such as those depicted in connection with node 2, which go to nodes 6, 7, and 8 on the same module. Some nodes have some links intra-module and some inter-module; see, for example, node 12.
[0046] Under certain embodiments, any data message on the fabric includes routing information in the header of the message (among other information). The routing information specifies the entire route of the message. In certain degree three embodiments, the routing information is a bit string of 2-bit routing codes, each routing code specifying whether a message should be received locally (i.e., this is the target node of the message) or identifying one of three egress links. Naturally other topologies may be implemented with different routing codes and with different structures and methods under the principles of the invention. Under certain embodiments, each node has tables programmed with the routing information. For a given node x to communicate with another node z, node x accesses the table and receives a bit string for the routing information. As will be explained below, this bit string is used to control various switches along the message's route to node z, in effect specifying which link to utilize at each node during the route. Another node j may have a different bit string when it needs to communicate with node z, because it will employ a different route to node z and the message may utilize different links at the various nodes in its route to node z. Thus, under certain embodiments, the routing information is not literally an "address" (i.e., it does not uniquely identify node z) but instead is a set of codes to control switches for the message's route.
[0047] Under certain embodiments, the routes are determined a priori based on the interconnectivity of the Kautz topology as expressed in equation 1. That is, the Kautz topology is defined, and the various egress links for each node are assigned a code (i.e., each link being one of three egress links). Thus, the exact routes for a message from node x to node z are known in advance, and the egress link selections may be determined in advance as well. These link selections are programmed as the routing information. These tables may be reprogrammed as needed, for example, to route around faulty links or nodes.
[0048] Certain embodiments modify the routing information in the message header en route for easier processing. For example, a node will analyze a 2-bit field of the routing information to determine which link the message should use, e.g., one of three egress links, or whether it should be kept local (i.e., this is the destination node). This could be the least significant numeral, digits or bits of the routing field, but it need not be limited to such (i.e., it depends on the embodiment). Once a node determines that a message should be forwarded on one of the egress links, the node shifts the routing bit string accordingly (e.g., by 2 bits) so that the next node in the route can perform the same set of operations, i.e., process the lowest two bits of the route code to determine whether the message should be handled locally or forwarded on a specific one of three egress links.
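The shift-and-forward handling of the routing bit string can be sketched as follows. This is a minimal illustration under stated assumptions, not the encoding of any actual embodiment: the code value 0 is assumed to mean "deliver locally", codes 1 to 3 are assumed to select the egress link with the same value of j in equation 1, and the first hop is assumed to occupy the least significant two bits. The function names are hypothetical.

```python
LOCAL = 0  # assumed code for "this node is the target"; the text gives no values


def encode_route(links):
    """Pack egress-link choices (each 1..3) into an integer, first hop in the low bits."""
    route = 0
    for hop, link in enumerate(links):
        if link not in (1, 2, 3):
            raise ValueError("link codes must be 1, 2 or 3")
        route |= link << (2 * hop)
    return route  # the remaining high bits are zero, i.e. LOCAL


def forward(node, route, k=3, order=36):
    """Handle one hop. Returns (next_node, shifted_route), or (node, None) if local."""
    code = route & 0b11                     # examine the lowest 2-bit field
    if code == LOCAL:
        return node, None
    next_node = (-node * k - code) % order  # equation (1) with j = code (assumed)
    return next_node, route >> 2            # shift so the next node sees its own code


def trace(src, route):
    """Follow a route from src and return the list of nodes visited."""
    path = [src]
    node = src
    while True:
        node, route = forward(node, route)
        if route is None:
            return path
        path.append(node)


if __name__ == "__main__":
    print(trace(1, encode_route([1, 3, 2])))
```

Under these assumptions, a message leaving node 1 of the order-36 system with link choices 1, 3, 2 visits nodes 32, 9 and 7, and node 7 sees a zero field and keeps the message.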
[0049] The routing information, in these embodiments, is used to identify portions in a cross point buffer to hold the data so that the message may be stored until it may be forwarded on the appropriate link. (Certain embodiments support cut-through routing to avoid the buffer if the appropriate link is not busy when the message arrives or becomes free during reception of the message.)
[0050] In certain embodiments, the messages also contain other information such as virtual channel identification information. As explained in more detail in the related and incorporated applications, virtual channel information is used so that each link may be associated with multiple virtual channels and so that deadlock avoidance techniques may be implemented.
[0051] Experimentation shows that with a preferred arrangement 48% of the inter-node links may be routed inter-module, and 52% can be routed intra-module. Other degrees, diameters, orders, and module sizes may be used under the principles of the invention.
[0052] In contrast, other methods of selecting nodes may yield significantly fewer intra-module connections (and as a result more inter-module connections). Figure 3, for example, shows an arrangement also involving 27 nodes per module. However, even though the arrangement seems well-organized (e.g., tree-like), only about 30% of the inter-node connections remain on module, meaning more of the inter-node connections will require a backplane or the like, inhibiting the ability to build larger systems.
[0053] Under certain embodiments the computing system is not configured as a Kautz digraph in pure form in that the communication is not purely unidirectional. Instead, certain preferred embodiments have data communication implemented on unidirectional directed links (or circuits) and use a back channel control link (or circuit) for flow control and maintenance purposes.
[0054] Figure 5, for example, shows two nodes, sender 502 and receiver 504, following the unidirectional convention used above in discussing Kautz topologies. These nodes could correspond, for example, to two intra-module nodes such as nodes 18 and 2 in figure 2. The link 506 connecting the two nodes includes unidirectional data lanes 508 and unidirectional control lanes 510. The direction of the data lanes 508 is consistent with the convention used above in discussing the unidirectional flow of the Kautz digraph. The direction of the control link is in the opposite direction, i.e., from data receiving node 504 to data transmitting node 502. The arrangement is asymmetric in the sense that there are more forward data lane circuits than there are reverse control lane circuits. In certain embodiments there are eight data circuits and one control circuit between two connected nodes.
[0055] In certain embodiments each sender 502 assigns a link sequence number (LSN) to every outgoing packet. The LSN is included in the packet header. The sender 502 also keeps transmitted packets in a replay buffer until it has been confirmed (more below) that the packets have been successfully received.
[0056] Each receiving node 504 receives packets and keeps track of the LSN of the most recently received error-free packet as part of its buffer status. Periodically, the receiver node 504 transmits this buffer status back to the sender using the control circuit 510. In certain embodiments, this status is transmitted as frequently as possible. If there has been no error, the LSN in the status corresponds to the most recently received packet. If an error has been detected, the buffer status will indicate the error and include the LSN of the last packet correctly received.
[0057] In response, the sending node 502 identifies the LSN in the buffer status packet and from this determines that all packets up to and including the identified LSN have been received at the receiving node 504 in acceptable condition. The sender 502 may then delete packets from the replay buffer with LSNs up to and including the LSN received in the status packet. If an error has been detected, the sender will resend all packets in the replay buffer starting after the LSN of the buffer status. (The receiving node will have dropped such packets in anticipation of the replay, and to ensure that all packets from the same source, going to the same destination, along the same route, with the same virtual channel are delivered and kept in order.) Thus, packet error detection and recovery is performed at the link level. Likewise, packets are guaranteed to be delivered in order at the link level.
[0058] The control circuits are also used to convey buffer status information for downstream nodes to indicate whether buffer space associated with virtual channels is free or busy. As is explained in the incorporated patent applications, the nodes use a cross point buffer to store data from the links and to organize and control the data flow as virtual channel assignments over the links to avoid deadlock. More specifically, a debit/credit mechanism is used in which the receiving node 504 informs the sending node 502 of how much space is available in the buffers (not shown) of the receiving node 504 for each virtual channel and port. Under certain embodiments a sender 502 will not send information unless it knows that there is buffer space for the virtual channel in the next downstream node along the route. The control packet stream carries a current snapshot of the cross point buffer entry utilization for each of the cross point buffers it has (which depends on the degree of the system).
[0059] The control link may also be used for out-of-band communication between connected nodes by using otherwise unused fields in the control packet. This mechanism may be used for miscellaneous purposes.
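The link-level replay and debit/credit behavior described in paragraphs [0055] through [0058] can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the specification: the class names `Sender`, `Receiver` and `CreditGate`, the use of -1 to mean "nothing received yet", unbounded LSNs with no wraparound, and a status report modeled as a simple `(LSN, error)` pair.

```python
from collections import OrderedDict


class Sender:
    """Hypothetical model of sender 502: stamps LSNs and holds a replay buffer."""

    def __init__(self):
        self.next_lsn = 0
        self.replay = OrderedDict()  # LSN -> payload, oldest first

    def send(self, payload):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.replay[lsn] = payload  # kept until the receiver confirms it
        return lsn, payload

    def on_status(self, last_good_lsn, error):
        """Process a buffer status report; return the packets to resend."""
        for lsn in [l for l in self.replay if l <= last_good_lsn]:
            del self.replay[lsn]  # confirmed as received
        return list(self.replay.items()) if error else []


class Receiver:
    """Hypothetical model of receiver 504: accepts packets only in LSN order."""

    def __init__(self):
        self.last_good = -1  # assumption: -1 means nothing received yet
        self.error = False
        self.delivered = []

    def receive(self, lsn, payload, corrupted=False):
        if corrupted:
            self.error = True  # reported in the next status
            return
        if lsn != self.last_good + 1:
            return  # dropped: out of order while awaiting replay, or duplicate
        self.last_good = lsn
        self.error = False
        self.delivered.append(payload)

    def status(self):
        return self.last_good, self.error


class CreditGate:
    """Hypothetical sender-side view of free buffer entries per virtual channel."""

    def __init__(self):
        self.free = {}

    def update(self, snapshot):
        self.free.update(snapshot)  # snapshot: virtual channel -> free entries

    def try_send(self, vc):
        if self.free.get(vc, 0) > 0:
            self.free[vc] -= 1  # debit one entry for the packet being sent
            return True
        return False  # no known buffer space downstream, so hold the packet


sender, receiver = Sender(), Receiver()
packets = [sender.send(p) for p in ("a", "b", "c")]
receiver.receive(*packets[0])
receiver.receive(*packets[1], corrupted=True)  # simulated link error
receiver.receive(*packets[2])                  # dropped, awaiting replay
for lsn, payload in sender.on_status(*receiver.status()):
    receiver.receive(lsn, payload)
```

In this sketch the receiver drops every packet after the corrupted one, so the replay restores in-order delivery without any reordering logic at the receiver.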
[0060] In a Kautz network no single or (if degree three or higher) double failure can isolate any working node or subset of nodes from the rest of the network. No single link or node failure increases the network diameter by more than one hop. Certain embodiments of the invention use multiple paths in the topology to avoid congestion and faulty links or nodes.
[0061] Many of the teachings here may be extended to other topologies including de Bruijn topologies. Likewise, though the description was in relation to a large-scale computing system, the principles may apply to other digital systems.
[0062] Certain embodiments use a discrete differential in the low order positions of the label identification. This is particularly helpful for parallel routing.
[0063] The above discussion concerning Kautz tilings is applicable to de Bruijn topologies as well.
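As an illustration only, the following Python sketch assumes that the discrete differential of paragraph [0062] is the encoding recited in the claims, c_i = s_i - s_(i-1) - 1 mod (k + 1), applied to a Kautz label whose adjacent symbols differ. The function names and the example label are hypothetical.

```python
def encode_differential(s, k):
    """Map Kautz symbols s (values 0..k, adjacent symbols distinct) to values 0..k-1."""
    return [(s[i] - s[i - 1] - 1) % (k + 1) for i in range(1, len(s))]


def decode_differential(first, c, k):
    """Invert encode_differential given the first Kautz symbol."""
    s = [first]
    for ci in c:
        s.append((s[-1] + ci + 1) % (k + 1))
    return s


k = 3
label = [0, 3, 1, 2]                   # hypothetical Kautz label over 0..3
coded = encode_differential(label, k)  # each entry falls in 0..k-1
restored = decode_differential(label[0], coded, k)
```

Because adjacent Kautz symbols are never equal, each difference taken mod (k + 1) lies in 1..k, and subtracting one yields a value in Z_k, the alphabet of a de Bruijn label.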
[0064] Certain embodiments of the invention allow multiple instances of what is described above as a tile to be combined onto a module. For example, two tiles may be formed on a module, and a module under this arrangement will have p·k^m nodes, where p is an integer.
[0065] Appendix A (attached) is a listing of a particular 972 node, 36 module, degree three system. The columns identify the Kautz number (0-971) and the node identification (per module), and specify the other nodes to which each node connects. From this, one can determine node-to-node interconnectivity for each node in the system.
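The node-to-node connectivity of a 972 node, degree three system can be regenerated from the relationship y = (-x*k - j) mod O with 1 ≤ j ≤ k. The Python sketch below does this under the assumption that k = 3 and n = 6, which gives O = (k + 1)k^(n-1) = 972. It does not reproduce the module assignment or per-module node identification of Appendix A; the function names are hypothetical.

```python
from collections import deque


def kautz_successors(x, k, order):
    """One-hop successors of node x: y = (-x*k - j) mod order for j = 1..k."""
    return [(-x * k - j) % order for j in range(1, k + 1)]


def build_graph(k, n):
    """Adjacency lists for order (k + 1) * k**(n - 1), diameter n, degree k."""
    order = (k + 1) * k ** (n - 1)
    return {x: kautz_successors(x, k, order) for x in range(order)}


def eccentricity(graph, src):
    """Largest hop count from src to any node, via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(graph):
        raise ValueError("graph is not strongly connected from src")
    return max(dist.values())


graph = build_graph(k=3, n=6)
diameter = max(eccentricity(graph, x) for x in graph)
modules = len(graph) // 27  # 27 nodes per module, as in the degree-3 example
```

Running the breadth-first search from every node is affordable at this size and gives a direct check that the generated graph has the stated order, degree and diameter.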
[0066] While the invention has been described in connection with certain preferred embodiments, it will be understood that it is not intended to limit the invention to those particular embodiments. On the contrary, it is intended to cover all alternatives, modifications and equivalents as may be included in the appended claims. Some specific figures and source code languages are mentioned, but it is to be understood that such figures and languages are, however, given as examples only and are not intended to limit the scope of this invention in any manner.
[0067] What is claimed is:

Claims

1. A multinode computing system, comprising: a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k; wherein the order O = (k + 1)k^(n-1); wherein the interconnections from a node x to a node y in the topology satisfies the relationship y = (-x*k-j) mod O, where 1 ≤ j ≤ k; and wherein the computing nodes are arranged onto a plurality of modules each module having an equal plurality of computing nodes thereon.
2. The system of claim 1 wherein a majority of the inter-node connections are contained on the plurality of modules and a minority of the inter-node connections are intermodule connections.
3. The system of claim 1 wherein the amount of inter-node connections contained on the plurality of modules is a substantially optimal amount.
4. The system of claim 1 wherein a subset of the inter-node connections are intermodule connections and the subset are routed among modules in parallel on an inter-module connection plane.
5. A multinode computing system, comprising: a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k, wherein the order O = (k + 1)k^(n-1); wherein the interconnections from a node x to a node y in the topology satisfies the relationship y = (-x*k-j) mod O, where 1 ≤ j ≤ k; and wherein the computing nodes are arranged onto a plurality of modules; wherein each module has k^m nodes, and each node on the module can be assigned a label d_1...d_m ∈ Z_k^m such that inter-node connections that are intra-module correspond to a subset of the edges (d_1...d_m, d_2...d_(m+1)) of a de Bruijn graph of diameter m and degree k, subject to the condition that there are no directed closed loops formed from the inter-node connections on a module.
6. The system of claim 5 wherein the number of intra-module connections is substantially optimal.
7. The system of claim 5 wherein the d_i's are expressed as a function of the node labels s_0...s_n of the Kautz graph through
Figure imgf000019_0001
wherein f(x,y) is a function which for fixed X acts as a permutation on Z_k through y → f(X,y) and which for fixed Y acts as a permutation on Z_k through x → f(x,Y), and where the c_i's encode the Kautz coordinates s_i through
c_i = s_i - s_(i-1) - 1 mod (k + 1)
8. The system of claim 7 wherein f(x,y) equals x + y mod k.
9. The system of claim 7 wherein f(x,y) equals x - y mod k.
10. The system of claim 5 wherein each module is degree 3 and contains 27 computing nodes.
11. The system of claim 5 wherein a subset of the inter-node connections are intermodule connections and the subset are routed among modules in parallel on an inter-module connection plane.
12. A multinode computing system, comprising: a large plurality of computing nodes interconnected via a de Bruijn topology having order O, diameter n, and degree k; wherein the order O = k^n; wherein the interconnections from a node x to a node y in the topology satisfies the relationship y = (x*k+j) mod O, where 0 ≤ j ≤ k-1; and wherein the computing nodes are arranged onto a plurality of modules each module having an equal plurality of computing nodes thereon.
13. A multinode computing system, comprising: a large plurality of computing nodes interconnected via a Kautz topology having order O, diameter n, and degree k; wherein the order O = (k + 1)k^(n-1); wherein data interconnections from a node x to a node y in the topology satisfies the relationship y = (-x*k-j) mod O, where 1 ≤ j ≤ k; and wherein each x,y pair includes a unidirectional control link from node y to node x to convey flow control and error information from a receiving node y to a transmitting node x.
14. The system of claim 13 wherein a receiving node y transmits control packets on a control link to transmitting node x to identify the last correctly received data packet, and to identify whether an error has been detected in transmission.
15. The system of claim 14 wherein a transmitting node x stores transmitted packets and keeps them available for replay in response to control messages on the control link.
16. The system of claim 13 wherein a receiving node y transmits buffer status information to a transmitting node x to identify buffer availability of downstream computing nodes.
17. The system of claim 16 wherein a transmitting node x includes logic to transmit a packet downstream only if all necessary buffers are available.
PCT/US2007/082851 2006-11-08 2007-10-29 Computer system and method using efficient module and backplane tiling WO2008057828A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US11/594,423 2006-11-08
US11/594,423 US7751344B2 (en) 2006-11-08 2006-11-08 Computer system and method using a kautz-like digraph to interconnect computer nodes and having control back channel between nodes
US11/594,416 US7660270B2 (en) 2006-11-08 2006-11-08 Computer system and method using efficient module and backplane tiling to interconnect computer nodes via a Kautz-like digraph
US11/594,416 2006-11-08

Publications (2)

Publication Number Publication Date
WO2008057828A2 true WO2008057828A2 (en) 2008-05-15
WO2008057828A3 WO2008057828A3 (en) 2008-09-12

Family

ID=39365210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/082851 WO2008057828A2 (en) 2006-11-08 2007-10-29 Computer system and method using efficient module and backplane tiling

Country Status (1)

Country Link
WO (1) WO2008057828A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5134690A (en) * 1989-06-26 1992-07-28 Samatham Maheswara R Augumented multiprocessor networks
US5513371A (en) * 1992-01-07 1996-04-30 International Business Machines Corporation Hierarchical interconnection network architecture for parallel processing, having interconnections between bit-addressible nodes based on address bit permutations
US20060056308A1 (en) * 2004-05-28 2006-03-16 International Business Machines Corporation Method of switching fabric for counteracting a saturation tree occurring in a network with nodes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SMIT ET AL.: 'An Algorithm for Generating Node Disjoint Routes in Kautz Digraphs' PARALLEL PROCESSING SYMPOSIUM, FIFTH INTERNATIONAL, [Online] May 1991, XP010034084 Retrieved from the Internet: <URL:http://www.doc.utwente.nl/18709/1/Kautz_digraphs_smit.pdf> *

Also Published As

Publication number Publication date
WO2008057828A3 (en) 2008-09-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07844688

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07844688

Country of ref document: EP

Kind code of ref document: A2