CN115380271A - Topology aware multi-phase method for trunked communication - Google Patents


Info

Publication number: CN115380271A
Application number: CN202080098259.8A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: processing units, node, reduction, computing node, processing unit
Inventors: 叶剑西, 彭立伟, 宋东洋, 唐陵波, 王绍创, 冉仟元, 冯飞, 闫磊, 董建波, 段建军, 杨健
Current Assignee: Alibaba Group Holding Ltd
Original Assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Publication of CN115380271A

Classifications

    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06N 3/063: Computing arrangements based on biological models; neural networks; physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means


Abstract

In distributed training, a first computing node may divide a global reduction operation into a plurality of sub-operations. The first computing node may perform a reduce-scatter sub-operation among a first set of processing units in the first computing node according to a first cluster communication algorithm, a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm, and a global aggregation sub-operation among the first set of processing units of the first computing node according to the first cluster communication algorithm.

Description

Topology aware multi-phase method for trunked communication
Background
As neural networks, such as deep neural networks (DNNs), develop rapidly, various application fields (e.g., computer vision, natural language processing, speech recognition, etc.) have emerged and benefit from the inherent versatility and flexibility of neural networks. However, due to the increasing complexity and stricter accuracy requirements of neural network applications, the size of neural network models and of the training data required to train them also increases significantly. This inevitably results in longer and longer training times, adversely affecting the effectiveness and timeliness with which trained models can meet changing application environments.
To reduce the time needed to train a neural network model, a distributed training system using parallel training may be used. In general, a distributed training system may include a large number of computing nodes or servers distributed over a network and may distribute subsets of computing tasks to these computing nodes or servers, which perform the computations in parallel. However, data communications between computing nodes or servers in the distributed training system create a lower bound, or bottleneck, on how much the training time can be reduced in the distributed training system. This is particularly true when the distributed training system includes various types of heterogeneous connections or interconnections within and between compute nodes or servers that exhibit different characteristics in terms of latency, bandwidth, topology, and the like. This heterogeneity of connections or interconnections increases the difficulty and complexity of designing a data communications network for the computing nodes or servers in the distributed training system.
In addition, network congestion may be incurred due to excessive data flow through particular network switches or connections between computing nodes or servers in the distributed training system, which may result in extended training times due to delays in processing the training results. An excessive amount of data flow through a particular network switch or connection may be due to a lack of control over how data sent between computing nodes or servers is routed.
Drawings
The detailed description is made with reference to the accompanying drawings. In the drawings, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears. The use of the same reference symbols in different drawings indicates similar or identical items.
FIG. 1 illustrates an example environment in which a distributed training system may be applied.
FIG. 2 illustrates an exemplary compute node in greater detail.
Fig. 3A shows a ring configuration in which a preset number of nodes are connected to each other.
Fig. 3B shows a halving-doubling configuration in which a preset number of nodes are connected to each other.
Fig. 4 shows a schematic diagram of an exemplary cluster communication library.
FIG. 5 illustrates an exemplary topology aware multi-stage algorithm.
Fig. 6 illustrates an exemplary ring-based algorithm for the intra-node reduce-scatter phase of a compute node.
Fig. 7 illustrates an exemplary halving-doubling algorithm for the intra-node reduce-scatter phase of a compute node.
FIG. 8 illustrates an exemplary halving-doubling algorithm for the inter-node global reduction phase.
FIG. 9 illustrates an exemplary halving-doubling algorithm for the inter-node global reduction phase in more detail.
Fig. 10 illustrates an exemplary ring-based cluster communication algorithm.
FIG. 11 illustrates an example scenario in which the intra-node reduce-scatter phase, the inter-node global reduction phase, and the intra-node global aggregation phase are performed in a parallel or overlapping manner.
FIG. 12 shows an exemplary fat tree (fat-tree) network topology.
Fig. 13 illustrates an example scenario in which a first congestion avoidance method is used.
Fig. 14 illustrates an example scenario using the second congestion avoidance method.
FIG. 15 illustrates an exemplary topology aware multi-stage approach.
Fig. 16 illustrates a first exemplary network congestion avoidance method.
Fig. 17 illustrates a second exemplary network congestion avoidance method.
FIG. 18 illustrates an exemplary parallel approach based on a hybrid architecture in distributed training.
Detailed Description
SUMMARY
As described above, data communication among the compute nodes of existing distributed training systems creates a performance bottleneck that limits scalability. Furthermore, due to the wide variety of network architectures (including, for example, Ethernet, InfiniBand, PCIe, NVLink, NVSwitch, QPI/UPI, etc.) and the large differences in network characteristics (e.g., latency, bandwidth, topology, etc.), distributed training systems are generally not well suited to exploiting such heterogeneous types of connections or interconnects when performing cluster data operations within and between compute nodes and data transfers between compute nodes. In addition, network congestion occurs because the path selection for routing data sent between compute nodes may be out of control, resulting in excessive data flow through specific network switches or connections between compute nodes in a distributed training system and extended training times caused by delays in processing training results. Furthermore, existing distributed training systems fail to distinguish between the different types of underlying fabrics when selecting algorithms for cluster operations, which also results in poor performance.
The present disclosure describes an example distributed training system. In implementations, the exemplary distributed training system may employ a fabric-aware cluster communication library that enables the distributed training system to scale linearly. In implementations, the cluster communication library may customize the communication algorithm based at least in part on an analysis of the infrastructure and the supporting network architecture to achieve a desired or maximum efficiency. In implementations, the distributed training system may divide a basic operation into a plurality of sub-operations, each of which uses one type of fabric.
In implementations, the exemplary distributed training system may implement a hybrid algorithm that allows multiple algorithms to coexist in a single cluster operation and selectively employs an algorithm for each particular architecture to improve or maximize the efficiency of the overall communication path. In implementations, the distributed training system may employ a two-process parallel algorithm that starts two concurrent processes and pipelines the use of intra-node and inter-node connections, thereby improving communication efficiency by overlapping intra-node and inter-node communications.
In implementations, the exemplary distributed training system may employ a probe-based routing control mechanism that generates a mapping from connections to paths, distributes or disseminates connections to different aggregation or intermediate switches in a communication network by reordering the participants or processes in a cluster operation, and maps data flows of the distributed training system to specific physical links, thereby avoiding network congestion.
This application describes a number of different embodiments and implementations. The following sections describe exemplary frameworks suitable for practicing various embodiments. Next, this application describes exemplary systems, devices, and processes for implementing a distributed training system.
Exemplary Environment
FIG. 1 illustrates an exemplary environment 100 that may be used to implement a distributed training system. The environment 100 may include a distributed training system 102. In this example, the distributed training system 102 may include a plurality of computing nodes or servers 104-1, 104-2, ... (hereinafter collectively referred to as computing nodes 104). In implementations, the multiple computing nodes 104 may communicate data to each other over a communication network 106.
The computing nodes 104 may be implemented as any of a variety of computing devices with computing/processing and communication capabilities, including but not limited to servers, desktop computers, laptop or notebook computers, handheld devices, netbooks, internet appliances, tablets, mobile devices (e.g., mobile phones, personal digital assistants, smart phones, etc.), and the like, or combinations thereof.
The communication network 106 may be a wireless or wired network, or a combination thereof. The network 106 may be a collection of independent networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet). Examples of such independent networks include, but are not limited to, telephone networks, cable networks, local area networks (LANs), wide area networks (WANs), and metropolitan area networks (MANs). Further, the independent networks may be wireless or wired networks, or a combination thereof. A wired network may include electrical carrier connections (e.g., communication cables, etc.) and/or optical carriers or connections (e.g., fiber optic connections, etc.). A wireless network may include, for example, a WiFi network or other radio frequency networks (e.g., Bluetooth, Zigbee, etc.). In implementations, the communication network 106 may include a plurality of inter-node interconnects or switches 108-1, 108-2, ..., 108-L (hereinafter collectively referred to as inter-node switches 108) for providing connectivity between the computing nodes 104, where L is a positive integer greater than 1.
In implementations, the environment 100 may further include a client device 110. The user may instruct the distributed training system 102 to perform training on a particular learning model (e.g., a deep neural network model) based on data sent from the client device 110 to the distributed training system 102.
Exemplary Computing Node
Fig. 2 shows the computing node 104 in more detail. In an implementation, a compute node 104 may include, but is not limited to, one or more processing units 202, input/output (I/O) interfaces 204, and/or one or more network interfaces 206, and memory 208. In an implementation, the compute node 104 may further include one or more intra-node interconnects or switches 210.
In implementations, the processing unit 202 may be configured to execute instructions stored in the memory 208 and/or received from the input/output interface 204 and/or the network interface 206. In implementations, the processing unit 202 may be implemented as one or more hardware processors including, for example, a microprocessor, a special-purpose instruction set processor, a physics processing unit (PPU), a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor, a tensor processing unit, or the like. Additionally or alternatively, the functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, example types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), and the like.
The memory 208 may include machine-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. The memory 208 is one example of a machine-readable medium.
A machine-readable medium may include volatile or nonvolatile types of removable or non-removable media, which may implement any method or technology for storage of information. The information may include machine-readable instructions, data structures, program modules, or other data. Examples of machine-readable media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other internal storage technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing node. As defined herein, a machine-readable medium does not include any transitory medium such as a modulated data signal and a carrier wave.
In an implementation, the network interface 206 may be configured to connect the computing node 104 to other computing nodes over the communication network 106. In an implementation, the network interface 206 may be established through a network interface controller (NIC), which may use hardware and software to connect the computing node 104 to the communication network 106. In implementations, each type of NIC may use a different type of fabric or connector to connect to the physical media associated with the communication network 106. Examples of various types of fabrics or connectors are found in the IEEE 802 specification and may include, for example, Ethernet (defined in 802.3), Token Ring (defined in 802.5), wireless networks (defined in 802.11), InfiniBand, and so on.
In implementations, the intra-node switches 210 may include various types of interconnects or switches, which may include, but are not limited to, a high-speed serial computer expansion bus (e.g., PCIe, etc.), a serial multi-lane near-range communication link (e.g., NVLink, a wire-based serial multi-lane communication protocol), a switch chip with multiple ports (e.g., NVSwitch, etc.), a point-to-point processor interconnect (e.g., Intel QPI/UPI, etc.), and the like.
Although only hardware components are depicted in the compute node 104 in this example, in other examples the compute node 104 may further include other hardware components and/or other software components, such as a program module 212 that executes instructions stored in the memory 208 to perform various operations, and program data 214 for storing data received for training, intermediate and final results computed during training, and so forth.
Exemplary Cluster Communication Algorithms
Fig. 3A and 3B illustrate exemplary cluster communication algorithms that may be used in the distributed training system 102. In implementations, the cluster communication algorithms may include, but are not limited to, a ring-based communication algorithm, a halving-doubling communication algorithm, and the like.
Fig. 3A illustrates a ring configuration that interconnects a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than 1) with multiple connections (i.e., N connections), divides the data (e.g., a packet or message) into multiple data blocks (i.e., N data blocks) for transmission, and requires multiple steps of communication (N-1 steps in this example) to complete a clustering operation. In each step, a node may receive data from one of its neighboring nodes, perform a particular operation on the received data to obtain a local result, and forward data to the other neighboring node. After N-1 steps, each node in the ring has incorporated data from the other nodes in the ring, and an additional N-1 steps are required to broadcast the respective local results so that the final result is spread to all nodes. For each node, the total data size forwarded is 2S, where S represents the data size or message size.
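By way of example and not limitation, the following sketch simulates the ring configuration described above in plain Python; the block indexing and the use of a sum as the reduction are illustrative assumptions, not part of the described system. It verifies that the reduce-scatter pass and the broadcast pass each take N-1 steps and together reproduce the full reduction on every node.

```python
# Illustrative simulation of the ring configuration of FIG. 3A (assumed sum
# reduction, N = 4 nodes). Reduce-scatter and broadcast each take N - 1 steps,
# and each node forwards about 2S data in total for a message of size S.
N = 4
data = [[(r + 1) * 10 + b for b in range(N)] for r in range(N)]   # data[node][block]
expected = [sum(data[r][b] for r in range(N)) for b in range(N)]  # block-wise sums

# Reduce-scatter: at step t, node r sends block (r - t) % N to node (r + 1) % N
# and reduces the block received from node (r - 1) % N into its local copy.
for t in range(N - 1):
    outgoing = [(r, (r - t) % N, data[r][(r - t) % N]) for r in range(N)]
    for src, blk, val in outgoing:
        data[(src + 1) % N][blk] += val
# Node r now owns the fully reduced block (r + 1) % N.

# Broadcast (all-gather): circulate the reduced blocks for another N - 1 steps.
for t in range(N - 1):
    outgoing = [(r, (r + 1 - t) % N, data[r][(r + 1 - t) % N]) for r in range(N)]
    for src, blk, val in outgoing:
        data[(src + 1) % N][blk] = val

assert all(data[r] == expected for r in range(N))
print("every node holds the full reduction:", data[0])
```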
Fig. 3B shows a halving-doubling configuration interconnecting a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than 1). In this configuration, the nodes communicate with each other in pairs, requiring only N/2 connections per step of communication. In a first step, adjacent nodes are paired together, and each node sends half of its message or data to its paired node and receives and processes the other half of the paired node's message or data. Thus, intermediate results are disseminated between the paired nodes. In each subsequent step, new pairs are formed with doubled distance, and the data size processed is halved. After log2(N) steps of communication, the results are spread among all nodes in the halving-doubling configuration. The local results in each node are then broadcast to the other nodes through an additional log2(N) steps of communication.
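By way of example and not limitation, the following sketch simulates the halving-doubling configuration for a power-of-two number of nodes; the block-range bookkeeping is an illustrative assumption. It confirms that each of the two passes completes in log2(N) steps, with the processed data size halving (and the pairing distance doubling) at each step.

```python
import math

# Illustrative recursive-halving reduce-scatter followed by recursive-doubling
# broadcast, i.e. the halving-doubling pattern of FIG. 3B (assumed N = 8).
N = 8
data = [[(r + 1) * 100 + b for b in range(N)] for r in range(N)]
expected = [sum(data[r][b] for r in range(N)) for b in range(N)]
lo, hi = [0] * N, [N] * N                 # block range each node is responsible for
steps = int(math.log2(N))

# Recursive halving: pair distance 1, 2, 4, ...; the kept range halves each step.
for s in range(steps):
    dist = 1 << s
    snap = [row[:] for row in data]
    for r in range(N):
        p = r ^ dist
        mid = (lo[r] + hi[r]) // 2
        lo[r], hi[r] = (lo[r], mid) if r < p else (mid, hi[r])
        for b in range(lo[r], hi[r]):
            data[r][b] += snap[p][b]

# Recursive doubling: walk the distances back down, copying the partner's
# fully reduced range and merging ownership until every node has everything.
for s in reversed(range(steps)):
    dist = 1 << s
    snap = [row[:] for row in data]
    rng = [(lo[r], hi[r]) for r in range(N)]
    for r in range(N):
        p = r ^ dist
        for b in range(rng[p][0], rng[p][1]):
            data[r][b] = snap[p][b]
        lo[r], hi[r] = min(lo[r], rng[p][0]), max(hi[r], rng[p][1])

assert all(data[r] == expected for r in range(N))
print("halving-doubling all-reduce completed in", 2 * steps, "steps")
```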
Exemplary Cluster Communication Library
Fig. 4 shows a schematic diagram depicting an example cluster communication library 400 that may be used by the distributed training system 102. In implementations, the cluster communication library is a communication library designed to provide high performance, high scalability, and strong availability, and may be configured to provide support not only for standard cluster operations such as global reduce (Allreduce) and global gather (Allgather) operations, but also for application-specific custom operations. In implementations, the cluster communication library 400 may employ different types of interconnects or switches having different characteristics (e.g., latency, bandwidth, topology) and provide a mechanism to collect information about the underlying hardware in the network and compute nodes, so that topology-aware algorithms may be designed based on one or more pieces of the collected information.
In an implementation, the cluster communication library 400 may provide flexibility to allow multiple algorithms to be executed in a single operation and improve performance (e.g., performance of communication and training, etc.) by exploiting parallelism between intra-node and inter-node communications. Further, the clustered communication library 400 may utilize multiple NICs in a computing node with traditional or new mapping algorithms and eliminate network congestion through topology-aware placement of connections.
In an implementation, the cluster communication library 400 may include a software stack 402. In an implementation, the software stack 402 may include a number of components that may include, but are not limited to, a transport component 404, an arithmetic component 406, a communicator component 408, and a library context component 410. In implementation, the software stack 402 may be designed in a modular fashion to allow for versatility and extensibility.
In an implementation, the transport component 404 may be responsible for the transfer or transmission of peer-to-peer (P2P) data in intra-node and inter-node communications. By way of example and not limitation, the cluster communication library 400 may support TCP (Transmission Control Protocol) and RDMA (Remote Direct Memory Access) for inter-node communication, as well as P2P fabrics for intra-node communication, such as PCIe (Peripheral Component Interconnect Express), NVLink/NVSwitch, and QPI/UPI (Quick Path Interconnect/Ultra Path Interconnect), among others. For RDMA communications, the transport component 404 may also be configured to manage a processing unit (e.g., a graphics processing unit (GPU) device) and a memory region (MR) in host memory, together with the corresponding memory buffers.
In implementations, the operations component 406 may provide a set of basic operations and various network algorithms. For example, the basic operations may be configured with algorithms supported by the cluster communication library 400. Further, the operations component 406 can allow a user to define new operations based on these basic operations to implement heterogeneity-aware operations that can employ an optimal or better algorithm for each type of fabric.
In implementations, the communicator component 408 can be associated with a software process and can be configured to perform manipulations and processing on a processing unit (such as a GPU device). The communicator component 408 can maintain or record information (e.g., rank ID, IP address, etc.) about other peer components and maintain connectivity with the peer components. In an implementation, the communicator component 408 can further collect intra-node and inter-node topology information and use the information to guide algorithm design. In implementations, the intra-node information may include, but is not limited to, the type of interconnect, the distance between the locations of the processing units, the distance between the processing units and the network interface controllers, and the like. In implementations, the inter-node information may include, but is not limited to, the number of available network interface controllers, the topology of the cluster of computing nodes, and the locations of the computing nodes in the cluster.
In implementations, the library context component 410 can be configured to expose one or more application interfaces for setting system configurations (e.g., environment variables), managing the communicator component 408, and providing other functionality such as logging.
Further, in some cases, the clustered communication library 400 may further include or provide a plurality of tools and utilities 412 for topology aware design, testing and evaluation, and usability improvement. By way of example and not limitation, tools and utilities 412 can include performance testing tools for transport component 404 to facilitate algorithm design and evaluation, probe-based routing mechanisms for ensuring system availability, and other functions, such as device management functions that can be extended to support devices other than GPUs.
Exemplary Topology-Aware Multi-Stage Algorithm for Cluster Communication
In implementations, a cluster communication may be defined as a communication involving a group of processing units or processes, and the operations of the cluster communication may be performed jointly by all the processing units or processes included in the group. Examples of cluster communication operations may include, but are not limited to, global reduce operations, global gather operations, reduce-scatter operations, and the like. In implementations, global reduction operations are among the important building blocks of cluster communication in distributed training and involve performing a reduction on data across the processes in a group. Examples of reductions may include, but are not limited to, a summation operation, an averaging operation, a maximum operation, a minimum operation, and the like.
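By way of example and not limitation, the following toy snippet illustrates the semantics of a global reduction (here a summation) across a group: every participant ends up with the same reduced result. It is purely illustrative and does not reflect any particular library interface.

```python
# Toy global reduction: three ranks each hold a two-element vector, and after
# the operation every rank holds the element-wise sum of all inputs.
inputs = {0: [1, 2], 1: [10, 20], 2: [100, 200]}
reduced = [sum(vec[i] for vec in inputs.values()) for i in range(2)]
outputs = {rank: reduced[:] for rank in inputs}   # each rank receives the full result
print(outputs)   # {0: [111, 222], 1: [111, 222], 2: [111, 222]}
```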
By way of example and not limitation, the division of a cluster operation into micro-operations or sub-operations is illustrated here using a global reduction operation. In implementations, the distributed training system 102 may employ a topology-aware multi-stage algorithm that divides the global reduction operation into a plurality of micro-operations or sub-operations and selectively chooses one or more of the micro-operations or sub-operations as needed, thereby reducing the amount of data transferred by eliminating micro-operations or sub-operations that may not be needed. In implementations, the distributed training system 102 may separate the cluster communication algorithm from the micro-operations or sub-operations and allow independent or separate matching between algorithms and micro-operations or sub-operations based on the underlying fabric information, thereby maximizing or optimizing bandwidth utilization by reducing the amount of data transmitted.
Fig. 5 illustrates an exemplary topology-aware multi-stage algorithm 500 that may be used for the distributed training system 102. In an implementation, the topology-aware multi-stage algorithm 500 may include multiple phases, such as an intra-node reduce-scatter phase 502, an inter-node global reduction phase 504, and an intra-node global aggregation phase 506.
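By way of example and not limitation, the following sketch previews how the three phases compose, using nested Python lists to stand in for processing units; the per-phase shortcuts (a direct local sum for each phase) are illustrative stand-ins for the cluster communication algorithms detailed in the following paragraphs, not an actual implementation.

```python
# Illustrative composition of the three phases of FIG. 5 (assumed 4 nodes with
# 8 processing units each, one data block per local processing unit).
NODES, GPUS = 4, 8
BLOCKS = GPUS
# grad[node][gpu][block] is the local gradient shard held by one processing unit.
grad = [[[(n * GPUS + g + 1) * 10 + b for b in range(BLOCKS)] for g in range(GPUS)]
        for n in range(NODES)]
expected = [sum(grad[n][g][b] for n in range(NODES) for g in range(GPUS))
            for b in range(BLOCKS)]

# Phase 1 - intra-node reduce-scatter: GPU g of every node ends up holding the
# node-local sum of block g (any intra-node algorithm could produce this).
for n in range(NODES):
    node_sum = [sum(grad[n][g][b] for g in range(GPUS)) for b in range(BLOCKS)]
    for g in range(GPUS):
        grad[n][g][g] = node_sum[g]

# Phase 2 - inter-node global reduction: GPUs with the same rank number form a
# group and reduce their block across nodes (again, ring or halving-doubling).
for g in range(GPUS):
    group_sum = sum(grad[n][g][g] for n in range(NODES))
    for n in range(NODES):
        grad[n][g][g] = group_sum

# Phase 3 - intra-node global aggregation: broadcast each reduced block locally.
for n in range(NODES):
    full = [grad[n][g][g] for g in range(GPUS)]
    for g in range(GPUS):
        grad[n][g] = full[:]

assert all(grad[n][g] == expected for n in range(NODES) for g in range(GPUS))
print("global all-reduce reproduced by the three phases")
```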
In implementations, the distributed training system 102 may first assign portions of the pending data for training to the plurality of computing nodes 104 such that each computing node 104 of the plurality of computing nodes 104 receives a respective portion of the data. In an implementation, each compute node 104 may divide its respective portion of data into a plurality of data slices (e.g., N data slices, where N is a positive integer) and assign the plurality of data slices to a plurality of local processing units or processes (e.g., N local processing units or processes) included in the respective compute node 104.
In implementations, in the intra-node reduce-scatter phase 502, each local processing unit or process included in each computing node 104 may divide the data slice allocated to it into a plurality of data blocks (e.g., M blocks). The local processing units or processes included in each computing node 104 may then cooperatively perform an intra-node reduce-scatter sub-operation to obtain all reduction results for the plurality of data blocks in the respective computing node 104 over a plurality of steps or iterations according to a particular cluster communication algorithm. At the end of the intra-node reduce-scatter phase 502, the local processing units or processes included in a compute node 104 hold the reduction results (also referred to as reduce-scatter results) of all the processing units or processes included in that compute node 104, with each processing unit or process holding a different data block.
By way of example and not limitation, two exemplary cluster communication algorithms, a ring-based algorithm and a halving-doubling algorithm, are used as examples to illustrate particular mechanisms or operations in the intra-node reduce-scatter phase 502. However, other cluster communication algorithms may also be used in the intra-node reduce-scatter phase 502. For example, the distributed training system 102 may select a particular cluster communication algorithm for use in the intra-node reduce-scatter phase 502 based on information about a plurality of factors collected by the cluster communication library 400. In an implementation, the plurality of factors may include, but are not limited to, the type of interconnection between processing units (or processes) in a compute node, the number of interconnections in a compute node, and the like.
For example, in the intra-node reduce-scatter phase 502, the distributed training system 102 may employ a first cluster communication algorithm for a first computing node and a second cluster communication algorithm for a second computing node, where the second computing node has the same or different processing and connection capabilities as the first computing node, and the first cluster communication algorithm may be the same as or different from the second cluster communication algorithm. By way of example and not limitation, for a computing node interconnected using NVSwitch or PCIe and including a number of processing units or processes for training that is a power of 2, the distributed training system 102 may employ a halving-doubling algorithm, while for another computing node interconnected using NVLink or otherwise and using a number of processing units or processes for training that is not a power of 2, the distributed training system 102 may employ a ring-based algorithm, and so on.
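By way of example and not limitation, a selection rule of the kind just described could be sketched as follows; the function and parameter names are hypothetical and are not part of the described cluster communication library.

```python
# Illustrative per-node algorithm selection mirroring the example above:
# halving-doubling for power-of-two processing-unit counts on NVSwitch or
# PCIe, and a ring-based algorithm otherwise.
def pick_intra_node_algorithm(num_units: int, interconnect: str) -> str:
    power_of_two = num_units > 0 and (num_units & (num_units - 1)) == 0
    if interconnect in ("NVSwitch", "PCIe") and power_of_two:
        return "halving-doubling"
    return "ring"

print(pick_intra_node_algorithm(8, "NVSwitch"))   # halving-doubling
print(pick_intra_node_algorithm(6, "NVLink"))     # ring
```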
Fig. 6 illustrates an exemplary ring-based algorithm 600 for a computing node in the intra-node reduce-scatter phase 502. For purposes of brevity and description, the exemplary ring-based algorithm includes a configuration with only one ring. However, any ring-based algorithm including a configuration with more than one ring may be employed, e.g., with each ring processing a portion of the data blocks.
In this example, the depicted compute node includes M processing units or processes (with rank identifiers or numbers 1, 2, ..., M), and the data assigned to each processing unit or process is divided into M data blocks. In a first step, a processing unit or process (e.g., P1) may send one of its M data blocks to the next processing unit or process (e.g., P2) in the ring, receive another data block from the previous processing unit or process (e.g., PM) in the ring, and reduce the received data block with the corresponding local data block to obtain a partial reduction result. In each subsequent step (e.g., the kth step), the processing unit or process (e.g., P1) may send a partial reduction result (in this example, the partial reduction result previously obtained by P1) to the next processing unit or process (e.g., P2) in the ring, receive a partial reduction result (in this example, the partial reduction result previously obtained by PM) from the previous processing unit or process (e.g., PM), and reduce the received partial reduction result with another local data block that has not previously been sent or reduced with other data.
As shown in Fig. 6, different data blocks may be received and reduced, or sent, by different processing units or processes in the compute node at each step. In addition, each processing unit or process may send, or receive and reduce, different data blocks (or partial results) at different steps. At the end of the intra-node reduce-scatter phase 502 (i.e., after M-1 steps), each processing unit or process includes one result data block that stores the reduction result of the M corresponding data blocks of the M processing units or processes in the compute node. For example, after M-1 steps, the data block of P1 at the "top position" stores the reduction result of all data blocks of the M processing units or processes corresponding to the "top position," as shown in Fig. 6.
Fig. 7 shows an exemplary halving-doubling algorithm 700 for a computing node in the intra-node reduce-scatter phase 502. In this example, the described compute node includes M processing units or processes (with M set to 8 in this example). In a first step, a processing unit or process (e.g., P1) may send half of the data allocated to it to a nearby processing unit or process (e.g., P2), receive half of the data allocated to that other processing unit or process (e.g., P2), and reduce the received data with the other half of the data allocated to the processing unit or process (e.g., P1) to obtain a partial reduction result. In each subsequent step, the processing unit or process (e.g., P1) may send half of the partial reduction results obtained locally in the previous step to a different processing unit or process located at an increasing distance from the processing unit or process (i.e., P1), and reduce the received partial reduction results with the other half of the partial reduction results obtained locally in the previous step to obtain a new partial reduction result for the processing unit or process (i.e., P1). At the end of the intra-node reduce-scatter phase 502 (i.e., after log2(M) steps, or 3 steps in this example as shown in Fig. 7), each processing unit or process may include a result data block storing the reduction result of the M corresponding data blocks of the M (in this example, 8 as shown in Fig. 7) processing units or processes in the compute node. For example, after log2(M) steps, the data block of P1 at the "bottom position" stores the reduction result of all data blocks of the M (in this example, 8 as shown in Fig. 7) processing units or processes corresponding to the "bottom position," as shown in Fig. 7.
In implementations, in the inter-node global reduction phase 504, the inter-node global reduction sub-operation is node-based (i.e., performed between different compute nodes) and may be performed between processing units (or processes) included in the different compute nodes. In implementations, the processing units (or processes) of different compute nodes that hold the reduction result (or reduce-scatter result) for the same data block position form the same group, and the members of a group communicate their respective results to each other to perform the inter-node global reduction sub-operation. At the end of the inter-node global reduction phase 504, each processing unit or process of each compute node in a particular group may own a particular data block of the reduction results of all processing units or processes in the same group, and processing units or processes of different groups own different data blocks of the reduction results of the corresponding processing units or processes in the different groups.
In an implementation, the distributed training system 102 may select a particular cluster communication algorithm based on one or more selection criteria, and may implement the inter-node global reduction sub-operation based on the selected cluster communication algorithm. Examples of the particular cluster communication algorithm may include, but are not limited to, a ring-based algorithm (e.g., a hierarchical ring algorithm, a multi-ring algorithm, etc.), a halving-doubling algorithm, and the like. In an implementation, the one or more selection criteria may include, but are not limited to, the topology of the communication network (e.g., communication network 106) connecting the computing nodes, the number of switches used in the communication network, the type of switches used in the communication network, the network type of the communication network, and so forth.
By way of example and not limitation, two exemplary cluster communication algorithms, a ring-based algorithm and a halving-doubling algorithm, are used as examples to illustrate particular mechanisms or operations in the inter-node global reduction phase 504. However, other cluster communication algorithms may be used in the inter-node global reduction phase 504 based on one or more of the selection criteria described above.
Fig. 8 and 9 show an exemplary halving-doubling algorithm for the inter-node global reduction phase 504. In this example, as shown in FIG. 8, for purposes of brevity and description, the described distributed training system 102 includes a plurality of compute nodes (i.e., node 0, node 1, node 2, ..., node N-1, with N shown as 4 in FIG. 8 for illustration), where each compute node includes eight processing units or processes with corresponding rank numbers (i.e., rank 0, rank 1, rank 2, ..., rank M-1, with M shown as 8 in FIG. 8 for illustration). As shown in Fig. 8, processing units or processes having the same rank number in their respective compute nodes hold the reduction result (or reduce-scatter result) for the same data block position, and they form the same group. For example, the processing units or processes having a rank number of 0 in the respective compute nodes hold the reduced data block at the first position of their local data blocks, and they form the same group (e.g., group 0). In implementations, processing units or processes in different groups may not communicate with each other.
In an implementation, the inter-node global reduction sub-operation may be performed separately between the processing units (or processes) in each group, so that each processing unit (or process) in a group may obtain the complete reduction result for the same data block across all processing units (or processes) in the same group. Similar to the mechanism of the halving-doubling algorithm described above for the intra-node reduce-scatter phase, the processing units or processes in each group may iteratively send the local reduction result for the corresponding data block to another processing unit or process in the respective group, receive the respective local reduction result for the corresponding data block from another processing unit or process at a doubled or increasing distance, and perform a reduction operation on the received reduction result with the local reduction result.
FIG. 9 illustrates an example scenario in which the halving-doubling algorithm is applied to eight compute nodes. In this example, as shown in FIG. 9, the number of steps performed in the inter-node global reduction phase 504 using the halving-doubling algorithm is log2(N) = log2(8) = 3, where N is the number of compute nodes. In a first step, a first processing unit or process (e.g., the processing unit or process with rank number 0) of a certain group in a first computing node (e.g., node 0) may send its local reduction result to a second processing unit or process of the same group in a second computing node (e.g., node 1), receive the local reduction result from the second processing unit or process of the same group in the second computing node, and perform a reduction operation on its local reduction result and the received local reduction result to obtain a new local reduction result.
In a second step, the first processing unit or process (e.g., the processing unit or process with rank number 0) in the first computing node (e.g., node 0) may send its new local reduction result to a third processing unit or process (e.g., with rank number 0) of the same group in a third computing node (i.e., node 2 in this example), receive the local reduction result from the third processing unit or process of the same group in the third computing node, and perform a reduction operation on its new local reduction result and the received local reduction result to obtain another new local reduction result.
In the third (or last) step, the first processing unit or process performs the same operations, but this time with a fourth processing unit or process of the same group in a fourth computing node (i.e., node 4 in this example).
At the end of inter-node global reduction phase 504, each processing unit or process of each compute node in a particular group may own a particular data block of the reduction results of all processing units or processes in the same group, and processing units or processes of different groups own different data blocks of the reduction results of the corresponding processing units or processes in different groups.
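By way of example and not limitation, the pairing pattern of the inter-node halving-doubling steps described above can be sketched as follows, assuming nodes are indexed from 0 and the partner at step s is the node whose index differs in bit s.

```python
# Illustrative pairing pattern for the inter-node phase (assumed node count
# of 8): at step s, a node exchanges the group's partial result with the node
# whose index differs in bit s, so the pairing distance is 1, then 2, then 4.
N_NODES = 8
for step in range(3):                              # log2(8) = 3 steps
    dist = 1 << step
    pairs = sorted({tuple(sorted((n, n ^ dist))) for n in range(N_NODES)})
    print(f"step {step + 1}: node pairs {pairs}")
```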
Similar to the halving-doubling algorithm, the inter-node global reduction sub-operation may be performed separately between the processing units (or processes) in each group of the plurality of computing nodes (e.g., N computing nodes) using a ring-based algorithm, such that each processing unit (or process) in a group may obtain the complete reduction result for the same data block across all processing units (or processes) in the same group. Similar to the mechanism of the ring-based algorithm described above for the intra-node reduce-scatter phase, the processing units or processes of each group may iteratively send the local reduction result for the corresponding data block to the processing unit or process of the corresponding group in the next compute node, receive the local reduction result for the corresponding data block from the processing unit or process of the corresponding group in the previous compute node, and perform a reduction operation on the received reduction result using their local reduction result. At the end of the inter-node global reduction phase 504 (i.e., after N-1 steps), each processing unit or process of each compute node in a particular group may own a particular data block of the reduction results of all processing units or processes in the same group, and processing units or processes of different groups own different data blocks of the reduction results of the corresponding processing units or processes in the different groups.
In implementations, similar to the intra-node reduce-scatter phase 502, in the intra-node global aggregation phase 506 the global aggregation sub-operation may be performed across the local processing units or processes in each of the plurality of compute nodes of the distributed training system 102 so that, within the same compute node, the respective reduction results obtained in the inter-node global reduction phase 504 are locally broadcast to each other. At the end of the intra-node global aggregation phase 506, each processing unit or process in each computing node of the distributed training system 102 may have the reduction result of the entire data distributed among the multiple computing nodes.
By way of example and not limitation, a ring-based algorithm is used herein to illustrate how the reduction results obtained by local processing units or processes in the compute nodes of the distributed training system 102 (in the inter-node global reduction phase 504) are broadcast. However, the distributed training system 102 may employ different or the same cluster communication algorithms (e.g., a halving and doubling algorithm, etc.) for different computing nodes. For example, the distributed training system 102 may employ different or the same cluster communication algorithm for different computing nodes based on a number of factors associated with each individual computing node. In an implementation, the plurality of factors may include, but are not limited to, the type of interconnection between processing units (or processes) in a compute node, the number of interconnections in a compute node, and the like.
FIG. 10 illustrates an exemplary ring-based cluster communication algorithm 1000 for broadcasting the individual reduction results of processing units or processes to each other within a computing node of the distributed training system 102. As shown in Fig. 10, in a first step, each of the M processing units or processes in the compute node (e.g., P1) may send its reduction result obtained in the inter-node global reduction phase 504 to one of its two neighboring processing units or processes (e.g., P2 in this example) and receive a reduction result from the other of the two neighboring processing units or processes (e.g., PM in this example) according to the ring configuration. In each subsequent step, each processing unit or process (e.g., P1) may send the most recently received reduction result to one of the two neighboring processing units or processes (e.g., P2 in this example) and receive another reduction result from the other of the two neighboring processing units or processes (e.g., PM in this example) according to the ring configuration. At the end of the intra-node global aggregation phase 506 (i.e., after M-1 steps), each processing unit or process in the compute node may have the reduction results of all processing units or processes in the compute node.
Exemplary Parallel Algorithm
In an implementation, the distributed training system 102 may perform the multiple phases included in the topology-aware multi-stage algorithm, namely the intra-node reduce-scatter phase 502, the inter-node global reduction phase 504, and the intra-node global aggregation phase 506, sequentially in that order. In implementations, the distributed training system 102 may alternatively overlap, partially or substantially, some of the intra-node reduce-scatter phase 502, the inter-node global reduction phase 504, and the intra-node global aggregation phase 506, and perform some portions of these phases in parallel.
For example, because the intra-node reduce-scatter phase 502 and the intra-node global aggregation phase 506 involve intra-node data communications or transmissions (i.e., data communications or transmissions within compute nodes), while the inter-node global reduction phase 504 involves inter-node data communications or transmissions (i.e., data communications or transmissions between compute nodes), in an implementation the distributed training system 102 may allow at least a portion of the intra-node reduce-scatter phase 502 and the inter-node global reduction phase 504, as well as a portion of the inter-node global reduction phase 504 and the intra-node global aggregation phase 506, to be executed in parallel. This increases the utilization of intra-node and inter-node links (or connections) and avoids intra-node links being idle while inter-node links are in use, and vice versa.
FIG. 11 illustrates an example scenario in which the intra-node reduce-scatter phase, the inter-node global reduction phase, and the intra-node global aggregation phase are performed in a parallel or overlapping manner. As shown in Fig. 11, a processing unit or process of a compute node may divide a block of data into a plurality of chunks (in this example, 4 chunks as shown in Fig. 11) and distribute the chunks to at least two concurrent threads (e.g., a first thread 1102 and a second thread 1104). In this manner, the processing unit or process may pipeline the intra-node and inter-node sub-operations for execution by the at least two concurrent threads (in this example, the first thread 1102 and the second thread 1104).
By way of example and not limitation, the first thread 1102 may perform an inter-node global reduction sub-operation (i.e., an operation in the inter-node global reduction phase 504) on a first data block (e.g., data block 1106) while the second thread 1104 performs an intra-node reduce-scatter sub-operation (i.e., an operation in the intra-node reduce-scatter phase 502) on a second data block (e.g., data block 1108). As another example, the first thread 1102 may perform an intra-node global aggregation sub-operation (i.e., an operation in the intra-node global aggregation phase 506) on a third data block (e.g., data block 1110) while the second thread 1104 performs an inter-node global reduction sub-operation on a fourth data block (e.g., data block 1112).
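By way of example and not limitation, the following sketch renders the overlap of FIG. 11 as a static schedule; the time slots, stream offset, and chunk numbering are illustrative assumptions rather than the actual thread implementation. Staggering the second stream by one slot keeps the two streams from occupying the inter-node link at the same time.

```python
# Illustrative pipelining schedule: two concurrent streams each push their
# chunks through the three phases, with the second stream offset by one slot
# so an intra-node stage of one stream overlaps an inter-node stage of the other.
STAGES = [("intra-node reduce-scatter", "intra-node link"),
          ("inter-node global reduction", "inter-node link"),
          ("intra-node global aggregation", "intra-node link")]

def schedule(chunks, offset):
    """Map time slot -> (chunk, stage, link) for one stream."""
    slots = {}
    t = offset
    for chunk in chunks:
        for stage, link in STAGES:
            slots[t] = (chunk, stage, link)
            t += 1
    return slots

stream0 = schedule(chunks=[0, 2], offset=0)   # first concurrent thread
stream1 = schedule(chunks=[1, 3], offset=1)   # second thread, one slot behind

for t in range(max(max(stream0), max(stream1)) + 1):
    s0 = stream0.get(t, ("-", "idle", "-"))
    s1 = stream1.get(t, ("-", "idle", "-"))
    print(f"slot {t}: stream0 -> chunk {s0[0]}: {s0[1]:<30} | "
          f"stream1 -> chunk {s1[0]}: {s1[1]}")
```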
By way of example and not limitation, another operation involved in distributed neural network training may further be used as an example. In implementations, the distributed training system 102 may divide the global gather operation involved in distributed neural network training into multiple sub-operations, i.e., an inter-node global gather sub-operation, an intra-node global gather sub-operation, and a data replication sub-operation. In implementations, the inter-node global gather sub-operation may be similar to the inter-node global reduction sub-operation described above, except that it broadcasts data (e.g., reduction results) rather than performing a reduce operation (e.g., reducing a received result with a local result), and the intra-node global gather sub-operation may be similar or identical to the intra-node global aggregation sub-operation described above. In implementations, the data replication sub-operation may include an operation that copies result data (e.g., the final reduction results) to an output parameter.
In an implementation, a processing unit or process of a compute node may divide a block of data into a plurality of blocks (e.g., four blocks) and distribute the blocks to at least two concurrent threads (e.g., a first thread and a second thread), and pipeline intra-node and inter-node sub-operations for execution by the at least two concurrent threads.
For example, a first thread may perform an inter-node global gather sub-operation on a first data block while a second thread performs an intra-node global gather sub-operation on a second data block. In addition, the first thread may perform a data replication sub-operation on the third data block while the second thread performs an inter-node global gather sub-operation on the fourth data block.
Exemplary Congestion Avoidance Methods
In implementations, data or traffic congestion may occur at some switches or links in the communication network 106 due to data transmissions between the multiple computing nodes in the distributed training system 102. To avoid congestion, the distributed training system 102 may employ a predetermined congestion avoidance strategy to distribute or transfer data traffic among the various switches or links in the communication network 106, thereby preventing an excessive amount of data from passing through any one switch or link in the communication network 106 during training (e.g., during an inter-node global reduction sub-operation or phase, or an inter-node global aggregation sub-operation or phase).
In implementations, the distributed training system 102 may employ a first congestion avoidance method that includes a ring generation policy followed by route management of the network flows. Additionally or alternatively, the distributed training system 102 may employ a second congestion avoidance method that includes a policy of reordering node identifications and subsequent routing management of the network flows. Depending on the type of network topology of the communication network 106 and the processing and communication capabilities of the plurality of computing nodes 104, etc., the distributed training system 102 may select one or more of the first congestion avoidance method or the second congestion avoidance method for routing data flows between all or a portion of the plurality of computing nodes in the distributed training system 102. Further, the distributed training system 102 may selectively combine portions of the first and second congestion avoidance methods to implement a new congestion avoidance method. In implementations, both the first and second congestion avoidance methods may aim to specify a dedicated network path for each direction of inter-node data flow, in such a way that the inter-node data flows have no or few conflicts with each other.
In implementation, the distributed training system 102 may obtain or establish a mapping relationship between communication connections and routing paths (e.g., physical links) in advance. In implementation, a connection path data structure in the form of a table, linked list, or the like may be created and used to store information for the mapping. In implementations, the distributed training system 102 may selectively or strategically use particular paths to establish a connection between any two computing nodes based on the connection path data structure.
In an implementation, the distributed training system 102 may determine the mapping between communication connections and routing paths by enabling each computing node of the distributed training system 102 to send probe packets to other computing nodes by varying the source/destination ports of the probe packets to exhaust the possible communication connections between the computing nodes of the distributed training system 102. It should be appreciated that the distributed training system 102 may employ other methods to explore the mapping between communication connections and routing paths, and is not limited herein.
By way of example and not limitation, a first computing node may send a plurality of probe packets to a second computing node, each probe packet having a different combination of source and destination ports, with the source and destination addresses being the addresses of the first and second computing nodes, respectively. Each probe packet may record the switches through which it passes, so that when the respective probe packet is returned to the first computing node, the first computing node can learn the entire routing path taken by that probe packet. Accordingly, a connection path data structure may be established between the first computing node and the second computing node. Similarly, the mapping relationships (and connection path data structures) between the communication connections and routing paths of other pairs of computing nodes in the distributed training system 102 may be established accordingly.
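By way of example and not limitation, the following sketch shows how such a connection path data structure might be built from probe results; the ECMP-style hash stands in for the real network's path selection, and the port numbers and function names are illustrative assumptions rather than part of the described mechanism.

```python
import zlib

# Illustrative connection -> path table built by "probing" a simulated network.
SPINES = ["agg0", "agg1", "agg2", "agg3"]          # aggregation switches

def ecmp_spine(src_ip, dst_ip, src_port, dst_port):
    """Stand-in for the network's ECMP choice over the flow tuple."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return SPINES[zlib.crc32(key) % len(SPINES)]

def probe_paths(src_ip, dst_ip, dst_port=4791, num_probes=64):
    """Vary the source port and record which aggregation switch each
    (src_port, dst_port) combination is routed through."""
    table = {}
    for src_port in range(10000, 10000 + num_probes):
        table[(src_port, dst_port)] = ecmp_spine(src_ip, dst_ip, src_port, dst_port)
    return table

table = probe_paths("10.0.0.1", "10.0.1.1")
# Invert the table: for each aggregation switch, keep one usable port pair so a
# later connection can be pinned to a chosen path.
path_choice = {}
for ports, spine in table.items():
    path_choice.setdefault(spine, ports)
print(path_choice)
```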
For purposes of brevity and explanation, an exemplary network topology, namely a fat-tree network (or, in particular, a two-tier Clos network architecture in a full mesh topology), is used herein as an exemplary network topology for the communication network 106 associated with the distributed training system 102. However, the example congestion avoidance policies described herein may also be applicable to other network topologies.
Fig. 12 shows an exemplary fat-tree network topology 1200. In this example, the exemplary fat-tree network topology is a two-tier Clos network architecture in a full mesh topology. One tier corresponds to a tier of leaf switches 1202 connected directly to the compute nodes 1204, with each leaf switch 1202 connected to one or more compute nodes 1204. In an implementation, a compute node 1204 may include one or more network interface controllers (e.g., four network interface controllers) connected to one or more ports (e.g., four ports) of a leaf switch 1202. In an implementation, the number of network interface controllers of each compute node 1204 may be the same or different. The other tier corresponds to a tier of aggregation switches 1206 (or backbone switches 1206) connected to one or more of the leaf switches 1202.
In an implementation, if two processing units or processes included in different compute nodes are connected under the same chip switch, packets transmitted between the two processing units or processes will pass through that chip switch without passing through any aggregation switch. Alternatively, if two processing units or processes included in different compute nodes are connected under different chip switches, the data packets transmitted between the two processing units or processes will pass through one of the aggregation switches. Using the connection path data structure described above, packets transmitted between two processing units or processes can be made to flow through a designated aggregation switch by setting an appropriate combination of source and destination ports in the packets. In implementations, the routing management of the first congestion avoidance method and/or the second congestion avoidance method may aim to enable data flows from the same chip switch to different destination chip switches to pass through different aggregation switches, and/or to enable data flows from different source chip switches to the same destination chip switch to pass through different aggregation switches, thereby avoiding conflicts between data flows and preventing network congestion at the aggregation switches.
In implementation, as described in the foregoing description, the first congestion avoidance method may include a ring generation policy, and subsequent route management of network flows. The first congestion avoidance method may support various ring-based algorithms including, but not limited to, a ring algorithm, a ring blocking algorithm, a multi-ring algorithm, a hierarchical ring algorithm, an algorithm involving multiple hierarchical rings, a node-aware ring algorithm, and the like.
In an implementation, the policy for ring generation may include a topology-aware policy for ring generation. By way of example and not limitation, a topology-aware policy for ring generation may include a plurality of rules to establish a ring or ring configuration for processing units or processes. In implementation, a processing unit or process in a computing node may send/receive data to/from a processing unit or process in another computing node through a network interface controller. In an implementation, a processing unit or process in a computing node may be associated with a single network interface controller or multiple network interface controllers to transmit data to processing units or processes in other computing nodes. Additionally or alternatively, multiple processing units or processes may be associated with a single network interface controller and use it to transmit data to processing units or processes in other computing nodes.
In an implementation, the plurality of rules may include, but are not limited to, priorities of processing units or processes in the first computing node to select neighboring processing units or processes, conditions of a network interface controller in the first computing node to send or receive data, conditions of a network interface controller in the first computing node to route data to/from a network interface controller in the second computing node, and so on.
In an implementation, the priorities with which a processing unit or process in a first computing node selects a neighboring processing unit or process may include, in descending order of priority: selecting a processing unit or process in the first computing node and using inter-process communication (if applicable); selecting a processing unit or process in a second computing node connected to the same chip switch as the chip switch to which the first computing node is connected; and selecting a processing unit or process in a third computing node connected to a different chip switch than the chip switch to which the first computing node is connected, wherein the first computing node is different from the second computing node and the third computing node.
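The descending-priority neighbor selection described above might be sketched as follows. The Proc record and its fields are illustrative assumptions introduced for this sketch, not the disclosed implementation, and the other rules (network interface controller conditions, etc.) are not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    proc_id: int
    node_id: int
    switch_id: int   # chip switch to which the node's NIC is linked

def pick_next_neighbor(current, candidates):
    """Pick the next ring neighbor in descending priority:
    same compute node > same chip switch > different chip switch."""
    def priority(p):
        if p.node_id == current.node_id:
            return 0          # intra-node: use inter-process communication
        if p.switch_id == current.switch_id:
            return 1          # inter-node under the same chip switch
        return 2              # inter-node across chip switches (via aggregation)
    return min(candidates, key=priority)

# Example: the peer in the same node wins over peers behind other switches.
current = Proc(0, node_id=0, switch_id=0)
candidates = [Proc(1, 0, 0), Proc(2, 1, 0), Proc(3, 2, 1)]
print(pick_next_neighbor(current, candidates).proc_id)  # -> 1
```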
In an implementation, the condition under which the network interface controller in the first computing node sends or receives data may include, for example, the network interface controller being capable of sending data only to the network interface controller in the second computing node, and/or the network interface controller being capable of receiving data only from the network interface controller in a third computing node, wherein the first computing node is different from the second computing node and the third computing node, and the second computing node may be the same as or different from the third computing node.
In an implementation, the conditions under which the network interface controller in the first computing node routes data to/from the network interface controller in the second computing node may include, for example, routing data sent by processing units or processes belonging to the plurality of rings to the network interface controller in the second computing node if the data is sent through the network interface controller in the first computing node. In an implementation, the condition for the network interface controller in the first computing node to route data to/from the network interface controller in the second computing node may further include receiving data through the network interface controller in the first computing node if the data is sent by a processing unit or process belonging to the plurality of rings through the network interface controller in the second computing node.
In an implementation, the route management of the first congestion avoidance method may assign a network interface controller (NIC) identifier to each network interface controller connected or linked to the same chip switch. The route management of the first congestion avoidance method may also assign an aggregation identifier to each aggregation switch in the communication network 206. For a processing unit or process in a ring, route management may determine a route identifier for routing packets from the processing unit or process.
For example, if the network interface controller of a processing unit or process and the network interface controller of the next processing unit or process in the ring are located in the same compute node or are directly connected or linked to the same chip switch, the route identifier may be determined to be a default value or default identifier. The default route identifier indicates that the data is either routed within the compute node or through the chip switch, but not through any aggregation switch in the communication network. Otherwise, the route identifier may be determined to be equal to the NIC identifier of the processing unit or process, or another predefined value. The aggregation identifier may then be determined from the determined route identifier based on the mapping relationship between route identifiers and aggregation identifiers. In an implementation, the mapping between route identifiers and aggregation identifiers may be predetermined using a probe-based routing mechanism (e.g., sending probe packets between computing nodes as described in the foregoing description).
In other words, data flows between processing units (or processes) included in the same compute node, or whose network interface controllers are linked to the same chip switch, will not pass through any aggregation switch in the communication network. On the other hand, data flows between processing units (or processes) included in different compute nodes and whose network interface controllers are linked to different chip switches will pass through a designated aggregation switch based on the predetermined mapping relationship, thereby implementing routing control and management of the data flows and distributing the data flows to different aggregation switches to avoid network congestion.
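A minimal sketch of the route-identifier determination and its mapping to an aggregation switch, assuming each ring member is described by node, chip switch, and NIC identifiers. The names and the sample mapping are illustrative assumptions only.

```python
from collections import namedtuple

RingMember = namedtuple("RingMember", "node_id switch_id nic_id")

DEFAULT_ROUTE = -1  # packet stays inside the node or its chip switch

def route_identifier(member, successor):
    """Route id for the flow from a ring member to its successor in the ring."""
    if member.node_id == successor.node_id or member.switch_id == successor.switch_id:
        return DEFAULT_ROUTE
    return member.nic_id  # steer the flow by the sender's NIC identifier

def aggregation_switch(route_id, route_to_agg):
    """Resolve a non-default route id to the pre-probed aggregation switch."""
    return None if route_id == DEFAULT_ROUTE else route_to_agg[route_id]

# Example: two members under different chip switches use the mapped aggregation switch.
a = RingMember(node_id=0, switch_id=0, nic_id=2)
b = RingMember(node_id=4, switch_id=1, nic_id=2)
print(aggregation_switch(route_identifier(a, b), route_to_agg={2: "agg-2"}))  # agg-2
```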
Fig. 13 shows an example scenario using the first congestion avoidance method. In this example, four inter-node rings (or ring configurations, R0, R1, R2, and R3) containing eight compute nodes (node 0, node 1, …, node 7) are generated, and each ring uses a different aggregation switch to send and receive data (e.g., during the inter-node global reduction phase 504). Therefore, there is no conflict among these four rings. In addition, each chip switch of any ring has only one data flow entering and one data flow leaving, thereby avoiding the occurrence of network congestion.
In implementation, as described above, the second congestion avoidance method may include a policy on reordering of node identities, and subsequent route management of network flows. In implementation, to minimize communication costs, the second congestion avoidance method may reorder identifiers of the compute nodes and the processing units (or processes) according to a network topology that connects the compute nodes and the processing units (or processes) based on a plurality of rules.
In an implementation, the plurality of rules may include grouping the compute nodes, for example, by their respective chip switches. For example, compute nodes connected to the same chip switch (e.g., compute nodes having network interface controllers linked to the same chip switch) are grouped into a group, and each compute node is assigned a node identifier. Since these compute nodes are connected to the same chip switch, they are (physically) adjacent to each other.
In implementations, the plurality of rules may also include assigning a rank identifier (also referred to as an ordering identifier or ordering number) to each processing unit or process in a compute node using the same ordering sequence. For example, the k processing units (or processes) in a first computing node may be assigned the rank identifiers 0, 1, …, k-1, while the k processing units (or processes) in a second computing node may be assigned the rank identifiers k, k+1, …, 2k-1, and so on for the other computing nodes. The processing units (or processes) in a compute node may be ordered according to the respective network interface controllers they use, and processing units (or processes) using the same network interface controller are (physically) adjacent to each other.
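The node-identifier grouping and rank reordering described above might be sketched as follows, assuming nodes are presented grouped by chip switch in a dictionary; the function and variable names are illustrative assumptions.

```python
def reorder_ranks(nodes_by_switch, procs_per_node):
    """Assign node ids grouped by chip switch, then give the k processes of
    node n the ranks n*k, n*k+1, ..., n*k+k-1 in the same per-node order."""
    node_ids, ranks = {}, {}
    next_node = 0
    for switch, nodes in sorted(nodes_by_switch.items()):
        for node in nodes:                       # nodes under one switch get adjacent ids
            node_ids[node] = next_node
            for local in range(procs_per_node):  # same ordering sequence in every node
                ranks[(node, local)] = next_node * procs_per_node + local
            next_node += 1
    return node_ids, ranks

node_ids, ranks = reorder_ranks({"sw0": ["nodeA", "nodeB"], "sw1": ["nodeC"]},
                                procs_per_node=4)
print(ranks[("nodeB", 1)])  # -> 5, i.e. the second process of the second node
```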
In this case, in the first log2(L) steps, data flows between processing units (or processes) may be constrained to pass through the corresponding chip switch, which provides lower latency than an aggregation switch and therefore does not create network congestion. In an implementation, for the node-aware halving-doubling algorithm described above, L is the number of compute nodes per chip switch. In implementation, for a conventional halving and doubling algorithm, L is the product of the number of compute nodes per chip switch and the number of processing units (or processes) per compute node.
In an implementation, route management of the second congestion avoidance method may include determining an aggregate identifier of a data flow or packet sent from a first processing unit (or process) having a first ordering identifier in a first compute node having a first node identifier to a second processing unit (or process) having a second ordering identifier in a second compute node having a second node identifier, where the first compute node may be the same as or different from the second compute node.
In an implementation, the aggregation identifier may be determined based at least in part on at least some of the rank identifier, the node identifier, the number of network interface controllers per compute node, and the maximum number of compute nodes at each chip switch. By way of example and not limitation, the aggregation identifier may be determined as: (the rank identifier of the first processing unit (or process) sending the data flow or packet) + (the node identifier of the first compute node having the first processing unit (or process) % the maximum number of compute nodes at each chip switch) × (the number of network interface controllers per compute node), where % represents the modulo operator. Other methods of computing the aggregation identifier are also applicable, as long as consistent results are obtained. For example, the aggregation identifier may be determined based on a preset mapping relationship between the aggregation identifier and a combination of the rank identifier and the node identifier, and the like.
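The example formula above can be transcribed directly as follows; the parameter names and the sample values (matching the Fig. 14 setting of 4 network interface controllers per node and at most 2 nodes per chip switch) are illustrative assumptions.

```python
def aggregation_identifier(sender_rank, sender_node_id,
                           nics_per_node, max_nodes_per_switch):
    """Aggregation id = sender rank
    + (sender node id % max nodes per chip switch) * NICs per compute node."""
    return sender_rank + (sender_node_id % max_nodes_per_switch) * nics_per_node

# Example with 4 NICs per node and at most 2 nodes per chip switch.
print(aggregation_identifier(sender_rank=3, sender_node_id=1,
                             nics_per_node=4, max_nodes_per_switch=2))  # -> 7
```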
In an implementation, route management of the second congestion avoidance method may include pre-assigning an aggregation identifier to each aggregation switch in the communication network 206 associated with the distributed training system 102. If the first processing unit (or process) and the second processing unit (or process) are linked to or under the same chip switch (e.g., through their respective network interface controllers), the data flow or data packet will pass through that chip switch without passing through any aggregation switch in the communication network 206. If the first processing unit (or process) and the second processing unit (or process) are not linked to or under the same chip switch, the data flow or packet sent by the first processing unit (or process) to the second processing unit (or process) will pass through the aggregation switch with the determined aggregation identifier.
Fig. 14 illustrates an example scenario using the second congestion avoidance method. In this example, all compute nodes include the same number of processing units (or processes) and the same number of network interface controllers, with each network interface controller associated with the same number of processing units (or processes). In addition, the number of network interface controllers linked to each chip switch is less than the number of aggregation switches in the network. In this example, the number of network interface controllers per compute node is 4, and the number of compute nodes at each chip switch is at most 2. In an implementation, for the node-aware halving-doubling algorithm, the number of compute nodes under the same chip switch may be a power of 2 and the number of network interface controllers included in the compute nodes under the same chip switch may be a power of 2, and for a conventional halving and doubling algorithm, the number of processing units (or processes) using the same network interface controller may be a power of 2.
In this example, during the inter-node global reduction phase of the node-aware halving-doubling algorithm, the processing units (or processes) of the compute nodes node 0, node 2, node 4, and node 6 will use the aggregation switches with aggregation identifiers (e.g., A1, A2, A3, and A4), and the processing units (or processes) of the compute nodes node 1, node 3, node 5, and node 7 will use the aggregation switches with aggregation identifiers (e.g., A5, A6, A7, and A8). Accordingly, there are no conflicts in the data flows between compute nodes, thereby avoiding network congestion at any aggregation switch in the network.
In implementation, at each step of the inter-node global reduction phase, a processing unit (or process) may be in data communication with a new processing unit (or process). In an implementation, synchronization may be performed to ensure that the data flow performed at the current step by a processing unit (or process) using a network interface controller does not overlap with the data flow performed at the previous step by a neighboring processing unit (or process) using the same network interface controller, thereby avoiding the occurrence of micro-burst (incast) congestion.
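A minimal sketch of such per-step synchronization, using a barrier shared by the processing units (or processes) that share a network interface controller; the thread-based simulation and names are illustrative assumptions rather than the disclosed mechanism.

```python
import threading

NUM_PEERS_PER_NIC = 2
step_barrier = threading.Barrier(NUM_PEERS_PER_NIC)

def run_steps(rank, num_steps, send_step):
    """Peers sharing a NIC wait at a barrier between steps, so a flow of the
    current step cannot overlap a neighbor's flow of the previous step on the
    same NIC (avoiding incast congestion)."""
    for step in range(num_steps):
        send_step(rank, step)   # exchange data with this step's partner
        step_barrier.wait()     # synchronize before the next step begins

def fake_send(rank, step):      # stand-in for the real exchange
    print(f"rank {rank} finished step {step}")

threads = [threading.Thread(target=run_steps, args=(r, 3, fake_send))
           for r in range(NUM_PEERS_PER_NIC)]
for t in threads: t.start()
for t in threads: t.join()
```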
Exemplary method
FIG. 15 shows a schematic diagram of an exemplary topology aware multi-stage approach. Fig. 16 shows a schematic diagram of a first exemplary network congestion avoidance method. Fig. 17 shows a schematic diagram of a second exemplary network congestion avoidance method. FIG. 18 shows an exemplary parallel approach based on a hybrid architecture in distributed training. The methods of fig. 15-18 may be, but are not necessarily, implemented in the environment illustrated in fig. 1 with the aid of the methods and scenarios illustrated in fig. 3-14 using the computing node illustrated in fig. 2. For ease of description, methods 1500-1800 are described with reference to FIGS. 1-14. However, the methods 1500-1800 may alternatively be practiced in other environments and/or with other systems.
Methods 1500-1800 are described in the general context of machine-executable instructions. Generally, machine-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Further, each example method is represented as a collection of blocks in a logical flow graph, which represents a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternate method. Moreover, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent Application Specific Integrated Circuits (ASICs) or other physical components that perform the described operations.
Referring to fig. 15, in block 1502, a first computing node (e.g., computing node 104) may perform a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm.
In an implementation, prior to performing the reduction scatter sub-operation, the first computing node may select a first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between a first set of processing units in the first computing node. In an implementation, the first cluster communication algorithm may include, but is not limited to, a ring-based algorithm or a halving and doubling algorithm.
In an implementation, performing a reduction scatter sub-operation between a first set of processing units in a first compute node according to a first cluster communication algorithm may include dividing data into a plurality of data blocks; assigning a plurality of data blocks to a first set of processing units; receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
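For illustration, the following self-contained sketch simulates a ring-based reduce-scatter among the processing units of one node, using in-process queues in place of real intra-node connections. It is a generic ring reduce-scatter under stated assumptions (data length divisible by the number of units), not necessarily the algorithm selected in a given deployment.

```python
import threading, queue

def ring_reduce_scatter(rank, world, local_data, links):
    """Ring reduce-scatter among `world` processing units; `links[i]` is the
    inbound queue of unit i. Data length must be divisible by `world`."""
    n = len(local_data) // world
    chunks = [local_data[i * n:(i + 1) * n] for i in range(world)]
    for step in range(world - 1):
        send_idx = (rank - step) % world
        recv_idx = (rank - step - 1) % world
        links[(rank + 1) % world].put(chunks[send_idx])   # forward a block to the successor
        incoming = links[rank].get()                      # block arriving from the predecessor
        chunks[recv_idx] = [a + b for a, b in zip(chunks[recv_idx], incoming)]  # reduce
    owned = (rank + 1) % world
    return owned, chunks[owned]   # this unit now holds one fully reduced block

if __name__ == "__main__":
    world = 4
    links = [queue.Queue() for _ in range(world)]
    results = {}
    def worker(r):
        results[r] = ring_reduce_scatter(r, world, [float(r)] * 8, links)
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(results)  # every owned block sums the contributions 0+1+2+3 = 6.0
```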
In block 1504, the first computing node may perform a global reduction sub-operation between a first set of processing units in the first computing node and a second set of processing units in the second computing node according to a second cluster communication algorithm.
In an implementation, prior to performing the global reduction sub-operation, the first computing node may select the second cluster communication algorithm based at least in part on a type or bandwidth of an inter-node connection between the first computing node and the other computing nodes, and/or a connection topology of the first computing node and the other computing nodes. In an implementation, the second cluster communication algorithm may include, but is not limited to, a ring-based algorithm, or a halving and doubling algorithm (such as a node-aware halving-doubling algorithm), or the like.
In an implementation, performing the global reduction sub-operation between the first set of processing units in the first compute node and the second set of processing units in the second compute node according to the second cluster communication algorithm may include: the first set of processing units receiving portions of a reduction scatter result obtained by the second set of processing units in the second compute node according to the second cluster communication algorithm, each processing unit of the first set of processing units being grouped with a respective processing unit of the second set of processing units and receiving a respective portion of the reduction scatter result from the respective processing unit; and the first set of processing units performing reduction on the received portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
In block 1506, the first computing node may perform a global aggregation sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm.
In an implementation, performing a global aggregation sub-operation between a first set of processing units in a first compute node according to a first cluster communication algorithm may include: receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
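The three sub-operations of blocks 1502, 1504, and 1506 compose as sketched below; the three callables stand for whichever first and second cluster communication algorithms are selected, and the names (and the trivial stand-ins used to exercise the composition) are illustrative assumptions.

```python
def hierarchical_allreduce(local_block, intra_group, inter_group,
                           reduce_scatter, all_reduce, all_gather):
    """Compose the three phases of Fig. 15: an intra-node reduce-scatter
    (block 1502), a per-shard inter-node all-reduce between grouped peers
    (block 1504), and an intra-node all-gather (block 1506)."""
    shard = reduce_scatter(local_block, intra_group)   # each unit keeps one reduced shard
    shard = all_reduce(shard, inter_group)             # shard reduced across compute nodes
    return all_gather(shard, intra_group)              # shards recombined inside the node

# Trivial single-process stand-ins, only to exercise the composition order.
print(hierarchical_allreduce([1, 2, 3], "intra", "inter",
                             lambda d, g: sum(d),     # fake reduce-scatter
                             lambda s, g: s,          # fake inter-node all-reduce
                             lambda s, g: [s]))       # fake all-gather
```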
Referring to fig. 16, in block 1602, a first computing node (e.g., computing node 104) or a first process may determine a routing identifier to route data from the first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in the same computing node or are linked to the same switch.
In an implementation, the first process and the second process may belong to a particular inter-node ring connecting a plurality of different nodes under a particular network topology. By way of example and not limitation, the particular network topology may include a fat tree topology.
In an implementation, a network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.
In an implementation, the network interface controller associated with the first process is also associated with one or more processes, wherein all data sent from the first process and the one or more processes is sent through the network interface controller.
In an implementation, the route identifier may be set or determined to be a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same switch.
In an implementation, in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different chip switches, the routing identifier may be set or determined to be equal to the identifier of the network interface controller associated with the first process.
In block 1604, the first computing node or first process may route data from the first process to the second process according to the route identifier.
In an implementation, routing data from a first process to a second process according to a routing identifier may include routing data from the first process to the second process through at least a slice switch connected with a network interface controller associated with the first process and an aggregation switch having an identifier with a correspondence to the identifier of the network interface controller.
Referring to fig. 17, in block 1702, a first computing node (e.g., computing node 104) or a first process may determine an aggregation identifier for sending a data packet from the first process to a second process according to a node-aware halving doubling algorithm, the first process and the second process belonging to different nodes connected to different slice switches under a particular network topology.
In an implementation, a first compute node may assign different aggregation identifiers to data packets directed to compute nodes connected to different slice switches to enable routing of the data packets through the different aggregation switches to the nodes connected to the different slice switches.
In an implementation, the first computing node may allocate a source port and a destination port corresponding to an aggregation switch associated with an aggregation identifier based at least in part on a predetermined correspondence. In an implementation, the correspondence may record a relationship between aggregation identifiers of the plurality of aggregation switches and corresponding pairs of source and destination ports. In an implementation, the particular network topology may include a fat tree topology.
In block 1704, the first computing node may send a data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.
In an implementation, the first computing node may further send each data packet from a first set of processes included in the first computing node to a second set of processes included in the second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet.
In an implementation, the first computing node may also receive, by a first set of processes included in the first computing node, the data packets from a second set of processes included in the second computing node via a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to the data packets.
Referring to FIG. 18, in block 1802, a first compute node (e.g., compute node 104) or a processing unit may divide a data block allocated to the processing unit into a plurality of data segments including at least a first data segment and a second data segment.
In block 1804, the first compute node or processing unit may assign the plurality of data segments to a plurality of threads, the plurality of threads including at least a first thread and a second thread.
In block 1806, the first compute node or processing unit may perform an intra-node sub-operation on a portion of the first data segment using a first thread and perform an inter-node sub-operation on a portion of the second data segment using a second thread in parallel.
In an implementation, performing an intra-node sub-operation on a portion of a first data segment using a first thread may include transferring the portion of the first data segment between a processing unit included in the first compute node and another processing unit via an intra-node connection.
In an implementation, performing an inter-node sub-operation on a portion of a second segment of data using a second thread may include transferring the portion of the second segment of data between a processing unit and another processing unit included in a second computing node different from the first computing node via an inter-node connection.
In an implementation, the intra-node sub-operation may include a reduction scatter sub-operation or a global gather sub-operation performed within a first compute node, and the inter-node sub-operation may include a global reduction sub-operation performed between the first compute node and a second compute node different from the first compute node.
In an implementation, the intra-node sub-operations may include global aggregation sub-operations or replication sub-operations performed within a first computing node, and the inter-node sub-operations may include global aggregation sub-operations performed between the first computing node and a second computing node different from the first computing node.
In an implementation, the first compute node or processing unit may perform another inter-node sub-operation on a portion of the first segment of data using a first thread and perform another intra-node sub-operation on a portion of the second segment of data using a second thread in parallel.
In an implementation, the intra-node sub-operation is performed on the portion of the first segment of data using a first thread and the inter-node sub-operation is performed on the portion of the second segment of data using a second thread in parallel, such that a portion of the first segment of data is transferred to another processing unit included in a first compute node using an intra-node connection and a portion of the second segment of data is concurrently transferred to another processing unit included in a second compute node different from the first compute node using an inter-node connection.
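A minimal sketch of this two-thread overlap, with placeholder callables standing in for the intra-node and inter-node sub-operations; the names are illustrative assumptions, and in the next phase the roles of the two segments would be swapped as described above.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_phase(first_segment, second_segment, intra_op, inter_op):
    """Run an intra-node sub-operation on one segment and an inter-node
    sub-operation on the other segment in parallel, so the intra-node and
    inter-node connections are busy at the same time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        intra_future = pool.submit(intra_op, first_segment)   # first thread: intra-node
        inter_future = pool.submit(inter_op, second_segment)  # second thread: inter-node
        return intra_future.result(), inter_future.result()

# Trivial stand-ins for the real sub-operations, just to exercise the overlap.
print(pipelined_phase([1, 2], [3, 4],
                      lambda s: ("intra", s), lambda s: ("inter", s)))
```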
Although the method blocks described above are performed in a certain order, in some implementations, some or all of the method blocks may be performed in other orders or in parallel.
To summarize
Although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure may be further understood with the following clauses:
clause 1: a method implemented by a first computing node, the method comprising: performing a reduction scatter sub-operation between a first set of processing units in a first compute node according to a first cluster communication algorithm; performing a global reduction sub-operation between a first set of processing units in a first compute node and a second set of processing units in a second compute node according to a second cluster communication algorithm; and performing a global aggregation sub-operation between a first set of processing units in the first compute node according to a first cluster communication algorithm.
Clause 2: the method of clause 1, further comprising: the first cluster communication algorithm is selected based at least in part on a type or bandwidth of an intra-node connection between a first set of processing units in the first computing node.
Clause 3: the method of clause 1, further comprising: the second cluster communication algorithm is selected based at least in part on a type or bandwidth of an inter-node connection between the first computing node and the other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 4: the method of clause 1, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
Clause 5: the method of clause 1, wherein performing a reduction scatter sub-operation between a first set of processing units in a first computing node according to a first cluster communication algorithm comprises: dividing data into a plurality of data blocks; assigning a plurality of data blocks to a first set of processing units; receiving, at a first processing unit of a first set of processing units, a data block from a second processing unit of the first set of processing units according to a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 6: the method of clause 1, wherein performing a global reduction sub-operation between a first set of processing units in a first compute node and a second set of processing units in a second compute node according to a second clustering algorithm comprises: the first set of processing units receiving portions of a reduction scatter result obtained by a second set of processing units in the second compute node according to a second clustering algorithm, wherein each processing unit of the first set of processing units is grouped with a respective processing unit of the second set of processing units and receives a respective portion of the reduction scatter result from the respective processing unit; and the first set of processing units performing reduction on portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
Clause 7: the method of clause 1, wherein performing a global aggregation sub-operation between a first set of processing units in a first computing node according to a first cluster communication algorithm comprises: receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 8: one or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: performing a reduction scatter sub-operation between a first set of processing units in a first compute node according to a first cluster communication algorithm; performing a global reduction sub-operation between a first set of processing units in a first compute node and a second set of processing units in a second compute node according to a second cluster communication algorithm; and performing a global aggregation sub-operation between a first set of processing units in the first compute node according to a first cluster communication algorithm.
Clause 9: the one or more machine-readable media of clause 8, the acts further comprising: a first cluster communication algorithm is selected based at least in part on a type or bandwidth of an intra-node connection between a first set of processing units in the first computing node.
Clause 10: the one or more machine-readable media of clause 8, the acts further comprising: the second cluster communication algorithm is selected based at least in part on a type or bandwidth of an internode connection between the first computing node and the other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 11: the one or more machine-readable media of clause 8, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
Clause 12: the one or more machine-readable media of clause 8, wherein performing the reduction scatter sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm comprises: dividing data into a plurality of data blocks; assigning a plurality of data blocks to a first set of processing units; receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 13: the one or more machine-readable media of clause 8, wherein performing the global reduction sub-operation between the first set of processing units in the first compute node and the second set of processing units in the second compute node according to the second clustering algorithm comprises: the first set of processing units receiving portions of a reduction scatter result obtained by a second set of processing units in the second compute node according to a second clustering algorithm, each processing unit of the first set of processing units being grouped with a respective processing unit of the second set of processing units and receiving a respective portion of the reduction scatter result from the respective processing unit; and the first set of processing units performing reduction on portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
Clause 14: the one or more machine-readable media of clause 8, wherein performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm comprises: receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 15: a first computing node comprising: a first set of processing units; a memory storing machine-executable instructions that, when executed by a first set of processing units, cause the first set of processing units to perform actions comprising: performing a reduction scatter sub-operation between a first set of processing units in a first compute node according to a first cluster communication algorithm; performing a global reduction sub-operation between a first set of processing units in a first compute node and a second set of processing units in a second compute node according to a second cluster communication algorithm; and performing a global aggregation sub-operation between a first set of processing units in the first compute node according to a first cluster communication algorithm.
Clause 16: the first computing node of clause 15, the actions further comprising: selecting a first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between a first set of processing units of the first computing node; and selecting the second cluster communication algorithm based at least in part on a type or bandwidth of an inter-node connection between the first computing node and the other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 17: the first computing node of clause 15, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
Clause 18: the first computing node of clause 15, wherein performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises: dividing data into a plurality of data blocks; assigning a plurality of data blocks to a first set of processing units; receiving, at a first processing unit of a first set of processing units, a data block from a second processing unit of the first set of processing units according to a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 19: the first computing node of clause 15, wherein performing a global reduction sub-operation between a first set of processing units in the first computing node and a second set of processing units in the second computing node according to a second clustering algorithm comprises: the first set of processing units receiving portions of a reduction scatter result obtained by a second set of processing units in the second compute node according to a second clustering algorithm, each processing unit of the first set of processing units being grouped with a respective processing unit of the second set of processing units and receiving a respective portion of the reduction scatter result from the respective processing unit; and the first set of processing units performing reduction on portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
Clause 20: the first computing node of clause 15, wherein performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm comprises: receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with a first cluster communication algorithm; and reducing the received data block with the local data block at the first processing unit.
Clause 21: a method implemented by a first computing node, the method comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same switch, the first process and the second process belonging to a particular inter-node ring connecting a plurality of different nodes under a particular network topology; and routing data from the first process to the second process according to the routing identifier.
Clause 22: the method of clause 21, wherein the network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.
Clause 23: the method of clause 21, wherein the network interface controller associated with the first process is further associated with one or more processes, wherein all data sent from the first process and the one or more processes is sent through the network interface controller.
Clause 24: the method of clause 21, wherein the particular network topology comprises a fat tree topology.
Clause 25: the method of clause 21, further comprising: the routing identifier is set to a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same compute node or are linked to the same switch.
Clause 26: the method of clause 21, further comprising: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located at different computing nodes or are linked to different slice switches, the route identifier is set equal to the identifier of the network interface controller associated with the first process.
Clause 27: the method of clause 26, wherein routing data from the first process to the second process according to the routing identifier comprises: data is routed from a first process to a second process at least through a slice switch connected to a network interface controller associated with the first process and an aggregation switch having an identifier with a correspondence to the identifier of the network interface controller.
Clause 28: one or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same switch, the first process and the second process belonging to a particular inter-node ring connecting a plurality of different nodes under a particular network topology; and routing data from the first process to the second process according to the routing identifier.
Clause 29: the one or more machine-readable media of clause 28, wherein the network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.
Clause 30: the one or more machine-readable media of clause 28, wherein the network interface controller associated with the first process is further associated with one or more processes, wherein all data sent from the first process and the one or more processes is sent through the network interface controller.
Clause 31: the one or more machine-readable media of clause 28, wherein the particular network topology comprises a fat tree topology.
Clause 32: the one or more machine-readable media of clause 28, wherein the actions further comprise: the routing identifier is set to a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or linked to the same switch.
Clause 33: the one or more machine-readable media of clause 28, wherein the actions further comprise: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located at different computing nodes or are linked to different chip switches, the route identifier is set equal to the identifier of the network interface controller associated with the first process.
Clause 34: the one or more machine readable media of clause 33, wherein routing data from the first process to the second process according to the routing identifier comprises: data is routed from a first process to a second process at least through a slice switch connected to a network interface controller associated with the first process and an aggregation switch having an identifier with a correspondence to the identifier of the network interface controller.
Clause 35: a first computing node comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause the one or more processing units to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same switch, the first process and the second process belonging to a particular inter-node ring connecting a plurality of different nodes under a particular network topology; and routing data from the first process to the second process according to the routing identifier.
Clause 36: the first computing node of clause 35, wherein the network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.
Clause 37: the first computing node of clause 35, wherein the network interface controller associated with the first process is further associated with one or more processes, wherein all data sent from the first process and the one or more processes is sent through the network interface controller.
Clause 38: the first computing node of clause 35, wherein the actions further comprise: the routing identifier is set to a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or linked to the same switch.
Clause 39: the first computing node of clause 35, wherein the actions further comprise: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located at different computing nodes or are linked to different slice switches, the route identifier is set equal to the identifier of the network interface controller associated with the first process.
Clause 40: the first computing node of clause 39, wherein routing data from the first process to the second process according to the routing identifier comprises: data is routed from a first process to a second process at least through a slice switch connected to a network interface controller associated with the first process and an aggregation switch having an identifier with a correspondence to the identifier of the network interface controller.
Clause 41: a method implemented by a first computing node, the method comprising:
determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving and doubling algorithm, the first process and the second process belonging to different nodes connected to different slice switches under a particular network topology; and sending the data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.
Clause 42: the method of clause 41, further comprising: data packets directed to nodes connected to different slice switches are assigned different aggregation identifiers to enable routing of the data packets through the different aggregation switches to the nodes connected to the different slice switches.
Clause 43: the method of clause 41, further comprising: the source and destination ports corresponding to the aggregation switch associated with the aggregation identifier are assigned based at least in part on the predetermined correspondence.
Clause 44: the method of clause 43, wherein the correspondence records a relationship between aggregation identifiers of the plurality of aggregation switches and corresponding pairs of source and destination ports.
Clause 45: the method of clause 41, wherein the particular network topology comprises a fat tree topology.
Clause 46: the method of clause 41, further comprising: each data packet is sent from a first set of processes included in a first computing node to a second set of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet.
Clause 47: the method of clause 41, further comprising: a first set of processes included by a first computing node receives data packets from a second set of processes included by a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to the data packets.
Clause 48: one or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving and doubling algorithm, the first process and the second process belonging to different nodes connected to different slice switches under a particular network topology; and sending the data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.
Clause 49: the one or more machine-readable media of clause 48, wherein the actions further comprise: data packets directed to nodes connected to different slice switches are assigned different aggregation identifiers to enable routing of the data packets through the different aggregation switches to the nodes connected to the different slice switches.
Clause 50: the one or more machine-readable media of clause 48, wherein the actions further comprise: the source and destination ports corresponding to the aggregation switch associated with the aggregation identifier are assigned based at least in part on the predetermined correspondence.
Clause 51: the one or more machine readable media of clause 50, wherein the correspondence records a relationship between the aggregation identifiers of the plurality of aggregation switches and the corresponding source and destination port pairs.
Clause 52: the one or more machine-readable media of clause 48, wherein the particular network topology comprises a fat tree topology.
Clause 53: the one or more machine-readable media of clause 48, wherein the actions further comprise: each data packet is sent from a first set of processes included in a first computing node to a second set of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet.
Clause 54: the one or more machine-readable media of clause 48, wherein the actions further comprise: a first set of processes included by a first computing node receives data packets from a second set of processes included by a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to the data packets.
Clause 55: a first computing node, comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause the one or more processing units to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving and doubling algorithm, the first process and the second process belonging to different nodes connected to different slice switches under a particular network topology; and sending the data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.
Clause 56: the first computing node of clause 55, wherein the actions further comprise: data packets directed to nodes connected to different slice switches are assigned different aggregation identifiers to enable routing of the data packets through the different aggregation switches to the nodes connected to the different slice switches.
Clause 57: the first computing node of clause 55, wherein the actions further comprise: the source and destination ports corresponding to the aggregation switch associated with the aggregation identifier are assigned based at least in part on the predetermined correspondence.
Clause 58: the first computing node of clause 57, wherein the correspondence records a relationship between the aggregation identifiers of the plurality of aggregation switches and the corresponding source port and destination port pairs.
Clause 59: the first computing node of clause 55, wherein the actions further comprise: each data packet is sent from a first set of processes included in a first computing node to a second set of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet.
Clause 60: the first computing node of clause 55, wherein the actions further comprise: a first set of processes included by a first computing node receives data packets from a second set of processes included by a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to the data packets.
Clause 61: a method implemented by a first computing node, the method comprising: dividing a data packet allocated to a processing unit into a plurality of data segments including at least a first data segment and a second data segment; allocating a plurality of data segments to a plurality of threads, the plurality of threads including at least a first thread and a second thread; an intra-node sub-operation is performed on a portion of the first segment of data using a first thread, and an inter-node sub-operation is performed on a portion of the second segment of data using a second thread in parallel.
Clause 62: the method of clause 61, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises: the portion of the first data segment is transmitted between a processing unit included in the first computing node and another processing unit through intra-node connections.
Clause 63: the method of clause 61, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises: the portion of the second data segment is transmitted between the processing unit and another processing unit included in a second computing node different from the first computing node over the inter-node connection.
Clause 64: the method of clause 61, wherein the intra-node sub-operation comprises a reduction scatter sub-operation or a global aggregation sub-operation performed within the first computing node, and the inter-node sub-operation comprises a global reduction sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 65: the method of clause 61, wherein the intra-node sub-operation comprises a global aggregation sub-operation or a replication sub-operation performed within a first computing node, and the inter-node sub-operation comprises a global aggregation sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 66: the method of clause 61, further comprising: performing, in parallel, another inter-node sub-operation on a portion of the first data segment using the first thread and another intra-node sub-operation on a portion of the second data segment using the second thread.
Clause 67: the method of clause 61, wherein the intra-node sub-operation is performed on the portion of the first data segment using a first thread and the inter-node sub-operation is performed on the portion of the second data segment using a second thread in parallel, such that the intra-node connection is used to transmit the portion of the first data segment to another processing unit included in the first computing node and the inter-node connection is used concurrently to transmit the portion of the second data segment to another processing unit included in a second computing node different from the first computing node.
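Clauses 61 through 67 describe overlapping an intra-node sub-operation on one data segment with an inter-node sub-operation on another segment. A toy Python sketch of that overlap, using two threads and placeholder transfer functions (send_intra_node and send_inter_node are stand-ins, not the disclosed primitives), is:

    import threading

    # Illustrative sketch only: the two send functions are placeholders for
    # the real intra-node (e.g. NVLink/PCIe) and inter-node (network) transfers.
    def send_intra_node(segment):
        pass    # transfer to a peer processing unit in the same node

    def send_inter_node(segment):
        pass    # transfer to a peer processing unit in another node

    def overlapped_step(first_segment, second_segment):
        t_intra = threading.Thread(target=send_intra_node, args=(first_segment,))
        t_inter = threading.Thread(target=send_inter_node, args=(second_segment,))
        t_intra.start()
        t_inter.start()
        t_intra.join()
        t_inter.join()

    overlapped_step(b"segment-0", b"segment-1")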
Clause 68: one or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: dividing a data packet allocated to a processing unit into a plurality of data segments, the plurality of data segments including at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads including at least a first thread and a second thread; and performing, in parallel, an intra-node sub-operation on a portion of the first data segment using the first thread and an inter-node sub-operation on a portion of the second data segment using the second thread.
Clause 69: the one or more machine readable media of clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transferring the portion of the first data segment between another processing unit included in the first compute node and the processing unit through an intra-node connection.
Clause 70: the one or more machine readable media of clause 68, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises: the portion of the second data segment is transmitted between the processing unit and another processing unit included in a second computing node different from the first computing node over the inter-node connection.
Clause 71: the one or more machine-readable media of clause 68, wherein the intra-node sub-operation comprises a reduction scatter sub-operation or a global aggregation sub-operation performed within the first computing node, and the inter-node sub-operation comprises a global reduction sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 72: the one or more machine-readable media of clause 68, wherein the intra-node sub-operation comprises a global aggregation sub-operation or a replication sub-operation performed within a first computing node, and the inter-node sub-operation comprises a global aggregation sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 73: the one or more machine-readable media of clause 68, the acts further comprising: another inter-node sub-operation is performed on the portion of the first data segment using the first thread and another intra-node sub-operation is performed on the portion of the second data segment using the second thread in parallel.
Clause 74: the one or more machine-readable media of clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread and performing the inter-node sub-operation on the portion of the second data segment using the second thread in parallel causes the portion of the first data segment to be transmitted to another processing unit included in the first computing node using an intra-node connection and, concurrently, the portion of the second data segment to be transmitted to another processing unit included in a second computing node different from the first computing node using an inter-node connection.
Clause 75: a first computing node, comprising: one or more processing units; a memory storing machine-executable instructions that, when executed by the one or more processing units, cause the one or more processing units to perform acts comprising: dividing a data packet allocated to a processing unit into a plurality of data segments including at least a first data segment and a second data segment; allocating the plurality of data segments to a plurality of threads, the plurality of threads including at least a first thread and a second thread; and performing, in parallel, an intra-node sub-operation on a portion of the first data segment using the first thread and an inter-node sub-operation on a portion of the second data segment using the second thread.
Clause 76: the first computing node of clause 75, wherein performing an intra-node sub-operation on a portion of a first data segment using a first thread comprises: the portion of the first data segment is transmitted between another processing unit included in the first computing node and the processing unit through an intra-node connection.
Clause 77: the first computing node of clause 75, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises: the portion of the second data segment is transmitted between the processing unit and another processing unit included in a second computing node different from the first computing node over the inter-node connection.
Clause 78: the first computing node of clause 75, wherein the intra-node sub-operation comprises a reduction scatter sub-operation or a global aggregation sub-operation performed within the first computing node, and the inter-node sub-operation comprises a global reduction sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 79: the first computing node of clause 75, wherein the intra-node sub-operation comprises a global aggregation sub-operation or a replication sub-operation performed within the first computing node, and the inter-node sub-operation comprises a global aggregation sub-operation performed between the first computing node and a second computing node different from the first computing node.
Clause 80: the first computing node of clause 75, wherein the intra-node sub-operation is performed on the portion of the first data segment using a first thread and the inter-node sub-operation is performed on the portion of the second data segment using a second thread in parallel, such that the intra-node connection is used to transmit the portion of the first data segment to another processing unit included in the first computing node and the inter-node connection is used concurrently to transmit the portion of the second data segment to another processing unit included in a second computing node different from the first computing node.

Claims (20)

1. A method implemented by a first computing node, comprising:
performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm;
performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm; and
performing a global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm.
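The three-phase structure of claim 1 can be pictured with a toy, in-memory simulation — an assumption-laden sketch, not the claimed implementation — with two nodes of two processing units each: phase 1 reduce-scatters within a node, phase 2 reduces between peer units of different nodes, and phase 3 gathers within a node.

    # Illustrative sketch only: a toy in-memory simulation of the three
    # phases; a real system would run phases 1 and 3 over intra-node links
    # and phase 2 over the network between nodes.
    def three_phase_all_reduce(data):
        # data[node][unit] is that processing unit's full input vector.
        n_nodes, n_units = len(data), len(data[0])
        dim = len(data[0][0])
        chunk = dim // n_units

        # Phase 1: intra-node reduce-scatter -- unit u keeps the node-local
        # sum of chunk u.
        partial = [[[sum(data[n][v][c] for v in range(n_units))
                     for c in range(u * chunk, (u + 1) * chunk)]
                    for u in range(n_units)]
                   for n in range(n_nodes)]

        # Phase 2: inter-node all-reduce between units with the same rank.
        for u in range(n_units):
            summed = [sum(partial[n][u][i] for n in range(n_nodes))
                      for i in range(chunk)]
            for n in range(n_nodes):
                partial[n][u] = summed

        # Phase 3: intra-node all-gather -- every unit obtains all chunks.
        return [[sum((partial[n][u] for u in range(n_units)), [])
                 for _ in range(n_units)]
                for n in range(n_nodes)]

    # 2 nodes x 2 units, 4-element vectors: every unit ends with [4, 8, 12, 16].
    inp = [[[1, 2, 3, 4], [1, 2, 3, 4]], [[1, 2, 3, 4], [1, 2, 3, 4]]]
    print(three_phase_all_reduce(inp))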
2. The method of claim 1, further comprising: selecting the first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units in the first computing node.
3. The method of claim 1, further comprising: selecting the second cluster communication algorithm based at least in part on a type or bandwidth of an inter-node connection between the first computing node and other computing nodes, and/or a connection topology of the first computing node and other computing nodes.
4. The method of claim 1, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
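As a hedged illustration of the power-of-two partner pattern behind a halving and doubling algorithm, the toy recursive-doubling all-reduce below exchanges with partner i XOR distance at each step; a bandwidth-optimal halving and doubling variant would additionally halve the exchanged vector during reduce-scatter and double it during all-gather, which this sketch omits.

    # Illustrative sketch only: recursive doubling over an in-memory list;
    # the number of units is assumed to be a power of two.
    def recursive_doubling_all_reduce(values):
        n = len(values)
        result = list(values)
        distance = 1
        while distance < n:
            # Unit i exchanges with partner i XOR distance; both keep the sum.
            result = [result[i] + result[i ^ distance] for i in range(n)]
            distance *= 2
        return result

    print(recursive_doubling_all_reduce([1, 2, 3, 4]))   # -> [10, 10, 10, 10]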
5. The method of claim 1, wherein performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
dividing data into a plurality of data blocks;
assigning the plurality of data blocks to the first set of processing units;
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
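The reduce-scatter of claim 5 can be illustrated with a toy ring-style sketch over in-memory lists (an illustration under assumed conventions, not the disclosed code): the data is split into one block per processing unit, and at each step a unit receives a partially reduced block from its ring predecessor and reduces it with its own copy.

    # Illustrative sketch only: per_unit_data[u][b] is unit u's local copy
    # of data block b; after n - 1 ring steps unit u owns the fully reduced
    # block (u + 1) % n.
    def ring_reduce_scatter(per_unit_data):
        n = len(per_unit_data)
        blocks = [[list(block) for block in unit] for unit in per_unit_data]
        for step in range(n - 1):
            received = []
            for u in range(n):
                src, b = (u - 1) % n, (u - step - 1) % n
                received.append([x + y for x, y in zip(blocks[src][b], blocks[u][b])])
            for u in range(n):
                blocks[u][(u - step - 1) % n] = received[u]
        return {u: blocks[u][(u + 1) % n] for u in range(n)}

    # Three units, each contributing [u + 1, u + 1] to every block, so each
    # fully reduced block is [6, 6].
    data = [[[u + 1, u + 1] for _ in range(3)] for u in range(3)]
    print(ring_reduce_scatter(data))   # -> {0: [6, 6], 1: [6, 6], 2: [6, 6]}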
6. The method of claim 1, wherein performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm comprises:
receiving, by the first set of processing units, portions of a reduction scatter result obtained by the second set of processing units in the second computing node according to the second cluster communication algorithm, wherein each processing unit of the first set of processing units is grouped with a respective processing unit of the second set of processing units and receives a respective portion of the reduction scatter result from the respective processing unit; and
reducing, by the first set of processing units, the received portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
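The inter-node phase of claim 6 pairs unit u of the first node with unit u of the second node, and each pair combines its two reduce-scatter portions. A minimal sketch, assuming two nodes and summation as the reduction operator, is:

    # Illustrative sketch only: two nodes assumed, summation as the reduction.
    def pairwise_inter_node_reduce(node_a_portions, node_b_portions):
        # node_x_portions[u] is the portion unit u obtained from the
        # intra-node reduce-scatter on its own node.
        reduced = [[x + y for x, y in zip(a, b)]
                   for a, b in zip(node_a_portions, node_b_portions)]
        # Both units of each pair hold the same reduced portion afterwards.
        return reduced, reduced

    a = [[1, 2], [3, 4]]       # two units on the first node
    b = [[10, 20], [30, 40]]   # their peer units on the second node
    print(pairwise_inter_node_reduce(a, b)[0])   # -> [[11, 22], [33, 44]]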
7. The method of claim 1, wherein performing a global aggregation sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
8. One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising:
performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm;
performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm; and
performing a global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm.
9. The one or more machine-readable media of claim 8, the acts further comprising: selecting the first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units in the first computing node.
10. The one or more machine-readable media of claim 8, the acts further comprising: selecting the second cluster communication algorithm based at least in part on a type or bandwidth of an inter-node connection between the first computing node and other computing nodes, and/or a connection topology of the first computing node and other computing nodes.
11. The one or more machine-readable media of claim 8, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
12. The one or more machine-readable media of claim 8, wherein performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
dividing data into a plurality of data blocks;
assigning the plurality of data blocks to the first set of processing units;
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
13. The one or more machine-readable media of claim 8, wherein performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm comprises:
receiving, by the first set of processing units, portions of a reduction scatter result obtained by the second set of processing units in the second computing node according to the second cluster communication algorithm, wherein each processing unit of the first set of processing units is grouped with a respective processing unit of the second set of processing units and receives a respective portion of the reduction scatter result from the respective processing unit; and
reducing, by the first set of processing units, the received portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
14. The one or more machine-readable media of claim 8, wherein performing a global aggregation sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
15. A first computing node comprising:
a first set of processing units;
a memory storing machine-executable instructions that, when executed by the first set of processing units, cause the first set of processing units to perform actions comprising:
performing a reduction scatter sub-operation between the first set of processing units in the first computing node according to a first cluster communication algorithm;
performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm; and
performing a global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm.
16. The first computing node of claim 15, the acts further comprising:
selecting the first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units in the first computing node; and
selecting the second cluster communication algorithm based at least in part on a type or bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and other computing nodes.
17. The first computing node of claim 15, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving and doubling algorithm.
18. The first computing node of claim 15, wherein performing a reduction scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
dividing data into a plurality of data blocks;
assigning the plurality of data blocks to the first set of processing units;
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
19. The first computing node of claim 15, wherein performing a global reduction sub-operation between the first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm comprises:
receiving, by the first set of processing units, portions of a reduction scatter result obtained by the second set of processing units in the second computing node according to the second cluster communication algorithm, wherein each processing unit of the first set of processing units is grouped with a respective processing unit of the second set of processing units and receives a respective portion of the reduction scatter result from the respective processing unit; and
reducing, by the first set of processing units, the received portions of the reduction scatter result with corresponding local portions of the reduction scatter result obtained after performing the reduction scatter sub-operation between the first set of processing units.
20. The first computing node of claim 15, wherein performing a global aggregation sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm comprises:
receiving, at a first processing unit of the first set of processing units, a data block from a second processing unit of the first set of processing units in accordance with the first cluster communication algorithm; and
reducing, at the first processing unit, the received data block with a local data block.
CN202080098259.8A 2020-03-31 2020-03-31 Topology aware multi-phase method for trunked communication Pending CN115380271A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/082512 WO2021195987A1 (en) 2020-03-31 2020-03-31 Topology aware multi-phase method for collective communication

Publications (1)

Publication Number Publication Date
CN115380271A true CN115380271A (en) 2022-11-22

Family

ID=77926925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080098259.8A Pending CN115380271A (en) 2020-03-31 2020-03-31 Topology aware multi-phase method for trunked communication

Country Status (2)

Country Link
CN (1) CN115380271A (en)
WO (1) WO2021195987A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070245122A1 (en) * 2006-04-13 2007-10-18 Archer Charles J Executing an Allgather Operation on a Parallel Computer
US8375197B2 (en) * 2008-05-21 2013-02-12 International Business Machines Corporation Performing an allreduce operation on a plurality of compute nodes of a parallel computer
US20130067443A1 (en) * 2011-09-07 2013-03-14 Kevin D. Howard Parallel Processing Development Environment Extensions

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115776523A (en) * 2023-02-13 2023-03-10 鹏城实验室 Distributed aggregate communication method, device, equipment and storage medium
CN115776523B (en) * 2023-02-13 2023-04-11 鹏城实验室 Distributed aggregate communication method, device, equipment and storage medium
CN117687800A (en) * 2024-02-02 2024-03-12 山东海量信息技术研究院 Cross-domain distributed computing method, system, storage medium and electronic equipment
CN117687800B (en) * 2024-02-02 2024-05-03 山东海量信息技术研究院 Cross-domain distributed computing method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2021195987A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
Liao et al. DPillar: Dual-port server interconnection network for large scale data centers
Abd-El-Barr et al. Reliability analysis and fault tolerance for hypercube multi-computer networks
CN115380271A (en) Topology aware multi-phase method for trunked communication
US20060268691A1 (en) Divide and conquer route generation technique for distributed selection of routes within a multi-path network
Paul et al. MG-Join: A scalable join for massively parallel multi-GPU architectures
CN115335804A (en) Avoiding network congestion by halving trunked communication
Mohtavipour et al. A novel packet exchanging strategy for preventing HoL-blocking in fat-trees
Olexandr et al. Routing method based on the excess code for fault tolerant clusters with InfiniBand
Li et al. Disjoint-paths and fault-tolerant routing on recursive dual-net
CN115335808A (en) Parallel method based on hybrid architecture in distributed training
Kobus et al. Gossip: Efficient communication primitives for multi-gpu systems
CN115336236B (en) Method implemented by a first computing node, first computing node and readable medium
Yang et al. Rwadmm: routing and wavelength assignment for distribution-based multiple multicasts in onoc
Moudi et al. A survey on emerging issues in interconnection networks
Liang et al. Beyond the performance of three-tier fat-tree: equality topology with low diameter
Chakaravarthy et al. Mapping strategies for the PERCS architecture
Gupta et al. Performance analysis of a synchronous, circuit-switched interconnection cached network
Ansari et al. A 3-disjoint path design of non-blocking shuffle exchange network by extra port alignment
US20040158663A1 (en) Interconnect topology for a scalable distributed computer system
Zahid et al. Compact network reconfiguration in fat-trees
Lenzen et al. CLEX: Yet Another Supercomputer Architecture?
Takahashi et al. A stable broadcast algorithm
Rout et al. Performance evaluation of the controller in software-defined networking
Peratikou An optimised and generalised node for fat tree classes
US20240202154A1 (en) Mirrored switch configuration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination