WO2021195988A1 - Network congestion avoidance over halving-doubling collective communication - Google Patents

Network congestion avoidance over halving-doubling collective communication

Info

Publication number
WO2021195988A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing node
node
aggregation
different
data
Application number
PCT/CN2020/082516
Other languages
French (fr)
Inventor
Jianxi YE
Shaochuang WANG
Qianyuan RAN
Fei FENG
Jianbo Dong
Original Assignee
Alibaba Group Holding Limited
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to PCT/CN2020/082516
Priority to CN202080098260.0A
Publication of WO2021195988A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • in recent years, deep neural networks (DNNs) have been widely adopted in various application domains (e.g., computer vision, natural language processing, speech recognition, etc.).
  • the sizes of neural network models and of the training data required to train them have substantially increased, which unavoidably leads to increasingly long training times and thus adversely affects the effectiveness and timeliness of the trained models in meeting ever-changing application environments.
  • a distributed training system, which employs parallel training, may be used.
  • a distributed training system may include a great number of computing nodes or servers distributed over a network, and assign subsets of computing tasks to the computing nodes or servers for performing computations in parallel training.
  • data communications between computing nodes or servers in a distributed training system pose a lower bound or a bottleneck for an amount of reduction in a training time that may happen in the distributed training system. This is especially true when a distributed training system includes various types of heterogeneous connections or inter-connects within and between computing nodes or servers, which exhibit different characteristics in terms of latency, bandwidth, topology, etc. Such heterogeneity in connections or inter-connects increases the difficulty and complexity in designing a network of data communications for the computing nodes or servers in the distributed training system.
  • network congestion may occur due to an excessive amount of data flows passing through a certain network switch or connection between computing nodes or servers in the distributed training system, which may lead to a prolonged training time due to a delay in processing training results.
  • Such excessive amount of data flows that pass through a certain network switch or connection may be caused by a loss of control on path selection for routing data sent between computing nodes or servers.
  • FIG. 1 illustrates an example environment in which a distributed training system may be used.
  • FIG. 2 illustrates an example computing node in more detail.
  • FIG. 3A illustrates a ring configuration that interconnects a predetermined number of nodes.
  • FIG. 3B shows a halving-doubling configuration that interconnects a predetermined number of nodes.
  • FIG. 4 shows a schematic diagram depicting an example collective communication library.
  • FIG. 5 shows an example topology-aware multi-phase algorithm.
  • FIG. 6 shows an example ring-based algorithm for a computing node in an intra-node reduce-scatter phase.
  • FIG. 7 shows an example halving-doubling algorithm for a computing node in an intra-node reduce-scatter phase.
  • FIG. 8 shows an example halving-doubling algorithm in an inter-node allreduce phase.
  • FIG. 9 shows the example halving-doubling algorithm in the inter-node allreduce phase in more detail.
  • FIG. 10 shows an example ring-based collective communication algorithm.
  • FIG. 11 shows an example scenario of performing an intra-node reduce-scatter phase, an inter-node allreduce phase, and an intra-node allgather phase in a parallel or overlapping manner.
  • FIG. 12 shows an example fat-tree network topology.
  • FIG. 13 shows an example scenario of using a first congestion avoidance approach.
  • FIG. 14 shows an example scenario of using a second congestion avoidance approach.
  • FIG. 15 shows an example topology aware multi-phase method.
  • FIG. 16 shows a first example network congestion avoidance method.
  • FIG. 17 shows a second example network congestion avoidance method.
  • FIG. 18 shows an example parallel method based on hybrid architecture in distributed training.
  • network congestion may occur due to a loss of control on path selection for routing data sent between computing nodes, resulting in an excessive amount of data flows passing through a certain network switch or connection between the computing nodes in the distributed training system, and leading to a prolonged training time due to a delay in processing training results.
  • existing distributed training systems fail to distinguish algorithms for different types of underlying fabrics for a collective operation, and hence lead to poor performance.
  • the example distributed training system may employ a fabric-aware collective communication library that enables the distributed training system to scale linearly.
  • the collective communication library may customize communication algorithms based at least in part on analysis of underlying fabrics and supporting network architectures to attain a desired or maximum efficiency.
  • the distributed training system may divide primitive operations into a plurality of sub-operations, with each sub-operation using a type of fabric.
  • the example distributed training system may implement a hybrid algorithm that allows a co-existence of multiple algorithms in a single collective operation, and selectively employ an algorithm for a particular fabric to enhance or maximize the efficiency for an entire communication path.
  • the distributed training system may adopt a two-process parallel algorithm that launches two concurrent processes and pipelines the use of intra-node and inter-node connections, thus improving the efficiency of communications by overlapping intra-node communications with inter-node communications.
  • the example distributed training system may employ a probing-based routing control mechanism that generates mappings from connections to paths, and thereby distribute or scatter the connections to different aggregation or intermediate switches in a communication network by re-ranking participants or processes in collective operations and mapping data flows across the distributed training system to particular physical links, thus avoiding network congestion.
  • the application describes multiple and varied embodiments and implementations.
  • the following section describes an example framework that is suitable for practicing various implementations.
  • the application describes example systems, devices, and processes for implementing a distributed training system.
  • FIG. 1 illustrates an example environment 100 usable to implement a distributed training system.
  • the environment 100 may include a distributed training system 102.
  • the distributed training system 102 may include a plurality of computing nodes or servers 104-1, 104-2, ..., 104-K (hereinafter collectively referred to as computing nodes 104), where K is a positive integer greater than one.
  • the plurality of computing nodes 104 may communicate data with each other via a communication network 106.
  • the computing node 104 may be implemented as any of a variety of computing devices having computing/processing and communication capabilities, which may include, but not limited to, a server, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , etc., or a combination thereof.
  • the communication network 106 may be a wireless or a wired network, or a combination thereof.
  • the network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet). Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs), Wide Area Networks (WANs), and Metropolitan Area Networks (MANs). Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier connection (such as an optical fiber connection, etc.).
  • Wireless networks may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc. ) , etc.
  • the communication network 106 may include a plurality of inter-node interconnects or switches 108-1, 108-2, ..., 108-L (hereinafter collectively referred to as inter-node switches 108) for providing connections between the computing nodes 104, where L is a positive integer greater than one.
  • the environment 100 may further include a client device 110.
  • a user may instruct the distributed training system 102 to perform training on a particular learning model (such as a deep neural network model) based on data sent from the client device 110 to the distributed training system 102, for example.
  • FIG. 2 illustrates the computing node 104 in more detail.
  • the computing node 104 may include, but is not limited to, one or more processing units 202, an input/output (I/O) interface 204, and/or one or more network interfaces 206, and memory 208.
  • the computing node 104 may further include one or more intra-node interconnects or switches 210.
  • the processing units 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the input/output interface 204, and/or the network interface 206.
  • the processing units 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components.
  • illustrative types of hardware logic components that can be used include, for example, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs).
  • the memory 208 may include machine readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM.
  • the machine readable media may include volatile or non-volatile types, and removable or non-removable media, which may achieve storage of information using any method or technology.
  • the information may include a machine readable instruction, a data structure, a program module or other data.
  • machine readable media examples include, but not limited to, phase-change memory (PRAM) , static random access memory (SRAM) , dynamic random access memory (DRAM) , other types of random-access memory (RAM) , read-only memory (ROM) , electronically erasable programmable read-only memory (EEPROM) , quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM) , digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing node.
  • the machine readable media does not include any transitory media, such as modulated data signals and carrier waves.
  • the network interfaces 206 may be configured to connect the computing node 104 to other computing nodes via the communication network 106.
  • the network interfaces 206 may be established through a network interface controller (NIC) , which may employ both hardware and software in connecting the computing node 104 to the communication network 106.
  • each type of NIC may use a different type of fabric or connector to connect to a physical medium associated with the communication network 106. Examples of types of fabrics or connectors may be found in the IEEE 802 specifications, and may include, for example, Ethernet (which is defined in 802.3) , Token Ring (which is defined in 802.5) , and wireless networking (which is defined in 802.11) , an InfiniBand, etc.
  • the intra-node switches 210 may include various types of interconnects or switches, which may include, but are not limited to, a high-speed serial computer expansion bus (such as PCIe, etc.), a wire-based serial multi-lane near-range communication link (such as NVLink, for example), a switch chip with a plurality of ports (e.g., an NVSwitch, etc.), a point-to-point processor interconnect (such as an Intel QPI/UPI, etc.), etc.
  • the computing node 104 may further include other hardware components and/or other software components, such as program modules 212 to execute instructions stored in the memory 208 for performing various operations, and program data 214 for storing data received for training, intermediate and final results calculated during training, etc.
  • FIGS. 3A and 3B illustrate example collective communication algorithms that may be used in the distributed training system 102.
  • collective communication algorithms may include, but are not limited to, a ring-based communication algorithm, a halving-doubling communication algorithm, etc.
  • FIG. 3A shows a ring configuration that interconnects a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than one) with N connections, divides data (e.g., a data packet or message) into N data chunks for transmission, and needs N - 1 steps of communication to complete a collective operation.
  • a node may receive data from one of its neighboring nodes, conduct a specific operation with the received data to obtain a local result, and forward the received data to the other of the neighboring nodes.
  • each node in the ring then has data from the other nodes of the ring, and a final result is scattered among all the nodes; another N - 1 steps are needed for broadcasting the respective local results.
  • a total data size that is forwarded is 2S, where S denotes a data size or message size.
  • FIG. 3B shows a halving-doubling configuration that interconnects a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than one) .
  • the nodes communicate with each other in a pair-wise manner, with only N/2 connections needed in each step of communication.
  • adjacent nodes are paired together, send one half of a message or data to their respective peer nodes, and receive the other half of the message or data for processing. Therefore, intermediate results may be scattered to the peer nodes.
  • new pairs are formed with increased or doubled distance, and a data size for processing is halved.
  • after log2 N steps, results are distributed among all the nodes in the halving-doubling configuration. Local results in the nodes are then broadcast to the other nodes through an additional log2 N steps of communication.
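As a concrete comparison of the two configurations, the following Python sketch (an editorial illustration, not part of the original disclosure) computes the step count and approximate per-node traffic for a ring allreduce versus a halving-doubling allreduce of a message of S bytes over N nodes. Both schemes move roughly 2S per node; the difference lies in the number of steps and in how many connections are active at once.

```python
import math

def ring_allreduce_cost(n_nodes: int, msg_bytes: float):
    """Ring allreduce: (N - 1) reduce-scatter steps plus (N - 1) allgather steps.
    Each node forwards msg_bytes / N per step, so it forwards about 2S in total."""
    steps = 2 * (n_nodes - 1)
    bytes_sent = 2.0 * msg_bytes * (n_nodes - 1) / n_nodes
    return steps, bytes_sent

def halving_doubling_allreduce_cost(n_nodes: int, msg_bytes: float):
    """Halving-doubling allreduce: log2(N) reduce-scatter steps with the data
    halved at every step, plus log2(N) allgather steps with the data doubled."""
    assert n_nodes & (n_nodes - 1) == 0, "assumes a power-of-two node count"
    steps = 2 * int(math.log2(n_nodes))
    bytes_sent = 2.0 * msg_bytes * (1.0 - 1.0 / n_nodes)  # S/2 + S/4 + ... twice
    return steps, bytes_sent

if __name__ == "__main__":
    msg = 1024 ** 2  # a 1 MiB message
    for n in (4, 8, 16):
        print(n, ring_allreduce_cost(n, msg), halving_doubling_allreduce_cost(n, msg))
```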
  • FIG. 4 shows a schematic diagram depicting an example collective communication library 400 that may be employed by the distributed training system 102.
  • the collective communication library is a communication library designed to provide high performance, high extensibility, and strong availability, and may be configured to provide support not only for standard collective operations such as allreduce and allgather operations, but also other self-defined operations for customized applications.
  • the collective communication library 400 may take different types of interconnects or switches with distinct characteristics (e.g., in terms of latency, bandwidth, topology, etc. ) into account, and provide a mechanism to collect information of underlying hardware in a network and computing nodes, thus enabling a topology-aware algorithm design to be developed based on one or more pieces of this collected information.
  • the collective communication library 400 may provide flexibility for allowing multiple algorithms to be performed in a single operation, and improve the performance (e.g., the performance of communications and training, etc. ) by exploiting parallelism between intra-node communications and inter-node communications. Additionally, the collective communication library 400 may make use of multiple NICs in a computing node with conventional or new mapping algorithms, and eliminate network congestions through a topology-aware arrangement of connections.
  • the collective communication library 400 may include a software stack 402.
  • the software stack 402 may include a plurality of components, which may include, but are not limited to, a transport component 404, an operation component 406, a communicator component 408, and a library context component 410.
  • the software stack 402 may be designed in a modular manner to allow generality and extensibility.
  • the transport component 404 may be responsible for peer-to-peer (P2P) data transfers or transmissions in intra-node and inter-node communications.
  • the collective communication library 400 may support TCP (Transmission Control Protocol) and RDMA (Remote Direct Memory Access) for inter-node communication, and P2P fabrics for intra-node communication, such as PCIe (Peripheral Component Interconnect Express) , NVLink/NVSwitch, and QPI/UPI (Quick Path Interconnect/Ultra Path Interconnect) , etc.
  • the transport component 404 may further be configured to manage memory regions (MRs) and corresponding memory buffers in both processing units (such as graphics processing unit (GPU) devices) and host memories.
  • MRs memory regions
  • GPU graphics processing unit
  • the operation component 406 may provide a set of basic operations and a variety of networking algorithms.
  • the basic operations may be configured with algorithms that are supported by the collective communication library 400.
  • the operation component 406 may allow a user definition of a new operation based on these basic operations to implement a heterogeneity-aware operation that may adopt an optimal or better algorithm for each type of fabric.
  • the communicator component 408 may be associated with a software process, and may be configured to perform manipulations and processing on a processing unit (such as a GPU device) .
  • the communicator component 408 may keep or record information about other peers (e.g., rank IDs, IP addresses, etc. ) , and maintain connections with the peers.
  • the communicator component 408 may further collect intra-node and inter-node topology information, and use this information to guide an algorithm design.
  • intra-node information may include, but is not limited to, a type of interconnect, a distance between locations of processing units, a distance between a processing unit and a network interface controller, etc.
  • inter-node information may include, but is not limited to, the number of available network interface controllers, a topology of a cluster or computing nodes, locations of computing nodes in the cluster, for example.
  • the library context component 410 may be configured to expose one or more application interfaces for setting system configurations (such as environment variables, for example), managing the communicator component 408, and providing other functionalities such as logging, etc.
  • the collective communication library 400 may further include or provide a plurality of tools and utilities 412 for topology awareness design, testing and evaluation, and availability improvement.
  • the tools and utilities 412 may include performance testing tools for transport component 404 to provide assistance for algorithm designs and evaluations, a probing-based routing mechanism for ensuring the availability of the system, and other functionalities, such as a device management function that is extendable to support devices other than GPUs, for example.
  • a collective communication may be defined as a communication that involves a group of processing units or processes, and an operation of collective communication may be executed by all the processing units or processes included in the group together.
  • Examples of an operation of collective communication may include, but are not limited to, an allreduce operation, an allgather operation, a reduce-scatter operation, etc.
  • an allreduce operation is one of a number of important primitives of collective communication in distributed training, and involves performing a reduction on data across processes in a group. Examples of the reduction may include, but are not limited to, an operation of summation, an operation of obtaining an average, an operation of obtaining a maximum, an operation of obtaining a minimum, etc.
  • an allreduce operation is used herein as an example to illustrate how a collective operation may be divided into a plurality of micro-operations or sub-operations.
  • the distributed training system 102 may employ a topology-aware multi-phase algorithm that divides an allreduce operation into multiple micro-operations or sub-operations, and selectively pick one or more micro-operations or sub-operations on demand, thus reducing an amount of data that is transferred by eliminating micro-operations or sub-operations that may not be needed.
  • the distributed training system 102 may decouple collective communication algorithms from micro-operations or sub-operations, and allow an independent or separate matching between algorithms and micro-operations or sub-operations based on underlying fabric information, thus maximizing or optimizing the bandwidth utilization with a lesser amount of data transferred.
  • FIG. 5 shows an example topology-aware multi-phase algorithm 500 that may be employed by the distributed training system 102.
  • the topology-aware multi-phase algorithm 500 may include a plurality of phases, for example, an intra-node reduce-scatter phase 502, an inter-node allreduce phase 504, and an intra-node allgather phase 506.
  • the distributed training system 102 may first assign respective portions of data to be processed for training to the multiple computing nodes 104, so that each computing node 104 of the multiple computing nodes 104 may receive a respective portion of the data.
  • each computing node 104 may divide the respective portion of data into multiple data pieces (e.g., N data pieces, where N is a positive integer) , and assign these multiple data pieces to multiple local processing units or processes (e.g., N local processing units or processes) that are included in the respective computing node 104.
  • each of the local processing units or processes included in each computing node 104 may divide a data piece assigned thereto into a plurality of data chunks (e.g., M chunks) .
  • the local processing units or processes included in each computing node 104 may then collaboratively perform an intra-node reduce-scatter sub-operation to obtain allreduce results of the plurality of data chunks in the respective computing node 104 according to a particular collective communication algorithm in a number of steps or iterations.
  • local processing units or processes included in a computing node 104 may have reduced results (or called reduce-scatter results) of all the processing units or processes included in that computing node 104 in different data chunks.
  • the distributed training system 102 may select the particular collective communication algorithm used in the intra-node reduce-scatter phase 502 based on collected information of a number of factors by the collective communication library 400, for example.
  • the number of factors may include, but are not limited to, types of interconnects between processing units (or processes) in a computing node, the number of interconnects in the computing node, etc.
  • the distributed training system 102 may employ a first collective communication algorithm for a first computing node, and employ a second collective communication algorithm for a second computing node having the same or different processing and connection capabilities as the first computing node, where the first collective communication algorithm may or may not be the same as the second collective communication algorithm.
  • the distributed training system 102 may employ a halving-doubling algorithm for a computing node that uses NVSwitch or PCIe as interconnects and includes a number of processing units or processes used for training that is a power of two, and may employ a ring-based algorithm for another computing node that uses NVLink or other interconnects and uses a number of processing units or processes for training that is not a power of two, etc.
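A minimal sketch of how such per-node algorithm selection could look in code is given below. The helper names (select_intra_node_algorithm, build_phase_plan) and the exact decision rule are assumptions made for illustration; they follow the selection criteria described above but are not the patent's implementation.

```python
def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def select_intra_node_algorithm(interconnect: str, num_local_ranks: int) -> str:
    """Pick a collective algorithm for the intra-node phases of one computing
    node based on its fabric type and its number of participating ranks."""
    if interconnect in ("NVSwitch", "PCIe") and is_power_of_two(num_local_ranks):
        return "halving-doubling"
    return "ring"

def build_phase_plan(node_descriptions):
    """node_descriptions: list of (interconnect, num_local_ranks), one per node."""
    plan = []
    for interconnect, ranks in node_descriptions:
        algo = select_intra_node_algorithm(interconnect, ranks)
        plan.append({
            "intra_node_reduce_scatter": algo,             # phase 502
            "inter_node_allreduce": "chosen per network",  # phase 504
            "intra_node_allgather": algo,                  # phase 506
        })
    return plan

print(build_phase_plan([("NVSwitch", 8), ("NVLink", 6)]))
```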
  • FIG. 6 shows an example ring-based algorithm 600 for a computing node in the intra-node reduce-scatter phase 502.
  • the example ring-based algorithm includes a configuration of only one ring. Nevertheless, any ring-based algorithm including a configuration of more than one ring, with each ring processing a portion of the data chunks, for example, may be used.
  • the computing node is described to include M processing units or processes (with rank identifiers or numbers 1, 2, ..., M) , and data assigned to each processing unit or process is divided into M data chunks.
  • the processing unit or process may send a partial reduced result (in this example, the partial reduced result obtained by P1 at the (k-1)-th step) to the next processing unit or process (e.g., P2) in the ring, receive a partial reduced result (in this example, the partial reduced result obtained by PM at the (k-1)-th step) from the previous processing unit or process (e.g., PM), and reduce the received partial reduced result with another local data chunk that has not previously been sent or reduced with other data.
  • each processing unit or process may send or receive and reduce different data chunks (or partial results) at different steps.
  • each processing unit or process may include a resulting data chunk that stores a reduced result of M respective data chunks of the M processing units or processes in that computing node. For example, after M –1 steps, a data chunk of P1 “at the top position” stores a reduced result of all the data chunks of the M processing units or processes corresponding to “that top position” as shown in FIG. 6.
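The chunk movement in this single-ring reduce-scatter can be written down compactly. The sketch below is an editorial illustration using one common indexing convention (0-based ranks and chunks); the actual chunk-to-position assignment in FIG. 6 may differ.

```python
def ring_reduce_scatter_schedule(m: int):
    """At step k (0-based), rank r sends its partial result for chunk
    (r - k) mod M to rank (r + 1) mod M, and receives chunk (r - k - 1) mod M
    from rank (r - 1) mod M, reducing it into its local copy."""
    schedule = []
    for k in range(m - 1):
        for r in range(m):
            schedule.append({
                "step": k,
                "rank": r,
                "send_chunk": (r - k) % m,
                "recv_chunk": (r - k - 1) % m,
                "send_to": (r + 1) % m,
                "recv_from": (r - 1) % m,
            })
    return schedule

# After M - 1 steps, rank r holds the fully reduced chunk (r + 1) mod M
# under this convention.
for entry in ring_reduce_scatter_schedule(4)[:4]:
    print(entry)
```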
  • FIG. 7 shows an example halving-doubling algorithm 700 for a computing node in the intra-node reduce-scatter phase 502.
  • the computing node is described to include M processing units or processes (M is set as eight in this example for illustration) .
  • in each step, the processing unit or process (e.g., P1) may send one half of the partial reduced result that is locally obtained at a previous step to a different processing unit or process that is located at an increasingly further distance from the processing unit or process (i.e., P1), receive one half of the partial reduced result from that peer processing unit or process, and reduce the received partial reduced result with the other half of the partial reduced result that is locally obtained at the previous step to obtain a new partial reduced result for the processing unit or process (i.e., P1).
  • at the end of the intra-node reduce-scatter phase 502 (i.e., after log2 M steps, or 3 steps in this example), each processing unit or process may include a resulting data chunk that stores a reduced result of the M respective data chunks of the M (in this example, eight as shown in FIG. 7) processing units or processes in that computing node. For example, after log2 M steps, a data chunk of P1 "at the bottom position" stores a reduced result of all the data chunks of the M (in this example, eight as shown in FIG. 7) processing units or processes corresponding to that bottom position, as shown in FIG. 7.
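The pairing pattern of this halving-doubling reduce-scatter can be sketched as follows. This is an illustration under the assumption that peers are chosen by XOR-ing the rank with a distance that doubles every step (consistent with the description above); the concrete half kept at each step in FIG. 7 may be assigned differently.

```python
def halving_doubling_pairs(m: int):
    """Return, for each step, the (rank, peer) pairs and the fraction of the
    data each rank still holds after that step."""
    assert m & (m - 1) == 0, "halving-doubling assumes a power-of-two rank count"
    steps, distance, fraction = [], 1, 2
    while distance < m:
        step = [(rank, rank ^ distance, f"keeps 1/{fraction} of the data")
                for rank in range(m)]
        steps.append(step)
        distance *= 2   # pairs are formed at a doubled distance each step
        fraction *= 2   # and the data handled per rank is halved
    return steps

for k, step in enumerate(halving_doubling_pairs(8)):
    print("step", k, step[:2], "...")
```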
  • an inter-node allreduce sub-operation is node-based (i.e., between different computing nodes) , and may be performed between processing units (or processes) included in different computing nodes.
  • processing units (or processes) of different computing nodes holding a same data chunk of reduced results (or reduce-scatter results) are formed into a same group, and communicate respective results with each other in the same group to perform an inter-node allreduce sub-operation.
  • each processing unit or process of each computing node in a certain group may possess a particular data chunk of reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes in the different groups.
  • the distributed training system 102 may select a particular collective communication algorithm based on one or more selection criteria, and may implement inter-node allreduce sub-operations based on the selected collective communication algorithm.
  • examples of the particular collective communication algorithm may include, but are not limited to, a ring-based algorithm (such as a hierarchical ring algorithm, a multi-ring algorithm, etc.), a halving-doubling algorithm, etc.
  • the one or more selection criteria may include, but are not limited to, a topology of a communication network (e.g., the communication network 106) connecting the computing nodes, the number of switches used in the communication network, types of switches used in the communication network, a network type of the communication network, etc.
  • FIGS. 8 and 9 show an example halving-doubling algorithm in the inter-node allreduce phase 504.
  • the distributed training system 102 is described to include a plurality of computing nodes (i.e., Node 0, Node 1, Node 2, ...Node N-1, where N is shown as four in FIG. 8 for illustration) , with each computing node including eight processing units or processes with corresponding rank numbers (namely, Rank 0, Rank 1, Rank 2, ...Rank M-1, where M is shown as eight in FIG. 8 for illustration) as shown in FIG. 8.
  • processing units or processes having a same rank number in corresponding computing nodes include a same data chunk of reduced results (or reduce-scatter results) , and are formed into a same group.
  • processing units or processes having a rank number 0 in corresponding computing nodes include a data chunk of reduced results at the first position among respective local data chunks, and are formed into a same group (e.g., group 0) .
  • processing units or processes in different groups may not communicate with each other.
  • an inter-node allreduce sub-operation may be separately performed between processing units (or processes) in each group, so that each processing unit (or process) in a group may obtain all reduced results of a same data chunk of all processing units (or processes) in the same group.
  • a processing unit or process in each group may iteratively send a local reduced result of a corresponding data chunk to the other processing units or processes in the respective group, receive respective local reduced results of the corresponding data chunk from the other processing units or processes at doubled or increased distances, and perform a reduction operation on the received reduced results with its local reduced result.
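The group formation and the recursive-doubling exchange inside each group can be illustrated with the short sketch below (hypothetical helper names; global ranks are assumed to be node_index * ranks_per_node + local_rank, matching the rank layout described for FIG. 8).

```python
def build_inter_node_groups(num_nodes: int, ranks_per_node: int):
    """Group g holds the global rank (node * ranks_per_node + g) of every node,
    i.e., the processing units that own the same reduce-scatter chunk."""
    return [[node * ranks_per_node + g for node in range(num_nodes)]
            for g in range(ranks_per_node)]

def recursive_doubling_steps(num_nodes: int):
    """Within a group, the member on node i exchanges with the member on node
    (i XOR 2^k) at step k, doubling the distance each step."""
    steps, distance = [], 1
    while distance < num_nodes:
        steps.append([(i, i ^ distance) for i in range(num_nodes)])
        distance *= 2
    return steps

print(build_inter_node_groups(num_nodes=4, ranks_per_node=8)[0])  # group 0
print(recursive_doubling_steps(4))
```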
  • FIG. 9 shows an example scenario of applying a halving-doubling algorithm for eight computing nodes.
  • a first processing unit or process of a certain group in a first computing node (e.g., Node 0) may send a local reduced result thereof to a second processing unit or process of the same group in a second computing node (e.g., Node 1) , receive a local reduced result from the second processing unit or process of the same group in the second computing node, and perform a reduction operation on the local reduced result thereof and the received local reduced result to obtain a new local reduced result.
  • in a subsequent step, the first processing unit or process (e.g., the processing unit or process with the rank number 0) in the first computing node may send its new local reduced result to a third processing unit or process of the same group (e.g., with the rank number 0) in a third computing node (i.e., Node 2 in this example), receive a local reduced result from the third processing unit or process of the same group in the third computing node, and perform a reduction operation on its new local reduced result and the received local reduced result to obtain another new local reduced result.
  • each processing unit or process of each computing node in a certain group may possess a particular data chunk of reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes in the different groups.
  • an inter-node allreduce sub-operation may be separately performed between processing units (or processes) of each group in a plurality of computing nodes (e.g., N computing nodes) using a ring-based algorithm, so that each processing unit (or process) in a group may obtain all reduced results of a same data chunk of all processing units (or processes) in the same group.
  • a processing unit or process of each group in a computing node may iteratively send a local reduced result of a corresponding data chunk to a processing unit or process of the respective group in a next computing node, receive a local reduced result of the corresponding data chunk from a processing unit or process of the respective group in a previous computing node, and perform a reduction operation on the received reduced result with its local reduced result.
  • each processing unit or process of each computing node in a certain group may possess a particular data chunk of reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes in the different groups.
  • an allgather sub-operation may be performed across local processing units or processes in each computing node of the plurality of computing nodes of the distributed training system 102, to locally broadcast respective reduced results obtained in the inter-node allreduce phase 504 to each other in the same computing node.
  • each processing unit or process in each computing node of the distributed training system 102 may have a reduced result of the entire data that is distributed among the plurality of computing nodes.
  • a ring-based algorithm is used herein to illustrate how to broadcast reduced results that are obtained (in the inter-node allreduce phase 504) by processing units or processes locally in a computing node of the distributed training system 102.
  • the distributed training system 102 may employ different or same collective communication algorithms (such as the halving-doubling algorithm, etc. ) for different computing nodes.
  • the distributed training system 102 may employ different or same collective communication algorithms for different computing nodes based on a number of factors associated with each individual computing node. In implementations, the number of factors may include, but are not limited to, types of interconnects between processing units (or processes) in a computing node, the number of interconnects in the computing node, etc.
  • FIG. 10 shows an example ring-based collective communication algorithm 1000 used for broadcasting individual reduced results of processing units or processes to each other within a computing node of the distributed training system 102.
  • each processing unit or process (e.g., P1) of the M processing units or processes in the computing node may send its reduced result obtained in the inter-node allreduce phase 504 to one (e.g., P2 in this example) of its two neighboring processing units or processes according to a ring configuration, and receive a reduced result from the other (e.g., PM in this example) of the two neighboring processing units or processes.
  • each processing unit or process may send a newly received reduced result to one (e.g., P2 in this example) of two neighboring processing units or processes according to the ring configuration, and receive another reduced result from the other (e.g., PM in this example) of the two neighboring processing units or processes.
  • each processing unit or process in the computing node may then have the reduced results of all the processing units or processes in the computing node.
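For completeness, the ring-based intra-node allgather can be simulated in a few lines; the sketch below is illustrative only and simply forwards, at every step, the result each rank received most recently to its next neighbour.

```python
def ring_allgather(local_results):
    """Simulate a ring allgather over len(local_results) ranks; after M - 1
    steps every rank holds all M per-rank results."""
    m = len(local_results)
    gathered = [[local_results[r]] for r in range(m)]
    for _ in range(m - 1):
        # each rank forwards its most recently obtained result to rank r + 1
        outgoing = [gathered[r][-1] for r in range(m)]
        for r in range(m):
            gathered[r].append(outgoing[(r - 1) % m])
    return gathered

print(ring_allgather(["res0", "res1", "res2", "res3"]))
```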
  • the distributed training system 102 may perform the plurality of phases included in the topology-aware multi-phase algorithm, i.e., the intra-node reduce-scatter phase 502, the inter-node allreduce phase 504, and the intra-node allgather phase 506, etc., sequentially.
  • the distributed training system 102 may alternatively partially or substantially overlap some of the intra-node reduce-scatter phase 502, the inter-node allreduce phase 504, and the intra-node allgather phase 506, and perform some parts of these phases in parallel.
  • the distributed training system 102 may allow at least parts of the intra-node reduce-scatter phase 502 and the inter-node allreduce phase 504 to be performed in parallel, and likewise parts of the inter-node allreduce phase 504 and the intra-node allgather phase 506, thereby improving the utilization of intra-node and inter-node links (or connections), and preventing intra-node links from being idle while inter-node links are used, and vice versa.
  • FIG. 11 shows an example scenario of performing an intra-node reduce-scatter phase, an inter-node allreduce phase, and an intra-node allgather phase in a parallel or overlapping manner.
  • a processing unit or process of a computing node may divide a data chunk into multiple blocks (in this example, four blocks as shown in FIG. 11) , and distribute these blocks to at least two concurrent threads (e.g., a first thread 1102 and a second thread 1104) .
  • the processing unit or process may pipeline intra-node and inter-node sub-operations for execution by the at least two concurrent threads (in this example, the first thread 1102 and the second thread 1104) .
  • the first thread 1102 may perform an inter-node allreduce sub-operation (i.e., an operation in the inter-node allreduce phase 504) on a first data block (e.g., a data block 1106) while the second thread 1104 performs an intra-node reduce-scatter sub-operation (i.e., an operation in the intra-node reduce-scatter phase 502) on a second data block (e.g., a data block 1108) .
  • the first thread 1102 may perform an intra-node allgather sub-operation (i.e., an operation in the intra-node allgather phase 506) on a third data block (e.g., a data block 1110) , while the second thread 1104 performs an inter-node allreduce sub-operation on a fourth data block (e.g., a data block 1112) .
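The overlap described for FIG. 11 can be summarized as a small schedule; the listing below is an illustrative reconstruction of the two time slots shown (block and thread labels are editorial, not reference numerals from the figure).

```python
def overlapped_schedule(blocks=("block_0", "block_1", "block_2", "block_3")):
    """At each time slot, one thread drives an inter-node sub-operation while
    the other drives an intra-node sub-operation on a different block, so the
    intra-node and inter-node links are busy at the same time."""
    return [
        # (time_slot, thread_0_work, thread_1_work)
        (0, ("inter_node_allreduce", blocks[0]), ("intra_node_reduce_scatter", blocks[1])),
        (1, ("intra_node_allgather", blocks[2]), ("inter_node_allreduce", blocks[3])),
    ]

for slot, t0, t1 in overlapped_schedule():
    print(f"slot {slot}: thread 0 -> {t0}, thread 1 -> {t1}")
```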
  • the distributed training system 102 may divide an allgather operation involved in distributed neural network training into a plurality of sub-operations, namely, an inter-node allgather sub-operation, an intra-node allgather sub-operation, and a data copy sub-operation.
  • the inter-node allgather sub-operation may be similar to the inter-node allreduce sub-operation as described above, except that data (e.g., reduced results) is broadcast rather than reduced (i.e., received results are not reduced with local results), whereas the intra-node allgather sub-operation may be similar or identical to the intra-node allgather sub-operation (i.e., the operation in the intra-node allgather phase 506) as described above.
  • the data copy sub-operation may include an operation of copying resulting data (e.g., final reduced results) as parameters for output.
  • a processing unit or a process of a computing node may divide a data chunk into multiple blocks (e.g., four blocks) , and distribute these blocks to at least two concurrent threads (e.g., a first thread and a second thread) , and pipeline intra-node and inter-node sub-operations for execution by the at least two concurrent threads.
  • the first thread may perform an inter-node allgather sub-operation on a first data block while the second thread performs an intra-node allgather sub-operation on a second data block. Furthermore, the first thread may perform a data copy sub-operation on a third data block, while the second thread performs an inter-node allgather sub-operation on a fourth data block.
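The analogous two-thread schedule for the allgather operation, again as an illustrative reconstruction, looks like this:

```python
# (time_slot, thread_0_work, thread_1_work) for the pipelined allgather operation
allgather_pipeline = [
    (0, ("inter_node_allgather", "block_0"), ("intra_node_allgather", "block_1")),
    (1, ("data_copy", "block_2"), ("inter_node_allgather", "block_3")),
]
for slot, t0, t1 in allgather_pipeline:
    print(f"slot {slot}: thread 0 -> {t0}, thread 1 -> {t1}")
```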
  • data or traffic congestion may happen at some switches or links in the communication network 106.
  • the distributed training system 102 may adopt a predetermined congestion avoidance strategy to distribute or divert data traffic among various switches or links in the communication network 106, thus preventing an excessive amount of data from passing through a certain switch or link in the communication network 106 during training (e.g., during the inter-node allreduce sub-operation or phase, or the inter-node allgather sub-operation or phase).
  • the distributed training system 102 may adopt a first congestion avoidance approach that includes a strategy on ring generation, followed by a routing management of network flows. Additionally or alternatively, the distributed training system 102 may adopt a second congestion avoidance approach that includes a strategy on a reordering of node identification, followed by a routing management of network flows. Depending on the type of network topology of the communication network 106, the processing and communication capabilities of the plurality of computing nodes 104, etc., the distributed training system 102 may select one or more of the first congestion avoidance approach or the second congestion avoidance approach for routing data flows between all or part of the plurality of computing nodes in the distributed training system 102.
  • the distributed training system 102 may selectively combine parts of the first congestion avoidance approach and the second congestion avoidance approach to implement a new congestion avoidance approach.
  • both the first congestion avoidance approach and the second congestion avoidance approach may aim at specifying a dedicated network path for each direction of an inter-node data flow in a way that inter-node data flows have no or little conflict with each other.
  • the distributed training system 102 may obtain or establish mapping relationships between communication connections and routing paths (e.g., physical links) in advance.
  • a connection-path data structure in a form of a table, a linked list, etc., may be created and used for storing information of the mapping relationships.
  • the distributed training system 102 may selectively or strategically use a specific path for establishing a connection between any two computing nodes based on the connection-path data structure.
  • the distributed training system 102 may determine mapping relationships between communication connections and routing paths by enabling each computing node of the distributed training system 102 to send probing data packets to other computing nodes through varying source/destination ports of the probing data packets, to exhaust possible communication connections between computing nodes of the distributed training system 102.
  • other ways of determining the mapping relationships between communication connections and routing paths may also be employed by the distributed training system 102, which are not limited herein.
  • a first computing node may send a plurality of probing data packets to a second computing node, each probing data packet having a different combination of source and destination ports, while the source address and the destination address are the address of the first computing node and the address of the second computing node, respectively.
  • each probing data packet may record the switches through which the respective probing data packet passes, and thus the first computing node may learn the entire routing path of the respective probing data packet for mapping when the respective probing data packet is returned to the first computing node.
  • mapping relationships between communication connections and routing paths (and hence connection-path data structures) for other pairs of computing nodes in the distributed training system 102 may be established accordingly.
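A minimal sketch of such a probing-based mapping is shown below. The probe() callback is a placeholder for whatever path-recording mechanism the fabric provides; it, the port range, and the path string format are assumptions made for illustration, not an API of the described system.

```python
from typing import Callable, Dict, Tuple

Connection = Tuple[str, int, str, int]  # (src_addr, src_port, dst_addr, dst_port)

def build_connection_path_table(src_addr: str, dst_addr: str, ports,
                                probe: Callable[[Connection], str]) -> Dict[Connection, str]:
    """Exhaustively vary the source/destination port pair and record the path
    (e.g., which aggregation switch) each probe traverses."""
    table = {}
    for sport in ports:
        for dport in ports:
            conn = (src_addr, sport, dst_addr, dport)
            table[conn] = probe(conn)
    return table

def pick_connection_through(table: Dict[Connection, str], wanted_switch: str):
    """Choose a connection whose recorded path goes through a given switch."""
    for conn, path in table.items():
        if wanted_switch in path:
            return conn
    return None

# usage with a fake probe, purely for illustration
fake_probe = lambda c: f"leaf0->agg{(c[1] + c[3]) % 4}->leaf1"
table = build_connection_path_table("10.0.0.1", "10.0.0.2", range(32768, 32772), fake_probe)
print(pick_connection_through(table, "agg2"))
```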
  • a fat-tree network (or, in particular, a two-tier Clos network architecture in a full-mesh topology) is used herein as an example network topology of the communication network 106 that is associated with the distributed training system 102.
  • the example congestion avoidance strategies described herein may also be applicable to other network topologies.
  • FIG. 12 shows an example fat-tree network topology 1200.
  • the example fat-tree network topology is a two-tier Clos network architecture in a full-mesh topology.
  • One tier corresponds to a tier of leaf switches 1202 that are directly connected to computing nodes 1204, with each leaf switch 1202 being connected to one or more computing nodes 1204.
  • a computing node 1204 may include one or more network interface controllers (e.g., four network interface controllers) which are connected to one or more ports (e.g., four ports) of a leaf switch 1202.
  • the number of network interface controllers in each computing node 1204 may or may not be the same.
  • Another tier corresponds to a tier of aggregation switches 1206 (or called spine switches 1206) that are connected to one or more leaf switches 1202.
  • based on the connection-path data structure as described above, a data packet that is transmitted between two processing units or processes can be made to flow through a specified aggregation switch by setting an appropriate combination of source and destination ports in the data packet.
  • the routing management of the first congestion avoidance approach and/or the second congestion avoidance approach may aim at enabling data flows from a same source leaf switch to different destination leaf switches to pass through different aggregation switches, and/or data flows from different source leaf switches to a same destination leaf switch to pass through different aggregation switches, thus avoiding collisions between the data flows and preventing network congestion at the aggregation switches.
  • the first congestion avoidance approach may include a strategy on ring generation, followed by a routing management of network flows.
  • the first congestion avoidance approach may support a variety of ring-based algorithms, which include, but are not limited to, a ring algorithm, a ring chunked algorithm, a multi-ring algorithm, a hierarchical ring algorithm, an algorithm involving multiple hierarchical rings, a node-aware ring algorithm, etc.
  • the strategy on ring generation may include a topology-aware strategy on ring generation.
  • the topology-aware strategy on ring generation may include a plurality of rules to build up a ring or ring configuration of processing units or processes.
  • a processing unit or process in a computing node may send/receive data to/from a processing unit or process in another computing node through a network interface controller.
  • a processing unit or process in a computing node may be associated with a single network interface controller or multiple network interface controllers for transmitting data to processing units or processes in other computing nodes. Additionally or alternatively, multiple processing units or processes may be associated with a single network interface controller, and employ that network interface controller for transmitting data to processing units or processes in other computing nodes.
  • the plurality of rules may include, but are not limited to, priorities for a processing unit or process in a first computing node to select a neighboring processing unit or process, conditions for a network interface controller in a first computing node to send or receive data, conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node, etc.
  • priorities for a processing unit or process in a first computing node to select a neighboring processing unit or process may include, in a descending order of priority: selecting a processing unit or process in the first computing node and using an inter-process communication if applicable; selecting a processing unit or process in a second computing node connected to the same leaf switch as the first computing node; and selecting a processing unit or process in a third computing node connected to a leaf switch that is different from the leaf switch connected to the first computing node, where the first computing node is different from the second computing node and the third computing node.
  • conditions for a network interface controller in a first computing node to send or receive data may include, for example, the network interface controller being capable of sending data to a network interface controller in a second computing node only, and/or the network interface controller being capable of receiving data from a network interface controller in a third computing node only, where the first computing node is different from the second computing node and the third computing node, and the second computing node may or may not be the same as the third computing node.
  • conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node may include, for example, routing data sent by processing units or processes belonging to multiple rings to the network interface controller in the second computing node if the data is sent through the network interface controller in the first computing node.
  • conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node may further include receiving data through the network interface controller in the first computing node if the data is sent by processing units or processes belonging to multiple rings through the network interface controller in the second computing node.
  • the routing management of the first congestion avoidance approach may assign network interface controller (NIC) identifiers to each network interface controller that is connected or linked to a same leaf switch.
  • the routing management of the first congestion avoidance approach may further assign an aggregation identifier to each aggregation switch in the communication network 106.
  • for each processing unit or process, the routing management may determine a routing identifier for routing a data packet sent from that processing unit or process.
  • if a data packet does not need to traverse any aggregation switch, the routing identifier may be determined as a default value or identifier. This default routing identifier indicates that the data is either routed within a computing node or through a leaf switch only, without passing through any aggregation switch in the communication network. Otherwise, the routing identifier may be determined to be equal to a NIC identifier of that processing unit or process, or another predefined value. Based on a mapping relationship between routing identifiers and aggregation identifiers, an aggregation identifier may then be determined from the determined routing identifier.
  • the mapping relationship between routing identifiers and aggregation identifiers may be determined in advance using a probing-based routing mechanism (e.g., sending probing data packets between computing nodes as described in the foregoing description), for example.
  • data flows between processing units (or processes) which are included in a same computing node, or whose network interface controllers are connected to a same leaf switch, will not go through any aggregation switch in the communication network.
  • data flows between processing units (or processes) which are included in different computing nodes and whose network interface controllers are connected to different leaf switches will pass through a designated aggregation switch based on a predetermined mapping relationship, thus enabling routing control and management of data flows and distributing the data flows to different aggregation switches to avoid network congestion.
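Put together, the routing management of the first approach can be sketched as follows (names and the default value are editorial; the mapping from routing identifiers to aggregation switches is whatever the probing step produced).

```python
DEFAULT_ROUTE = -1  # traffic that stays within a node or under one leaf switch

def routing_id(src_node: int, dst_node: int,
               src_leaf: int, dst_leaf: int, src_nic_id: int) -> int:
    """Default for local / same-leaf traffic; otherwise tag with the NIC id."""
    if src_node == dst_node or src_leaf == dst_leaf:
        return DEFAULT_ROUTE
    return src_nic_id

def aggregation_id(route_id: int, route_to_agg: dict):
    """Map a routing identifier to a designated aggregation switch."""
    if route_id == DEFAULT_ROUTE:
        return None  # no aggregation switch is traversed
    return route_to_agg[route_id]

route_to_agg = {0: "agg0", 1: "agg1", 2: "agg2", 3: "agg3"}  # assumed pre-probed mapping
rid = routing_id(src_node=0, dst_node=5, src_leaf=0, dst_leaf=2, src_nic_id=1)
print(aggregation_id(rid, route_to_agg))  # -> "agg1"
```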
  • FIG. 13 shows an example scenario of using the first congestion avoidance approach.
  • four inter-node rings or ring configurations, R0, R1, R2, and R3, involving eight computing nodes (Node 0, Node 1, ..., Node 7) are generated, and each ring uses a different aggregation switch for sending and receiving data (e.g., during the inter-node allreduce phase 504). Therefore, no conflict exists among these four rings.
  • each leaf switch of any ring has only one data flow coming in, and one data flow coming out, thus avoiding an occurrence of network congestion.
  • the second congestion avoidance approach may include a strategy on a reordering of node identification, followed by a routing management of network flows.
  • the second congestion avoidance approach may reorder identifiers of computing nodes and processing units (or processes) according to a network topology connecting the computing nodes and processing units (or processes) based on a plurality of rules.
  • the plurality of rules may include, for example, grouping computing nodes by respective leaf switches. For example, computing nodes connecting to a same leaf switch (e.g., computing nodes having network interface controllers that are linked to a same leaf switch) are formed into one group, and each computing node is assigned with a node identifier. Since the computing nodes are connected to the same leaf switch, these computing nodes are (physically) adjacent to each other.
  • the plurality of rules may further include assigning rank identifiers (or rank numbers) to each processing unit or process in the computing nodes using a same order sequence. For example, the k number of processing units (or processes) in a first computing node may be assigned with rank identifiers 0, 1, ..., k − 1, and the k number of processing units (or processes) in a second computing node may be assigned with rank identifiers k, k + 1, ..., 2k − 1, and so forth for other computing nodes.
  • Processing units (or processes) in a computing node may be ordered according to respective network interface controllers that the processing units (or processes) use, and processing units (or processes) using a same network interface controller are (physically) adjacent to each other.
  • L is the number of computing nodes per leaf switch for a node-aware halving-doubling algorithm as described in the foregoing description.
  • L is a product of the number of computing nodes per leaf switch and the number of processing units (or processes) per computing node for a conventional halving-doubling algorithm.
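As a hedged illustration of the reordering rules above (the dictionary shapes and function name below are assumptions chosen for clarity), node identifiers may be assigned so that computing nodes under a same leaf switch receive adjacent identifiers, and rank identifiers may be assigned in a same order sequence within every node, with processing units (or processes) sharing a network interface controller receiving adjacent ranks:

```python
def reorder_identifiers(nodes_by_leaf, procs_per_node):
    """Assign node and rank identifiers following the grouping rules above.

    nodes_by_leaf: {leaf_switch: [node, ...]} (nodes grouped by leaf switch)
    procs_per_node: {node: [(process, nic), ...]} (processes with their NICs)
    Returns (node_ids, rank_ids) as dictionaries.
    """
    node_ids, rank_ids = {}, {}
    next_node, next_rank = 0, 0
    for leaf in sorted(nodes_by_leaf):
        for node in nodes_by_leaf[leaf]:       # nodes under one leaf get adjacent ids
            node_ids[node] = next_node
            next_node += 1
            # Order processes by the NIC they use, so processes sharing a NIC
            # receive adjacent rank identifiers; use the same order in every node.
            for proc, _nic in sorted(procs_per_node[node], key=lambda p: p[1]):
                rank_ids[proc] = next_rank
                next_rank += 1
    return node_ids, rank_ids
```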
  • the routing management of the second congestion avoidance approach may include determining an aggregation identifier for a data flow or data packet sent from a first processing unit (or process) having a first rank identifier in a first computing node having a first node identifier to a second processing unit (or process) having a second rank identifier in a second computing node having a second node identifier, where the first computing node may or may not be the same as the second computing node.
  • the aggregation identifier may be determined based at least in part on at least some of the rank identifier, the node identifier, the number of network interface controllers per computing node, and a maximum number of computing nodes at each leaf switch.
  • the aggregation identifier may be determined as: the first rank identifier of the first processing unit (or process) from which a data flow or data packet is sent + (the first node identifier of the first computing node that includes the first processing unit (or process) % the maximum number of computing nodes at each leaf switch) × the number of network interface controllers per computing node, where % represents a modulus operator and × represents multiplication.
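The formula above can be transcribed directly as follows. This is a hedged sketch: whether the rank identifier is taken globally or as a node-local index is not restated here, and the example values are only assumptions chosen to match the later illustration of four network interface controllers per computing node and at most two computing nodes per leaf switch.

```python
def aggregation_id(rank_id, node_id, nics_per_node, max_nodes_per_leaf):
    """Transcription of the formula above; % is the modulus operator."""
    return rank_id + (node_id % max_nodes_per_leaf) * nics_per_node


# Illustrative check (assuming rank_id is taken as a node-local index 0..3):
# processes of an even-numbered node map to aggregation identifiers 0-3 while
# processes of an odd-numbered node map to 4-7, i.e., two disjoint sets of
# aggregation switches.
assert [aggregation_id(r, 0, 4, 2) for r in range(4)] == [0, 1, 2, 3]
assert [aggregation_id(r, 1, 4, 2) for r in range(4)] == [4, 5, 6, 7]
```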
  • the aggregation identifier may be determined based on a preset mapping relationship between aggregation identifiers and combinations of rank identifier and node identifier, etc.
  • the routing management of the second congestion avoidance approach may include assigning aggregation identifiers to each aggregation switch in the communication network 206 associated with the distributed training system 102 in advance. If the first processing unit (or process) and the second processing unit (or process) are linked to or under a same leaf switch (e.g., through respective network controllers) , the data flow or data packet will pass through that leaf switch without the need of passing through any aggregation switch in the communication network 206.
  • the data flow or data packet sent by the first processing unit (or process) to the second processing unit (or process) will pass through an aggregation switch having the determined aggregation identifier.
  • for illustration purposes, the number of network interface controllers included in each computing node is described to be four.
  • FIG. 14 shows an example scenario of using the second congestion avoidance approach.
  • all computing nodes include the same number of processing units (or processes) and the same number of network interface controllers, with each network interface controller being associated with the same number of processing units (or processes).
  • the number of network interface controllers linked to a leaf switch is fewer than the number of aggregation switches in the network.
  • the number of network interface controllers per computing node is four, and the maximum number of computing nodes at each leaf switch is two.
  • the number of computing nodes under a same leaf switch may be a power of two for a node-aware halving-doubling algorithm.
  • the number of network interface controllers included in computing nodes under a same leaf switch may be a power of two.
  • the number of processing units (or processes) using a same network interface controller may be a power of two for a conventional halving-doubling algorithm.
  • processing units (or processes) of computing nodes (Node 0, Node 2, Node 4, and Node 6) will use aggregation switches with aggregation identifiers (A1, A2, A3, and A4, for example).
  • processing units (or processes) of computing nodes (Node 1, Node 3, Node 5, and Node 7) will use aggregation switches with aggregation identifiers (A5, A6, A7, and A8, for example). Accordingly, no collision exists among data flows between computing nodes, thus avoiding network congestion at any aggregation switch in the network.
  • at each step of the algorithm, a processing unit (or process) may communicate data with a new processing unit (or process).
  • synchronization may be performed to ensure that a data flow conducted at a current step by the processing unit (or process) using a network interface controller does not overlap with a data flow conducted at a previous step by a neighboring processing unit (or process) using the same network interface controller, thus avoiding an occurrence of incast congestion; a possible synchronization scheme is sketched below.
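One hedged way to realize this synchronization (an assumption for illustration; the original text does not prescribe a specific primitive) is a per-NIC barrier that every process sharing the network interface controller passes before starting the data flow of a new step, so that flows of consecutive steps never overlap on the same controller. Processes are modeled with threads here for brevity.

```python
import threading


class NicStepGate:
    """Per-NIC gate: all processes sharing a NIC enter a new step together."""

    def __init__(self, num_procs_sharing_nic):
        self._barrier = threading.Barrier(num_procs_sharing_nic)

    def start_step(self):
        # Each sharer calls start_step() after finishing its previous flow and
        # before starting the next one, so the current step's flow cannot
        # overlap with a neighbor's previous-step flow on the same NIC.
        self._barrier.wait()
```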
  • FIG. 15 shows a schematic diagram depicting an example topology aware multi-phase method.
  • FIG. 16 shows a schematic diagram depicting a first example network congestion avoidance method.
  • FIG. 17 shows a schematic diagram depicting a second example network congestion avoidance method.
  • FIG. 18 shows a schematic diagram depicting an example parallel method based on hybrid architecture in distributed training.
  • the methods of FIGS. 15-18 may, but need not, be implemented in the environment of FIG. 1, using the computing node of FIG. 2, with the help of the methods and scenarios of FIGS. 3-14.
  • methods 1500-1800 are described with reference to FIGS. 1-14. However, the methods 1500-1800 may alternatively be implemented in other environments and/or using other systems.
  • the methods 1500 –1800 are described in the general context of machine-executable instructions.
  • machine-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types.
  • each of the example methods are illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof.
  • the order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein.
  • the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations.
  • some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
  • a first computing node (e.g., the computing node 104) may perform reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective communication algorithm.
  • the first computing node may select the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node.
  • the first collective communication algorithm may include, but is not limited to, a ring-based algorithm, or a halving-doubling algorithm.
  • performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm may include dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
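As a hedged, single-process illustration of these reduce-scatter sub-operations, the sketch below simulates the message exchange in plain Python rather than over real intra-node connections, and assumes a ring-based algorithm is selected as the first collective communication algorithm (a halving-doubling variant is equally possible per the preceding item).

```python
def ring_reduce_scatter(per_unit_data):
    """Simulate an intra-node ring reduce-scatter among n processing units.

    per_unit_data: list of n equal-length lists, one per processing unit.
    Returns, for each unit i, the fully reduced chunk it owns at the end
    (chunk (i + 1) % n under this indexing convention).
    """
    n = len(per_unit_data)
    size = len(per_unit_data[0]) // n
    # Divide each unit's data into n chunks and assign them to the units.
    chunks = [[list(d[j * size:(j + 1) * size]) for j in range(n)]
              for d in per_unit_data]
    for step in range(n - 1):
        # Unit i sends its partial of chunk (i - step) % n to unit (i + 1) % n.
        sends = [(i, (i - step) % n, list(chunks[i][(i - step) % n]))
                 for i in range(n)]
        for src, c, data in sends:
            dst = (src + 1) % n
            # The receiving unit reduces the received chunk with its local chunk.
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]
    return [chunks[i][(i + 1) % n] for i in range(n)]
```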
  • the first computing node may perform allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm.
  • the first computing node may select the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
  • the second collective communication algorithm may include, but is not limited to, a ring-based algorithm, or a halving-doubling algorithm (such as a node-aware halving-doubling algorithm), etc.
  • performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm may include receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
  • the first computing node may perform allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
  • performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm may include receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
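Taken together, the reduce-scatter, allreduce, and allgather sub-operations described above form a hierarchical allreduce across nodes. The minimal sketch below shows the orchestration for one processing unit; the three collective primitives are passed in as callables because the text above leaves their concrete implementations (ring-based or halving-doubling, intra- or inter-node) as selectable options.

```python
def hierarchical_allreduce(local_data, intra_reduce_scatter,
                           inter_allreduce, intra_allgather):
    """Three-phase allreduce: within-node reduce-scatter, cross-node allreduce
    on the unit's portion, then within-node allgather of the final portions."""
    portion = intra_reduce_scatter(local_data)   # phase 1: intra-node reduce-scatter
    portion = inter_allreduce(portion)           # phase 2: inter-node allreduce
    return intra_allgather(portion)              # phase 3: intra-node allgather
```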
  • a first computing node (e.g., the computing node 104) or a first process may determine a routing identifier for routing data from the first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch.
  • the first process and the second process may belong to a particular inter-node ring that connects a plurality of different nodes under a particular network topology.
  • the particular network topology may include a fat-tree topology.
  • the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
  • the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
  • the routing identifier may be set or determined as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
  • the routing identifier may be set or determined to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
  • the first computing node or the first process may route the data from the first process to the second process according to the routing identifier.
  • routing the data from the first process to the second process according to the routing identifier may include routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
  • a first computing node (e.g., the computing node 104) or a first process may determine an aggregation identifier for sending a data packet from the first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology.
  • the first computing node may assign different aggregation identifiers for data packets directed to computing nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  • the first computing node may assign a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  • the correspondence relationship may record a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
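As a hedged illustration of such a correspondence relationship, the table contents and names below are assumptions only; actual entries would depend on the fabric and on how the switches hash source and destination port pairs onto uplinks (e.g., as discovered by probing).

```python
# Hypothetical correspondence relationship: aggregation identifier ->
# (source port, destination port) pair known to be forwarded through the
# aggregation switch with that identifier.
PORT_PAIRS_BY_AGG_ID = {
    0: (10000, 20000),
    1: (10001, 20001),
    2: (10002, 20002),
    3: (10003, 20003),
}


def ports_for(aggregation_id):
    """Return the (source, destination) port pair that steers a data packet
    through the aggregation switch associated with aggregation_id."""
    return PORT_PAIRS_BY_AGG_ID[aggregation_id]
```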
  • the particular network topology may include a fat-tree topology.
  • the first computing node may send a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  • the first computing node may further send respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • the first computing node may further receive data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • a first computing node (e.g., the computing node 104) or a processing unit may divide a data chunk assigned to the processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment.
  • the first computing node or the processing unit may assign the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread.
  • the first computing node or the processing unit may perform an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
  • performing the intra-node sub-operation on the portion of the first data segment using the first thread may include transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
  • performing the inter-node sub-operation on the portion of the second data segment using the second thread may include transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
  • the intra-node sub-operation may include a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node
  • the inter-node sub-operation may include an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • the intra-node sub-operation may include an allgather sub-operation or a copy sub-operation performed within the first computing node
  • the inter-node sub-operation may include an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node
  • the first computing node or the processing unit may perform another inter-node sub-operation on the portion of the first data segment using the first thread, and perform another intra-node sub-operation on the portion of the second data segment using the second thread, in parallel.
  • performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
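The following sketch illustrates one round of this two-thread scheme under the assumption that the intra-node and inter-node sub-operations are supplied as callables; a real implementation would issue transfers over the actual intra-node and inter-node connections rather than plain function calls.

```python
import threading


def pipelined_round(first_segment, second_segment, intra_op, inter_op):
    """Run an intra-node sub-operation on first_segment and an inter-node
    sub-operation on second_segment concurrently on two threads, so the
    intra-node and inter-node connections are used at the same time."""
    results = [None, None]
    t_intra = threading.Thread(
        target=lambda: results.__setitem__(0, intra_op(first_segment)))
    t_inter = threading.Thread(
        target=lambda: results.__setitem__(1, inter_op(second_segment)))
    t_intra.start()
    t_inter.start()
    t_intra.join()
    t_inter.join()
    return results
```

In the next round, the roles may be swapped, so that the first thread performs an inter-node sub-operation on the portion of the first data segment while the second thread performs an intra-node sub-operation on the portion of the second data segment.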
  • Clause 1 A method implemented by a first computing node, the method comprising: performing reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
  • Clause 2 The method of Clause 1, further comprising selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node.
  • Clause 3 The method of Clause 1, further comprising selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
  • Clause 4 The method of Clause 1, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
  • Clause 5 The method of Clause 1, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 6 The method of Clause 1, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
  • Clause 7 The method of Clause 1, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 8 One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: performing reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
  • Clause 9 The one or more machine readable media of Clause 8, the acts further comprising selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node.
  • Clause 10 The one or more machine readable media of Clause 8, the acts further comprising selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
  • Clause 11 The one or more machine readable media of Clause 8, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
  • Clause 12 The one or more machine readable media of Clause 8, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 13 The one or more machine readable media of Clause 8, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
  • Clause 14 The one or more machine readable media of Clause 8, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 15 A first computing node comprising: a first plurality of processing units; and memory storing machine executable instructions that, when executed by the first plurality of processing units, cause the first plurality of processing units to perform acts comprising: performing reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to a first collective communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
  • Clause 16 The first computing node of Clause 15, the acts further comprising: selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node; and selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
  • Clause 17 The first computing node of Clause 15, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
  • Clause 18 The first computing node of Clause 15, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 19 The first computing node of Clause 15, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
  • Clause 20 The first computing node of Clause 15, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
  • Clause 21 A method implemented by a first computing node, the method comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
  • Clause 22 The method of Clause 21, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
  • Clause 23 The method of Clause 21, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
  • Clause 24 The method of Clause 21, wherein the particular network topology comprises a fat-tree topology.
  • Clause 25 The method of Clause 21, further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
  • Clause 26 The method of Clause 21, further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
  • Clause 27 The method of Clause 26, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
  • Clause 28 One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
  • Clause 29 The one or more machine readable media of Clause 28, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
  • Clause 30 The one or more machine readable media of Clause 28, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
  • Clause 31 The one or more machine readable media of Clause 28, wherein the particular network topology comprises a fat-tree topology.
  • Clause 32 The one or more machine readable media of Clause 28, the acts further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
  • Clause 33 The one or more machine readable media of Clause 28, the acts further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
  • Clause 34 The one or more machine readable media of Clause 33, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
  • Clause 35 A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
  • Clause 36 The first computing node of Clause 35, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
  • Clause 37 The first computing node of Clause 35, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
  • Clause 38 The first computing node of Clause 35, the acts further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
  • Clause 39 The first computing node of Clause 35, the acts further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
  • Clause 40 The first computing node of Clause 39, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
  • Clause 41 A method implemented by a first computing node, the method comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  • Clause 42 The method of Clause 41, further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  • Clause 43 The method of Clause 41, further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  • Clause 44 The method of Clause 43, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  • Clause 45 The method of Clause 41, wherein the particular network topology comprises a fat-tree topology.
  • Clause 46 The method of Clause 41, further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 47 The method of Clause 41, further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 48 One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  • Clause 49 The one or more machine readable media of Clause 48, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  • Clause 50 The one or more machine readable media of Clause 48, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  • Clause 51 The one or more machine readable media of Clause 50, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  • Clause 52 The one or more machine readable media of Clause 48, wherein the particular network topology comprises a fat-tree topology.
  • Clause 53 The one or more machine readable media of Clause 48, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 54 The one or more machine readable media of Clause 48, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 55 A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  • Clause 56 The first computing node of Clause 55, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  • Clause 57 The first computing node of Clause 55, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  • Clause 58 The first computing node of Clause 57, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  • Clause 59 The first computing node of Clause 55, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 60 The first computing node of Clause 55, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  • Clause 61 A method implemented by a first computing node, the method comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
  • Clause 62 The method of Clause 61, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
  • Clause 63 The method of Clause 61, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
  • Clause 64 The method of Clause 61, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 65 The method of Clause 61, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 66 The method of Clause 61, further comprising performing another inter-node sub-operation on the portion of the first data segment using the first thread, and performing another intra-node sub-operation on the portion of the second data segment using the second thread in parallel.
  • Clause 67 The method of Clause 61, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
  • Clause 68 One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
  • Clause 69 The one or more machine readable media of Clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
  • Clause 70 The one or more machine readable media of Clause 68, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
  • Clause 71 The one or more machine readable media of Clause 68, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 72 The one or more machine readable media of Clause 68, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 73 The one or more machine readable media of Clause 68, the acts further comprising performing another inter-node sub-operation on the portion of the first data segment using the first thread, and performing another intra-node sub-operation on the portion of the second data segment using the second thread in parallel.
  • Clause 74 The one or more machine readable media of Clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
  • Clause 75 A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
  • Clause 76 The first computing node of Clause 75, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
  • Clause 77 The first computing node of Clause 75, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
  • Clause 78 The first computing node of Clause 75, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 79 The first computing node of Clause 75, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
  • Clause 80 The first computing node of Clause 75, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.

Abstract

In distributed training, in order to avoid network congestion, a first computing node may determine an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology. The first computing node may then send the data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.

Description

NETWORK CONGESTION AVOIDANCE OVER HALVING-DOUBLING COLLECTIVE COMMUNICATION

BACKGROUND
With the explosive development of neural networks, such as deep neural networks (DNNs) , a variety of application domains (e.g., computer vision, natural language processing, speech recognition, etc. ) have been evolved and are taking advantage of the versatility and flexibility that are inherent in the neural networks. However, due to the increasing complexity and stricter accuracy requirements of neural network applications, sizes of neural network models and sizes of training data required for training the models have substantially increased, which would unavoidably lead to increasingly long training times and thus adversely affect the effectiveness and timeliness of the trained models to meet ever-changing application environments.
In order to reduce the times for training neural network models, a distributed training system, which employs parallel training, may be used. In general, a distributed training system may include a great number of computing nodes or servers distributed over a network, and assign subsets of computing tasks to the computing nodes or servers for performing computations in parallel training. However, data communications between computing nodes or servers in a distributed training system pose a lower bound or a bottleneck for an amount of reduction in a training time that may happen in the distributed training system. This is especially true when a distributed training system includes various types of heterogeneous  connections or inter-connects within and between computing nodes or servers, which exhibit different characteristics in terms of latency, bandwidth, topology, etc. Such heterogeneity in connections or inter-connects increases the difficulty and complexity in designing a network of data communications for the computing nodes or servers in the distributed training system.
Furthermore, network congestion may occur due to an excessive amount of data flows passing through a certain network switch or connection between computing nodes or servers in the distributed training system, which may lead to a prolonged training time due to a delay in processing training results. Such excessive amount of data flows that pass through a certain network switch or connection may be caused by a loss of control on path selection for routing data sent between computing nodes or servers.
BRIEF DESCRIPTION OF THE DRAWINGS
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit (s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
FIG. 1 illustrates an example environment in which a distributed training system may be used.
FIG. 2 illustrates an example computing node in more detail.
FIG. 3A illustrates a ring configuration that interconnects a predetermined number of nodes.
FIG. 3B shows a halving-doubling configuration that interconnects a predetermined number of nodes.
FIG. 4 shows a schematic diagram depicting an example collective communication library.
FIG. 5 shows an example topology-aware multi-phase algorithm.
FIG. 6 shows an example ring-based algorithm for a computing node in an intra-node reduce-scatter phase.
FIG. 7 shows an example halving-doubling algorithm for a computing node in an intra-node reduce-scatter phase.
FIG. 8 shows an example halving-doubling algorithm in an inter-node allreduce phase.
FIG. 9 shows the example halving-doubling algorithm in the inter-node allreduce phase in more detail.
FIG. 10 shows an example ring-based collective communication algorithm.
FIG. 11 shows an example scenario of performing an intra-node reduce-scatter phase, an inter-node allreduce phase, and an intra-node allgather phase in a parallel or overlapping manner.
FIG. 12 shows an example fat-tree network topology.
FIG. 13 shows an example scenario of using a first congestion avoidance approach.
FIG. 14 shows an example scenario of using a second congestion avoidance approach.
FIG. 15 shows an example topology aware multi-phase method.
FIG. 16 shows a first example network congestion avoidance method.
FIG. 17 shows a second example network congestion avoidance method.
FIG. 18 shows an example parallel method based on hybrid architecture in distributed training.
DETAILED DESCRIPTION
Overview
As noted above, existing distributed training systems suffer from a performance bottleneck that limits scalability due to data communications between computing nodes in the distributed training system. Furthermore, due to the diversity of network fabrics (which include, for example, Ethernet, InfiniBand, PCIe, NVLink, NVSwitch, QPI/UPI, etc.) and the high discrepancy in characteristics of the networks (such as latency, bandwidth, and topology, etc.), a distributed training system normally fails to make good use of such heterogeneous types of connections or inter-connects for performing collective data operations in and between the computing nodes and data transmission between the computing nodes. In addition, network congestion may occur due to a loss of control on path selection for routing data sent between computing nodes, resulting in an excessive amount of data flows passing through a certain network switch or connection between the computing nodes in the distributed training system, and leading to a prolonged training time due to a delay in processing training results. Moreover, existing distributed training systems fail to distinguish algorithms for different types of underlying fabrics in a collective operation, hence leading to poor performance.
This disclosure describes an example distributed training system. In implementations, the example distributed training system may employ a fabric-aware collective communication library that enables the distributed training system to scale linearly. In implementations, the collective communication library may customize communication algorithms based at least in part on analysis of underlying fabrics and supporting network architectures to attain a desired or maximum efficiency. In implementations, the distributed training system may divide primitive operations into a plurality of sub-operations, with each sub-operation using a type of fabric.
In implementations, the example distributed training system may implement a hybrid algorithm that allows a co-existence of multiple algorithms in a single collective operation, and selectively employ an algorithm for a particular fabric to enhance or maximize the efficiency for an entire communication path. In implementations, the distributed training system may adopt a two-process parallel algorithm that launches two concurrent processes and pipelines the use of intra-node and inter-node connections, thus improving the efficiency of communications by overlapping intra-node communications with inter-node communications.
In implementations, the example distributed training system may employ a probing-based routing control mechanism that generates mappings from connections to paths, and thereby distribute or scatter the connections to different aggregation or intermediate switches in a communication network by re-ranking  participants or processes in collective operations and mapping data flows across the distributed training system to particular physical links, thus avoiding network congestion.
The application describes multiple and varied embodiments and implementations. The following section describes an example framework that is suitable for practicing various implementations. Next, the application describes example systems, devices, and processes for implementing a distributed training system.
Example Environment
FIG. 1 illustrates an example environment 100 usable to implement a distributed training system. The environment 100 may include a distributed training system 102. In this example, the distributed training system 102 may include a plurality of computing nodes or servers 104-1, 104-2, …, 104-K (which are collectively called hereinafter as computing nodes 104) , where K is a positive integer greater than one. In implementations, the plurality of computing nodes 104 may communicate data with each other via a communication network 106.
The computing node 104 may be implemented as any of a variety of computing devices having computing/processing and communication capabilities, which may include, but not limited to, a server, a desktop computer, a notebook or portable computer, a handheld device, a netbook, an Internet appliance, a tablet computer, a mobile device (e.g., a mobile phone, a personal digital assistant, a smart phone, etc. ) , etc., or a combination thereof.
The communication network 106 may be a wireless or a wired network, or a combination thereof. The network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet). Examples of such individual networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs), Wide Area Networks (WANs), and Metropolitan Area Networks (MANs). Further, the individual networks may be wireless or wired networks, or a combination thereof. Wired networks may include an electrical carrier connection (such as a communication cable, etc.) and/or an optical carrier or connection (such as an optical fiber connection, etc.). Wireless networks may include, for example, a WiFi network and other radio frequency networks (e.g., Zigbee, etc.). In implementations, the communication network 106 may include a plurality of inter-node interconnects or switches 108-1, 108-2, …, 108-L (which are collectively called hereinafter as inter-node switches 108) for providing connections between the computing nodes 104, where L is a positive integer greater than one.
In implementations, the environment 100 may further include a client device 110. A user may instruct the distributed training system 102 to perform training on a particular learning model (such as a deep neural network model) based on data sent from the client device 110 to the distributed training system 102, for example.
Example Computing Node
FIG. 2 illustrates the computing node 104 in more detail. In  implementations, the computing node 104 may include, but is not limited to, one or more processing units 202, an input/output (I/O) interface 204, and/or one or more network interfaces 206, and memory 208. In implementations, the computing node 104 may further include one or more intra-node interconnects or switches 210.
In implementations, the processing units 202 may be configured to execute instructions that are stored in the memory 208, and/or received from the input/output interface 204, and/or the network interface 206. In implementations, the processing units 202 may be implemented as one or more hardware processors including, for example, a microprocessor, an application-specific instruction-set processor, a physics processing unit (PPU) , a central processing unit (CPU) , a graphics processing unit, a digital signal processor, a tensor processing unit, etc. Additionally or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs) , application-specific integrated circuits (ASICs) , application-specific standard products (ASSPs) , system-on-a-chip systems (SOCs) , complex programmable logic devices (CPLDs) , etc.
The memory 208 may include machine readable media in a form of volatile memory, such as Random Access Memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM. The memory 208 is an example of machine readable media.
The machine readable media may include a volatile or non-volatile type, a removable or non-removable media, which may achieve storage of  information using any method or technology. The information may include a machine readable instruction, a data structure, a program module or other data. Examples of machine readable media include, but not limited to, phase-change memory (PRAM) , static random access memory (SRAM) , dynamic random access memory (DRAM) , other types of random-access memory (RAM) , read-only memory (ROM) , electronically erasable programmable read-only memory (EEPROM) , quick flash memory or other internal storage technology, compact disk read-only memory (CD-ROM) , digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media, which may be used to store information that may be accessed by a computing node. As defined herein, the machine readable media does not include any transitory media, such as modulated data signals and carrier waves.
In implementations, the network interfaces 206 may be configured to connect the computing node 104 to other computing nodes via the communication network 106. In implementations, the network interfaces 206 may be established through a network interface controller (NIC) , which may employ both hardware and software in connecting the computing node 104 to the communication network 106. In implementations, each type of NIC may use a different type of fabric or connector to connect to a physical medium associated with the communication network 106. Examples of types of fabrics or connectors may be found in the IEEE 802 specifications, and may include, for example, Ethernet (which is defined in 802.3) , Token Ring (which is defined in 802.5) , and wireless networking (which is defined in 802.11) , an InfiniBand, etc.
In implementations, the intra-node switches 210 may include various types of interconnects or switches, which may include, but are not limited to, a high-speed serial computer expansion bus (such as PCIe, etc.), a serial multi-lane near-range communication link (such as NVLink, which is a wire-based communications protocol for serial multi-lane near-range communication, for example), a switch chip with a plurality of ports (e.g., an NVSwitch, etc.), a point-to-point processor interconnect (such as an Intel QPI/UPI, etc.), etc.
Although in this example, only hardware components are described in the computing node 104, in other instances, the computing node 104 may further include other hardware components and/or other software components, such as program modules 212 to execute instructions stored in the memory 208 for performing various operations, and program data 214 for storing data received for training, intermediate and final results calculated during training, etc.
Example Collective Communication Algorithms
FIGS. 3A and 3B illustrate example collective communication algorithms that may be used in the distributed training system 102. In implementations, collective communication algorithms may include, but are not limited to, a ring-based communication algorithm, a halving-doubling communication algorithm, etc.
FIG. 3A shows a ring configuration that interconnects a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than one) with multiple connections (i.e., N connections), and divides data (e.g., a data packet or message) into a plurality of data chunks (i.e., N data chunks) for transmission. This configuration needs a number of steps (in this example, N – 1 steps) of communications to complete a collective operation. In each step, a node may receive data from one of its neighboring nodes, conduct a specific operation on the received data to obtain a local result, and forward the received data to the other of the neighboring nodes. After N – 1 steps, each node in the ring has data from the other nodes of the ring, and a final result is scattered to all nodes, which needs another N – 1 steps for broadcasting respective local results. For each node, a total data size that is forwarded is 2S, where S denotes a data size or message size.
FIG. 3B shows a halving-doubling configuration that interconnects a predetermined number of nodes (e.g., N nodes, where N is a positive integer greater than one). In this halving-doubling configuration, the nodes communicate with each other in a pair-wise manner, with only N/2 connections needed in each step of communication. In the first step, adjacent nodes are paired together, send one half of a message or data to respective peer nodes, and receive the other half of the message or data for processing. Therefore, intermediate results may be scattered to the peer nodes. In subsequent steps, new pairs are formed with increased or doubled distance, and a data size for processing is halved. After log₂N steps of communication, results are distributed among all the nodes in the halving-doubling configuration. Local results in the nodes are then broadcast to other nodes through additional log₂N steps of communication.
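As a rough illustration of this pairing pattern, the following is a minimal sketch in Python (not part of the described system; the variable names and the power-of-two node count are assumptions): the partner of rank r at step s is r XOR 2^s, and the data handled per pair is halved at every step.

```python
# Minimal sketch of the halving-doubling pairing schedule (illustration only).
N = 8          # number of nodes; assumed to be a power of two
S = 1024       # total message or data size per node (arbitrary units)

num_steps = N.bit_length() - 1          # log2(N) steps for a power-of-two N
for s in range(num_steps):
    size = S >> (s + 1)                 # half, quarter, eighth, ... of the data
    pairs = sorted({tuple(sorted((r, r ^ (1 << s)))) for r in range(N)})
    print(f"step {s}: {len(pairs)} pairwise connections, "
          f"{size} units exchanged per pair: {pairs}")
```

For N = 8 this prints three steps with four concurrent pairs each, i.e., only N/2 connections per step, matching the description above.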
Example Collective Communication Library
FIG. 4 shows a schematic diagram depicting an example collective communication library 400 that may be employed by the distributed training system 102. In implementations, the collective communication library 400 is a communication library designed to provide high performance, high extensibility, and strong availability, and may be configured to provide support not only for standard collective operations such as allreduce and allgather operations, but also for other self-defined operations for customized applications. In implementations, the collective communication library 400 may take different types of interconnects or switches with distinct characteristics (e.g., in terms of latency, bandwidth, topology, etc.) into account, and provide a mechanism to collect information of underlying hardware in a network and computing nodes, thus enabling a topology-aware algorithm design to be developed based on one or more pieces of this collected information.
In implementations, the collective communication library 400 may provide flexibility for allowing multiple algorithms to be performed in a single operation, and improve the performance (e.g., the performance of communications and training, etc.) by exploiting parallelism between intra-node communications and inter-node communications. Additionally, the collective communication library 400 may make use of multiple NICs in a computing node with conventional or new mapping algorithms, and eliminate network congestion through a topology-aware arrangement of connections.
In implementations, the collective communication library 400 may include a software stack 402. In implementations, the software stack 402 may include a plurality of components, which may include, but are not limited to, a  transport component 404, an operation component 406, a communicator component 408, and a library context component 410. In implementations, the software stack 402 may be designed in a modular manner to allow generality and extensibility.
In implementations, the transport component 404 may be responsible for peer-to-peer (P2P) data transfers or transmissions in intra-node and inter-node communications. By way of example and not limitation, the collective communication library 400 may support TCP (Transmission Control Protocol) and RDMA (Remote Direct Memory Access) for inter-node communication, and P2P fabrics for intra-node communication, such as PCIe (Peripheral Component Interconnect Express) , NVLink/NVSwitch, and QPI/UPI (Quick Path Interconnect/Ultra Path Interconnect) , etc. For RDMA communications, the transport component 404 may further be configured to manage memory regions (MRs) and corresponding memory buffers in both processing units (such as graphics processing unit (GPU) devices) and host memories.
In implementations, the operation component 406 may provide a set of basic operations and a variety of networking algorithms. For example, the basic operations may be configured with algorithms that are supported by the collective communication library 400. In addition, the operation component 406 may allow a user definition of a new operation based on these basic operations to implement a heterogeneity-aware operation that may adopt an optimal or better algorithm for each type of fabric.
In implementations, the communicator component 408 may be  associated with a software process, and may be configured to perform manipulations and processing on a processing unit (such as a GPU device) . The communicator component 408 may keep or record information about other peers (e.g., rank IDs, IP addresses, etc. ) , and maintain connections with the peers. In implementations, the communicator component 408 may further collect intra-node and inter-node topology information, and use this information to guide an algorithm design. In implementations, intra-node information may include, but is not limited to, a type of interconnect, a distance between locations of processing units, a distance between a processing unit and a network interface controller, etc. In implementations, inter-node information may include, but is not limited to, the number of available network interface controllers, a topology of a cluster or computing nodes, locations of computing nodes in the cluster, for example.
In implementations, the library context component 410 may be configured to expose one or more application interfaces for setting system configurations (such as environment variables, for example), managing the communicator component 408, and providing other functionalities such as logging, etc.
Additionally, in some instances, the collective communication library 400 may further include or provide a plurality of tools and utilities 412 for topology-awareness design, testing and evaluation, and availability improvement. By way of example and not limitation, the tools and utilities 412 may include performance testing tools for the transport component 404 to provide assistance for algorithm designs and evaluations, a probing-based routing mechanism for ensuring the availability of the system, and other functionalities, such as a device management function that is extendable to support devices other than GPUs, for example.
Example Topology-Aware Multi-Phase Algorithm for Collective Communication
In implementations, a collective communication may be defined as a communication that involves a group of processing units or processes, and an operation of collective communication may be executed by all the processing units or processes included in the group together. Examples of an operation of collective communication may include, but are not limited to, an allreduce operation, an allgather operation, a reduce-scatter operation, etc. In implementations, an allreduce operation is one of a number of important primitives of collective communication in distributed training, and involves performing a reduction on data across processes in a group. Examples of the reduction may include, but are not limited to, an operation of summation, an operation of obtaining an average, an operation of obtaining a maximum, an operation of obtaining a minimum, etc.
By way of example and not limitation, an allreduce operation is used herein as an example to illustrate how a collective operation may be divided into a plurality of micro-operations or sub-operations. In implementations, the distributed training system 102 may employ a topology-aware multi-phase algorithm that divides an allreduce operation into multiple micro-operations or sub-operations, and selectively pick one or more micro-operations or sub-operations on demand, thus reducing an amount of data that is transferred by eliminating micro-operations or sub-operations that may not be needed. In implementations, the distributed training  system 102 may decouple collective communication algorithms from micro-operations or sub-operations, and allow an independent or separate matching between algorithms and micro-operations or sub-operations based on underlying fabric information, thus maximizing or optimizing the bandwidth utilization with a lesser amount of data transferred.
FIG. 5 shows an example topology-aware multi-phase algorithm 500 that may be employed by the distributed training system 102. In implementations, the topology-aware multi-phase algorithm 500 may include a plurality of phases, for example, an intra-node reduce-scatter phase 502, an inter-node allreduce phase 504, and an intra-node allgather phase 506.
In implementations, the distributed training system 102 may first assign respective portions of data to be processed for training to multiple computing nodes 104, so that each computing node 104 of the multiple computing nodes 104 may receive a respective portion of data. In implementations, each computing node 104 may divide the respective portion of data into multiple data pieces (e.g., N data pieces, where N is a positive integer), and assign these multiple data pieces to multiple local processing units or processes (e.g., N local processing units or processes) that are included in the respective computing node 104.
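A minimal sketch of this two-level assignment follows (illustrative only; the even split, the NumPy usage, and the variable names are assumptions rather than the disclosed implementation):

```python
import numpy as np

data = np.arange(64, dtype=np.float32)   # data to be processed for training
K = 4                                    # number of computing nodes (assumed)
N = 4                                    # local processing units per node (assumed)

per_node = np.array_split(data, K)       # respective portion for each computing node
per_unit = [np.array_split(portion, N)   # data pieces for the local processing units
            for portion in per_node]

# per_unit[k][n] is the data piece assigned to processing unit n of computing node k
print(per_unit[0][0])                    # first piece on the first computing node
```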
In implementations, in the intra-node reduce-scatter phase 502, each of the local processing units or processes included in each computing node 104 may divide a data piece assigned thereto into a plurality of data chunks (e.g., M chunks) . The local processing units or processes included in each computing node 104 may then collaboratively perform an intra-node reduce-scatter sub-operation to obtain  allreduce results of the plurality of data chunks in the respective computing node 104 according to a particular collective communication algorithm in a number of steps or iterations. At the end of the intra-node reduce-scatter phase 502, local processing units or processes included in a computing node 104 may have reduced results (or called reduce-scatter results) of all the processing units or processes included in that computing node 104 in different data chunks.
By way of example and not limitation, two example collective communication algorithms, namely, a ring-based algorithm and a halving-doubling algorithm, are described herein as examples of the particular collective communication algorithm to illustrate specific mechanisms or operations in the intra-node reduce-scatter phase 502. Nevertheless, other collective communication algorithms may be used in this intra-node reduce-scatter phase 502. The distributed training system 102 may select the particular collective communication algorithm used in the intra-node reduce-scatter phase 502 based on information about a number of factors collected by the collective communication library 400, for example. In implementations, the number of factors may include, but are not limited to, types of interconnects between processing units (or processes) in a computing node, the number of interconnects in the computing node, etc.
For example, in the intra-node reduce-scatter phase 502, the distributed training system 102 may employ a first collective communication algorithm for a first computing node, and employ a second collective communication algorithm for a second computing node having same or different processing and connection capabilities with the first computing node, where the first collective  communication algorithm may or may not be the same as the second collective communication algorithm. By way of example and not limitation, the distributed training system 102 may employ a halving-doubling algorithm for a computing node that uses NVSwitch or PCIe as interconnects and includes a number of processing units or processes that are used for training to be a power of two, and may employ a ring-based algorithm for another computing node using NVLink or others as interconnects and using a number of processing units or processes that is not a power of two for training, etc.
FIG. 6 shows an example ring-based algorithm 600 for a computing node in the intra-node reduce-scatter phase 502. For the sake of simplicity and description, the example ring-based algorithm includes a configuration of one ring only. Nevertheless, any ring-based algorithm including a configuration of more than one ring, with each ring processing a portion of data chunk, for example, may be used.
In this example, the computing node is described to include M processing units or processes (with rank identifiers or numbers 1, 2, …, M), and data assigned to each processing unit or process is divided into M data chunks. At the first step, a processing unit or process (e.g., P1) may send one of its M data chunks to a next processing unit or process (e.g., P2) in the ring, receive another data chunk from a previous processing unit or process (e.g., PM) in the ring, and reduce the received data chunk with a corresponding local data chunk to obtain a partial reduced result. At each subsequent step (e.g., at the k-th step), the processing unit or process (e.g., P1) may send a partial reduced result (in this example, a partial reduced result obtained by P1 at the (k–1)-th step) to the next processing unit or process (e.g., P2) in the ring, receive a partial reduced result (in this example, a partial reduced result obtained by PM at the (k–1)-th step) from the previous processing unit or process (e.g., PM), and reduce the received partial reduced result with another local data chunk that has not previously been sent or reduced with other data.
As can be seen from FIG. 6, at each step, different data chunks may be received and reduced or sent by different processing units or processes in the computing node. Furthermore, each processing unit or process may send or receive and reduce different data chunks (or partial results) at different steps. At the end of the intra-node reduce-scatter phase 502 (i.e., after M –1 steps) , each processing unit or process may include a resulting data chunk that stores a reduced result of M respective data chunks of the M processing units or processes in that computing node. For example, after M –1 steps, a data chunk of P1 “at the top position” stores a reduced result of all the data chunks of the M processing units or processes corresponding to “that top position” as shown in FIG. 6.
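The chunk rotation described above can be simulated with a short sketch (illustrative only, assuming summation as the reduction and a single ring; the indexing convention, under which rank r sends chunk (r − k) mod M at step k and accumulates into chunk (r − k − 1) mod M, is one common way to realize the schedule in FIG. 6, not necessarily the library's own):

```python
import numpy as np

M = 4                                    # processing units (ranks) in the computing node
# rank r starts with M chunks; element values 10*r + c make the result easy to verify
chunks = [[np.full(2, 10 * r + c, dtype=np.float32) for c in range(M)]
          for r in range(M)]

for k in range(M - 1):                   # M - 1 steps of the intra-node reduce-scatter
    sent = [chunks[r][(r - k) % M].copy() for r in range(M)]   # what each rank sends
    for r in range(M):
        prev = (r - 1) % M               # receive from the previous rank in the ring
        idx = (r - k - 1) % M            # chunk being reduced at this step
        chunks[r][idx] += sent[prev]     # reduce received chunk into the local chunk

# after M - 1 steps, rank r holds the fully reduced chunk (r + 1) % M
for r in range(M):
    idx = (r + 1) % M
    assert np.allclose(chunks[r][idx], sum(10 * s + idx for s in range(M)))
    print(f"rank {r} holds reduced chunk {idx}: {chunks[r][idx]}")
```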
FIG. 7 shows an example halving-doubling algorithm 700 for a computing node in the intra-node reduce-scatter phase 502. In this example, the computing node is described to include M processing units or processes (M is set as eight in this example for illustration). At the first step, a processing unit or process (e.g., P1) may send one half of data allocated thereto to another processing unit or process (e.g., P2) nearby, receive one half of data allocated to the other processing unit or process (i.e., P2), and reduce the received data with another half of data allocated to the processing unit or process (e.g., P1) to obtain a partial reduced result. At each subsequent step, the processing unit or process (e.g., P1) may send one half of a partial reduced result that is locally obtained at a previous step to a different processing unit or process that is located at an increasingly further distance from the processing unit or process (i.e., P1), receive a corresponding half of a partial reduced result from that processing unit or process, and reduce the received partial reduced result with another half of the partial reduced result that is locally obtained at the previous step to obtain a new partial reduced result for the processing unit or process (i.e., P1). At the end of the intra-node reduce-scatter phase 502 (i.e., after log₂M steps, i.e., 3 steps in this example as shown in FIG. 7), each processing unit or process may include a resulting data chunk that stores a reduced result of M respective data chunks of the M (in this example, eight as shown in FIG. 7) processing units or processes in that computing node. For example, after log₂M steps, a data chunk of P1 “at the bottom position” stores a reduced result of all the data chunks of the M (in this example, eight as shown in FIG. 7) processing units or processes corresponding to “that bottom position” as shown in FIG. 7.
In implementations, in the inter-node allreduce phase 504, an inter-node allreduce sub-operation is node-based (i.e., between different computing nodes) , and may be performed between processing units (or processes) included in different computing nodes. In implementations, processing units (or processes) of different computing nodes holding a same data chunk of reduced results (or reduce-scatter results) are formed into a same group, and communicate respective results with each other in the same group to perform an inter-node allreduce sub-operation. At the end of the inter-node allreduce phase 504, each processing unit or process of each computing node in a certain group may possess a particular data chunk of  reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes in the different groups.
In implementations, the distributed training system 102 may select a particular collective communication algorithm based on one or more selection criteria, and may implement inter-node allreduce sub-operations based on the selected collective communication algorithm. Examples of the particular collective communication algorithm may include, but are not limited to, a ring-based algorithm (such as a hierarchical ring algorithm, a multi-ring algorithm, etc.), a halving-doubling algorithm, etc. In implementations, the one or more selection criteria may include, but are not limited to, a topology of a communication network (e.g., the communication network 106) connecting the computing nodes, the number of switches used in the communication network, types of switches used in the communication network, a network type of the communication network, etc.
By way of example and not limitation, two example collective communication algorithms, namely, a ring-based algorithm and a halving-doubling algorithm, are described herein as examples of the particular collective communication algorithm to illustrate specific mechanisms or operations in the inter-node allreduce phase 504. Nevertheless, other collective communication algorithms may be used in this inter-node allreduce phase 504 based on the one or more selection criteria described above.
FIGS. 8 and 9 show an example halving-doubling algorithm in the inter-node allreduce phase 504. In this example, for the sake of simplicity and  description, the distributed training system 102 is described to include a plurality of computing nodes (i.e., Node 0, Node 1, Node 2, …Node N-1, where N is shown as four in FIG. 8 for illustration) , with each computing node including eight processing units or processes with corresponding rank numbers (namely, Rank 0, Rank 1, Rank 2, …Rank M-1, where M is shown as eight in FIG. 8 for illustration) as shown in FIG. 8. As shown in FIG. 8, processing units or processes having a same rank number in corresponding computing nodes include a same data chunk of reduced results (or reduce-scatter results) , and are formed into a same group. For example, processing units or processes having a rank number 0 in corresponding computing nodes include a data chunk of reduced results at the first position among respective local data chunks, and are formed into a same group (e.g., group 0) . In implementations, processing units or processes in different groups may not communicate with each other.
In implementations, an inter-node allreduce sub-operation may be separately performed between processing units (or processes) in each group, so that each processing unit (or process) in a group may obtain all reduced results of a same data chunk of all processing units (or processes) in the same group. Similar to the mechanism of the halving-doubling algorithm described for the intra-node reduce-scatter phase above, a processing unit or process in each group may iteratively send a local reduced result of a corresponding data chunk to other processing units or processes in the respective group, receive respective local reduced results of the corresponding data chunk from the other processing units or processes at doubled or increased distances, and perform a reduction operation on the received reduced results with a local reduced result.
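A hypothetical sketch of this grouping and of the pairwise schedule inside one group follows (the cluster shape, names, and helper function are assumptions for illustration):

```python
N_NODES, M_RANKS = 4, 8        # assumed cluster shape: 4 nodes, 8 local ranks per node

# group g contains the process with local rank g on every node; only members of the
# same group exchange data during the inter-node allreduce phase
groups = {g: [(node, g) for node in range(N_NODES)] for g in range(M_RANKS)}

def partners(node_id, n_nodes):
    # halving-doubling schedule inside one group: at step s, a node pairs with node XOR 2^s
    return [node_id ^ (1 << s) for s in range(n_nodes.bit_length() - 1)]

for g in (0, M_RANKS - 1):
    print(f"group {g}: members={groups[g]}")
    print(f"  node 0 partners per step: {partners(0, N_NODES)}")
```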
FIG. 9 shows an example scenario of applying a halving-doubling algorithm for eight computing nodes. In this example, as shown in FIG. 9, the number of steps performed in the inter-node allreduce phase 504 using the halving-doubling algorithm is log₂N = log₂8 = 3, where N is the number of computing nodes. At the first step, a first processing unit or process of a certain group (e.g., a processing unit or process with a rank number 0) in a first computing node (e.g., Node 0) may send a local reduced result thereof to a second processing unit or process of the same group in a second computing node (e.g., Node 1), receive a local reduced result from the second processing unit or process of the same group in the second computing node, and perform a reduction operation on the local reduced result thereof and the received local reduced result to obtain a new local reduced result.
At the second step, the first processing unit or process (e.g., the processing unit or process with the rank number 0) in the first computing node (e.g., Node 0) may send the new local reduced result thereof to a third processing unit or process of the same group (e.g., the rank number 0) in a third computing node (i.e., Node 2 in this example), receive a local reduced result from the third processing unit or process of the same group in the third computing node, and perform a reduction operation on the new local reduced result thereof and the received local reduced result to obtain another new local reduced result.
At the third (or final) step, same operations are performed for the first processing unit or process, but with a fourth processing unit or process of the same group in a fourth computing node (i.e., Node 4 in this example) at this time.
At the end of the inter-node allreduce phase 504, each processing unit or process of each computing node in a certain group may possess a particular data chunk of reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes in the different groups.
Similar to the halving-doubling algorithm, an inter-node allreduce sub-operation may be separately performed between processing units (or processes) of each group in a plurality of computing nodes (e.g., N computing nodes) using a ring-based algorithm, so that each processing unit (or process) in a group may obtain all reduced results of a same data chunk of all processing units (or processes) in the same group. Similar to the mechanism of the ring-based algorithm described for the intra-node reduce-scatter phase above, a processing unit or process of each group in a computing node may iteratively send a local reduced result of a corresponding data chunk to a processing unit or process of the respective group in a next computing node, receive a local reduced result of the corresponding data chunk from a processing unit or process of the respective group in a previous computing node, and perform a reduction operation on the received reduced result with its local reduced result. At the end of the inter-node allreduce phase 504 (i.e., after N –1 steps) , each processing unit or process of each computing node in a certain group may possess a particular data chunk of reduced results of all the processing units or processes in that same group, with processing units or processes of different groups possessing different data chunks of reduced results of respective processing units or processes  in the different groups.
In implementations, similar to the intra-node reduce-scatter phase 502, in the intra-node allgather phase 506, an allgather sub-operation may be performed across local processing units or processes in each computing node of the plurality of computing nodes of the distributed training system 102, to locally broadcast respective reduced results obtained in the inter-node allreduce phase 504 to each other in the same computing node. At the end of the intra-node allgather phase 506, each processing unit or process in each computing node of the distributed training system 102 may have a reduced result of the entire data that is distributed among the plurality of computing nodes.
By way of example and not limitation, a ring-based algorithm is used herein to illustrate how to broadcast reduced results that are obtained (in the inter-node allreduce phase 504) by processing units or processes locally in a computing node of the distributed training system 102. Nevertheless, the distributed training system 102 may employ different or same collective communication algorithms (such as the halving-doubling algorithm, etc. ) for different computing nodes. For example, the distributed training system 102 may employ different or same collective communication algorithms for different computing nodes based on a number of factors associated with each individual computing node. In implementations, the number of factors may include, but are not limited to, types of interconnects between processing units (or processes) in a computing node, the number of interconnects in the computing node, etc.
FIG. 10 shows an example ring-based collective communication algorithm 1000 used for broadcasting individual reduced results of processing units or processes to each other within a computing node of the distributed training system 102. As shown in FIG. 10, at the first step, each processing unit or process (e.g., P1) of M processing units or processes in the computing node may send its reduced result obtained in the inter-node allreduce phase 504 to one (e.g., P2 in this example) of two neighboring processing units or processes according to a ring configuration, and receive a reduced result from the other (e.g., PM in this example) of the two neighboring processing units or processes. At each subsequent step, each processing unit or process (e.g., P1) may send a newly received reduced result to one (e.g., P2 in this example) of the two neighboring processing units or processes according to the ring configuration, and receive another reduced result from the other (e.g., PM in this example) of the two neighboring processing units or processes. At the end of the intra-node allgather phase 506 (i.e., after M – 1 steps), each processing unit or process in the computing node may have the reduced results of all the processing units or processes in the computing node.
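The forwarding pattern of this allgather can be sketched as follows (illustrative only; each rank is modeled simply as a set of the chunk indices it currently holds):

```python
M = 4                                    # processing units in the computing node
have = [{r} for r in range(M)]           # rank r starts with its own reduced result r
forward = list(range(M))                 # the result each rank will send next

for step in range(M - 1):                # M - 1 steps of the intra-node allgather
    incoming = [forward[(r - 1) % M] for r in range(M)]   # receive from previous rank
    for r in range(M):
        have[r].add(incoming[r])         # keep the newly received reduced result
        forward[r] = incoming[r]         # and forward it at the next step

assert all(have[r] == set(range(M)) for r in range(M))
print(have)                              # every rank now holds all M reduced results
```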
Example Parallel Algorithm
In implementations, the distributed training system 102 may perform the plurality of phases included in the topology-aware multi-phase algorithm, i.e., the intra-node reduce-scatter phase 502, the inter-node allreduce phase 504, and the intra-node allgather phase 506, etc., sequentially. In implementations, the distributed training system 102 may alternatively partially or substantially overlap some of the intra-node reduce-scatter phase 502, the inter-node allreduce phase 504,  and the intra-node allgather phase 506, and perform some parts of these phases in parallel.
For instance, since the intra-node reduce-scatter phase 502 and the intra-node allgather phase 506 involve intra-node data communication or transmission (i.e., data communication or transmission within a computing node), and the inter-node allreduce phase 504 involves inter-node data communication or transmission (i.e., data communication or transmission between computing nodes), in implementations, the distributed training system 102 may allow at least parts of the intra-node reduce-scatter phase 502 and the inter-node allreduce phase 504 to be performed in parallel, and parts of the inter-node allreduce phase 504 and the intra-node allgather phase 506 to be performed in parallel, thereby improving the utilization of intra-node and inter-node links (or connections), and avoiding intra-node links from being idle while inter-node links are used, and vice versa.
FIG. 11 shows an example scenario of performing an intra-node reduce-scatter phase, an inter-node allreduce phase, and an intra-node allgather phase in a parallel or overlapping manner. As shown in FIG. 11, a processing unit or process of a computing node may divide a data chunk into multiple blocks (in this example, four blocks as shown in FIG. 11) , and distribute these blocks to at least two concurrent threads (e.g., a first thread 1102 and a second thread 1104) . In this way, the processing unit or process may pipeline intra-node and inter-node sub-operations for execution by the at least two concurrent threads (in this example, the first thread 1102 and the second thread 1104) .
By way of example and not limitation, the first thread 1102 may perform an inter-node allreduce sub-operation (i.e., an operation in the inter-node allreduce phase 504) on a first data block (e.g., a data block 1106) while the second thread 1104 performs an intra-node reduce-scatter sub-operation (i.e., an operation in the intra-node reduce-scatter phase 502) on a second data block (e.g., a data block 1108). Furthermore, the first thread 1102 may perform an intra-node allgather sub-operation (i.e., an operation in the intra-node allgather phase 506) on a third data block (e.g., a data block 1110), while the second thread 1104 performs an inter-node allreduce sub-operation on a fourth data block (e.g., a data block 1112).
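A minimal sketch of this two-thread pipeline is given below. The three stage functions are placeholders (assumptions) that merely sleep to stand in for transfers over intra-node and inter-node links; when the two threads run concurrently, one thread's inter-node stage can overlap the other thread's intra-node stages:

```python
import threading
import time

def intra_reduce_scatter(block):   # placeholder: would use intra-node links (e.g., NVLink)
    time.sleep(0.01)

def inter_allreduce(block):        # placeholder: would use inter-node links (e.g., RDMA NICs)
    time.sleep(0.02)

def intra_allgather(block):        # placeholder: would use intra-node links again
    time.sleep(0.01)

def worker(blocks, log):
    for b in blocks:               # walk each assigned block through the three phases
        intra_reduce_scatter(b); log.append(("reduce-scatter", b))
        inter_allreduce(b);       log.append(("allreduce", b))
        intra_allgather(b);       log.append(("allgather", b))

log = []
blocks = [0, 1, 2, 3]              # a data chunk divided into four blocks
first = threading.Thread(target=worker, args=(blocks[0::2], log))
second = threading.Thread(target=worker, args=(blocks[1::2], log))
first.start(); second.start(); first.join(); second.join()
print(log)   # stages from the two threads interleave, overlapping intra- and inter-node use
```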
By way of example and not limitation, another operation involved in distributed neural network training may be further used as an example. In implementations, the distributed training system 102 may divide an allgather operation involved in distributed neural network training into a plurality of sub-operations, namely, an inter-node allgather sub-operation, an intra-node allgather sub-operation, and a data copy sub-operation. In implementations, the inter-node allgather sub-operation may be similar to the inter-node allreduce sub-operation as described above, except that data (e.g., reduced results) is broadcast instead of a reduction operation (e.g., reducing received results with local results) being performed, whereas the intra-node allgather sub-operation may be similar or identical to the intra-node allgather sub-operation as described above. In implementations, the data copy sub-operation may include an operation of copying resulting data (e.g., final reduced results) as parameters for output.
In implementations, a processing unit or a process of a computing node may divide a data chunk into multiple blocks (e.g., four blocks) , and distribute  these blocks to at least two concurrent threads (e.g., a first thread and a second thread) , and pipeline intra-node and inter-node sub-operations for execution by the at least two concurrent threads.
For example, the first thread may perform an inter-node allgather sub-operation on a first data block while the second thread performs an intra-node allgather sub-operation on a second data block. Furthermore, the first thread may perform a data copy sub-operation on a third data block, while the second thread performs an inter-node allgather sub-operation on a fourth data block.
Example Congestion Avoidance Approaches
In implementations, due to data transmission among the plurality of computing nodes in the distributed training system 102, data or traffic congestion may happen at some switches or links in the communication network 106. In order to avoid congestion, the distributed training system 102 may adopt a predetermined congestion avoidance strategy to distribute or divert data traffic among various switches or links in the communication network 106, thus preventing an excessive amount of data from passing through a certain switch or link in the communication network 106 during training (e.g., during the inter-node allreduce sub-operation or phase, or the inter-node allgather sub-operation or phase).
In implementations, the distributed training system 102 may adopt a first congestion avoidance approach that includes a strategy on ring generation, followed by a routing management of network flows. Additionally or alternatively, the distributed training system 102 may adopt a second congestion avoidance approach that includes a strategy on a reordering of node identification, followed by a routing management of network flows. Depending on a type of network topology of the communication network 106, and processing and communication capabilities of the plurality of computing nodes 104, etc., the distributed training system 102 may select one or more of the first congestion avoidance approach or the second congestion avoidance approach for routing data flows between all or part of the plurality of computing nodes in the distributed training system 102. Furthermore, the distributed training system 102 may selectively combine parts of the first congestion avoidance approach and the second congestion avoidance approach to implement a new congestion avoidance approach. In implementations, both the first congestion avoidance approach and the second congestion avoidance approach may aim at specifying a dedicated network path for each direction of an inter-node data flow in a way that inter-node data flows have no or little conflict with each other.
In implementations, the distributed training system 102 may obtain or establish mapping relationships between communication connections and routing paths (e.g., physical links) in advance. In implementations, a connection-path data structure in a form of a table, a linked list, etc., may be created and used for storing information of the mapping relationships. In implementations, the distributed training system 102 may selectively or strategically use a specific path for establishing a connection between any two computing nodes based on the connection-path data structure.
In implementations, the distributed training system 102 may determine mapping relationships between communication connections and routing  paths by enabling each computing node of the distributed training system 102 to send probing data packets to other computing nodes through varying source/destination ports of the probing data packets, to exhaust possible communication connections between computing nodes of the distributed training system 102. Apparently, other methods of exploring mapping relationships between communication connections and routing paths may be employed by the distributed training system 102, which are not limited herein.
By way of example and not limitation, a first computing node may send a plurality of probing data packets to a second computing node, each probing data packet having a different combination of source and destination ports, while the source address and the destination address are an address of the first computing node and an address of the second computing node, respectively. Each probing data packet may record the switches through which the respective probing data packet passes, and thus the first computing node may know an entire routing path of the respective probing data packet for mapping when the respective probing data packet is returned to the first computing node. Accordingly, a connection-path data structure (e.g., in a form of a table) may be established between the first computing node and the second computing node. Similarly, mapping relationships between communication connections and routing paths (and hence connection-path data structures) for other pairs of computing nodes in the distributed training system 102 may be established accordingly.
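The probing step can be sketched as follows (illustrative only; send_probe is a hypothetical placeholder that would return the switches traversed by a probe with the given source/destination ports, here faked deterministically for demonstration):

```python
# Build a connection-path table by probing (illustration only).
def send_probe(dst_addr, src_port, dst_port):
    # Hypothetical placeholder: a real probe would return the switches it traversed.
    # Here the path is faked deterministically for demonstration purposes.
    aggregation = (src_port + dst_port) % 4
    return ["leaf-1", f"agg-{aggregation}", "leaf-2"]

connection_path = {}                          # (src_port, dst_port) -> list of switches
for src_port in range(10000, 10004):
    for dst_port in range(20000, 20004):
        connection_path[(src_port, dst_port)] = send_probe("node-2", src_port, dst_port)

# a specific source/destination port pair can later be chosen to pin a data flow to a path
print(connection_path[(10000, 20001)])        # e.g., ['leaf-1', 'agg-1', 'leaf-2']
```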
For the sake of simplicity and illustration, an example network topology, namely, a fat-tree network (or in particular a two-tier Clos network architecture in a full-mesh topology), is used herein as an example network topology of the communication network 106 that is associated with the distributed training system 102. Nevertheless, the example congestion avoidance strategies described herein may also be applicable to other network topologies.
FIG. 12 shows an example fat-tree network topology 1200. In this example, the example fat-tree network topology is a two-tier Clos network architecture in a full-mesh topology. One tier corresponds to a tier of leaf switches 1202 that are directly connected to computing nodes 1204, with each leaf switch 1202 being connected to one or more computing nodes 1204. In implementations, a computing node 1204 may include one or more network interface controllers (e.g., four network interface controllers) which are connected to one or more ports (e.g., four ports) of a leaf switch 1202. In implementations, the number of network interface controllers in each computing node 1204 may or may not be the same. Another tier corresponds to a tier of aggregation switches 1206 (or called spine switches 1206) that are connected to one or more leaf switches 1202.
In implementations, if two processing units or processes included in different computing nodes are connected under a same leaf switch, data packets that are transmitted between the two processing units or processes will pass through that same leaf switch without passing through any of the aggregation switches. Alternatively, if two processing units or processes included in different computing nodes are connected under different leaf switches, data packets that are transmitted between the two processing units or processes will pass through one of the aggregation switches. Using the connection-path data structure as described above, a data packet that is transmitted between the two processing units or processes can be made to flow through a specified aggregation switch by setting an appropriate combination of source and destination ports in the data packet. In implementations, the routing management of the first congestion avoidance approach and/or the second congestion avoidance approach may aim at enabling data flows from a same leaf switch to different destination leaf switches to pass through different aggregation switches, and/or data flows from different source leaf switches to a same destination leaf switch to pass through different aggregation switches, thus avoiding collisions between the data flows, and leading to no network congestion at the aggregation switches.
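Given such a connection-path table, port pairs can be chosen so that concurrent flows are spread over distinct aggregation switches; the sketch below is illustrative only (the table contents and the greedy selection are assumptions):

```python
# Choose port pairs so that flows toward different destination leaf switches use
# different aggregation switches (illustration only; table contents are assumed).
connection_path = {                      # (src_port, dst_port) -> aggregation switch used
    (10000, 20000): "agg-0", (10000, 20001): "agg-1",
    (10001, 20000): "agg-2", (10001, 20001): "agg-3",
}

def pick_ports(destination_leaves, table):
    used = set()                         # aggregation switches already reserved
    chosen = {}
    for leaf in destination_leaves:
        for ports, agg in table.items():
            if agg not in used:          # reserve a distinct aggregation switch per flow
                used.add(agg)
                chosen[leaf] = ports
                break
    return chosen

print(pick_ports(["leaf-2", "leaf-3"], connection_path))
# {'leaf-2': (10000, 20000), 'leaf-3': (10000, 20001)}  -> agg-0 and agg-1, no collision
```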
In implementations, as described in the foregoing description, the first congestion avoidance approach may include a strategy on ring generation, followed by a routing management of network flows. The first congestion avoidance approach may support a variety of ring-based algorithms, which include, but are not limited to, a ring algorithm, a ring chunked algorithm, a multi-ring algorithm, a hierarchical ring algorithm, an algorithm involving multiple hierarchical rings, and a node-aware ring algorithm, etc.
In implementations, the strategy on ring generation may include a topology-aware strategy on ring generation. By way of example and not limitation, the topology-aware strategy on ring generation may include a plurality of rules to build up a ring or ring configuration of processing units or processes. In implementations, a processing unit or process in a computing node may send/receive data to/from a processing unit or process in another computing node through a network interface controller. In implementations, a processing unit or process in a computing node may be associated with a single network interface controller or multiple network interface controllers for transmitting data to processing units or processes in other computing nodes. Additionally or alternatively, multiple processing units or processes may be associated with a single network interface controller, and employ that network interface controller for transmitting data to processing units or processes in other computing nodes.
In implementations, the plurality of rules may include, but are not limited to, priorities for a processing unit or process in a first computing node to select a neighboring processing unit or process, conditions for a network interface controller in a first computing node to send or receive data, conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node, etc.
In implementations, priorities for a processing unit or process in a first computing node to select a neighboring processing unit or process may include, in a descending order of priority: selecting a processing unit or process in the first computing node and using an inter-process communication if applicable; selecting a processing unit or process in a second computing node connected to a leaf switch that is the same as a leaf switch connected to the first computing node; and selecting a processing unit or process in a third computing node connected to a leaf switch that is different from the leaf switch connected to the first computing node, wherein the first computing node is different from the second computing node and the third computing node.
In implementations, conditions for a network interface controller in a first computing node to send or receive data may include, for example, the network interface controller being capable of sending data to a network interface controller in a second computing node only, and/or the network interface controller being capable of receiving data from a network interface controller in a third computing node only, where the first computing node is different from the second computing node and the third computing node, and the second computing node may or may not be the same as the third computing node.
In implementations, conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node may include, for example, routing data sent by processing units or processes belonging to multiple rings to the network interface controller in the second computing node if the data is sent through the network interface controller in the first computing node. In implementations, conditions for a network interface controller in a first computing node to route data to/from a network interface controller in a second computing node may further include receiving data through the network interface controller in the first computing node if the data is sent by processing units or processes belonging to multiple rings through the network interface controller in the second computing node.
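For illustration only, the following minimal Python sketch applies the neighbor-selection priorities described above when choosing the next hop of a ring; the dictionary fields ("node", "leaf", "id") are assumptions made for the sketch rather than structures defined in the described embodiments.

```python
def neighbor_priority(me, candidate):
    """Lower value means higher priority: same computing node first, then a node
    under the same leaf switch, then a node under a different leaf switch."""
    if candidate["node"] == me["node"]:
        return 0  # same computing node: inter-process communication is applicable
    if candidate["leaf"] == me["leaf"]:
        return 1  # different computing node connected to the same leaf switch
    return 2      # different computing node connected to a different leaf switch

def pick_next_in_ring(me, candidates):
    # Choose the highest-priority candidate as the next processing unit (or process) in the ring.
    return min(candidates, key=lambda c: neighbor_priority(me, c))

me = {"id": 0, "node": "n0", "leaf": "l0"}
candidates = [
    {"id": 1, "node": "n0", "leaf": "l0"},
    {"id": 2, "node": "n1", "leaf": "l0"},
    {"id": 3, "node": "n2", "leaf": "l1"},
]
next_hop = pick_next_in_ring(me, candidates)  # -> the process in the same node (id 1)
```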
In implementations, the routing management of the first congestion avoidance approach may assign a network interface controller (NIC) identifier to each network interface controller that is connected or linked to a same leaf switch. The routing management of the first congestion avoidance approach may further assign an aggregation identifier to each aggregation switch in the communication network 206. For a processing unit or process in a certain ring, the routing management may determine a routing identifier for routing a data packet from that processing unit or process.
For example, if a network interface controller of the processing unit or process and a network interface controller of a next processing unit or process in the ring are located in a same computing node or are directly connected or linked to a same leaf switch, a routing identifier may be determined as a default value or identifier. This default routing identifier indicates that data is either routed within a computing node or through a leaf switch, without passing through any aggregation switch in the communication network. Otherwise, the routing identifier may be determined to be equal to a NIC identifier of that processing unit or process, or another predefined value. Based on a mapping relationship between routing identifiers and aggregation identifiers, an aggregation identifier may be determined based on the determined routing identifier. In implementations, the mapping relationship between routing identifiers and aggregation identifiers may be determined in advance using a probing-based routing mechanism (e.g., sending probing data packets between computing nodes as described in the foregoing description), for example.
In other words, data flows between processing units (or processes) which are included in a same computing node or whose network interface controllers are connected to a same leaf switch will not go through any aggregation switch in the communication network. On the other hand, data flows between processing units (or processes) which are included in different computing nodes and whose network interface controllers are connected to different leaf switches will pass through a designated aggregation switch based on a predetermined mapping relationship, thus enabling routing control and management of data flows and distributing the data flows to different aggregation switches to avoid network congestion.
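For illustration only, the following minimal Python sketch summarizes the routing-identifier logic of the first congestion avoidance approach as described above; the default identifier value and the routing-identifier-to-aggregation-identifier mapping are hypothetical placeholders (in practice such a mapping may be established in advance by probing, as noted above).

```python
DEFAULT_ROUTING_ID = 0  # placeholder: same computing node or same leaf switch

# Hypothetical mapping from routing identifiers (NIC identifiers) to
# aggregation switch identifiers, assumed to be established in advance.
ROUTING_TO_AGGREGATION = {1: "agg1", 2: "agg2", 3: "agg3", 4: "agg4"}

def routing_id(src, dst):
    """src and dst describe a processing unit (or process) and its NIC, with
    assumed fields 'node', 'leaf', and 'nic_id'."""
    if src["node"] == dst["node"] or src["leaf"] == dst["leaf"]:
        return DEFAULT_ROUTING_ID      # routed within a node or through a leaf switch
    return src["nic_id"]               # otherwise use the sender's NIC identifier

def aggregation_for(src, dst):
    rid = routing_id(src, dst)
    if rid == DEFAULT_ROUTING_ID:
        return None                    # no aggregation switch is traversed
    return ROUTING_TO_AGGREGATION[rid] # designated aggregation switch for this flow
```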
FIG. 13 shows an example scenario of using the first congestion avoidance approach. In this example, four inter-node rings (or ring configurations, R0, R1, R2, and R3) involving eight computing nodes (Node 0, Node 1, …, Node 7) are generated, and each ring uses a different aggregation switch for sending and receiving data (e.g., during the inter-node allreduce phase 504) . Therefore, no conflict exists among these four rings. Furthermore, each leaf switch of any ring has only one data flow coming in, and one data flow coming out, thus avoiding an occurrence of network congestion.
In implementations, as described above, the second congestion avoidance approach may include a strategy on a reordering of node identification, followed by a routing management of network flows. In implementations, in order to minimize communication cost, the second congestion avoidance approach may reorder identifiers of computing nodes and processing units (or processes) according to a network topology connecting the computing nodes and processing units (or processes) based on a plurality of rules.
In implementations, the plurality of rules may include, for example, grouping computing nodes by respective leaf switches. For example, computing nodes connecting to a same leaf switch (e.g., computing nodes having network  interface controllers that are linked to a same leaf switch) are formed into one group, and each computing node is assigned with a node identifier. Since the computing nodes are connected to the same leaf switch, these computing nodes are (physically) adjacent to each other.
In implementations, the plurality of rules may further include assigning rank identifiers (or rank numbers) to each processing unit or process in the computing nodes using a same order sequence. For example, the k number of processing units (or processes) in a first computing node may be assigned with rank identifiers 0, 1, …, k – 1, and the k number of processing units (or processes) in a second computing node may be assigned with rank identifiers k, k + 1, …, 2k – 1, and so forth for other computing nodes. Processing units (or processes) in a computing node may be ordered according to respective network interface controllers that the processing units (or processes) use, and processing units (or processes) using a same network interface controller are (physically) adjacent to each other.
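For illustration only, the following minimal Python sketch reorders node identifiers and rank identifiers according to the two rules above, grouping computing nodes by leaf switch and ordering processes within each node by the network interface controller they use; the input fields ("name", "leaf", "processes", "nic") are assumptions made for the sketch.

```python
def reorder(nodes):
    """nodes is a list of dicts, each with assumed fields 'name', 'leaf' and
    'processes' (each process has assumed fields 'name' and 'nic').
    Returns node identifiers and rank identifiers assigned per the rules above."""
    # Group computing nodes by leaf switch so that nodes with adjacent node
    # identifiers are physically adjacent (connected to the same leaf switch).
    nodes = sorted(nodes, key=lambda n: n["leaf"])
    node_ids, rank_ids = {}, {}
    next_rank = 0
    for node_id, node in enumerate(nodes):
        node_ids[node["name"]] = node_id
        # Order processes by the NIC they use, so that processes sharing a NIC
        # receive adjacent rank identifiers.
        for proc in sorted(node["processes"], key=lambda p: p["nic"]):
            rank_ids[proc["name"]] = next_rank
            next_rank += 1
    return node_ids, rank_ids
```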
In this case, in the first log₂ L steps, data flows between processing units (or processes) in computing nodes may be restricted to go through respective leaf switches, which have better latency than the aggregation switches, and thus no network congestion is generated. In implementations, L is the number of computing nodes per leaf switch for a node-aware halving-doubling algorithm as described in the foregoing description. In implementations, L is a product of the number of computing nodes per leaf switch and the number of processing units (or processes) per computing node for a conventional halving-doubling algorithm.
In implementations, the routing management of the second congestion avoidance approach may include determining an aggregation identifier for a data flow or data packet sent from a first processing unit (or process) having a first rank identifier in a first computing node having a first node identifier to a second processing unit (or process) having a second rank identifier in a second computing node having a second node identifier, where the first computing node may or may not be the same as the second computing node.
In implementations, the aggregation identifier may be determined based at least in part on at least some of the rank identifier, the node identifier, the number of network interface controllers per computing node, and a maximum number of computing nodes at each leaf switch. By way of example and not limitation, the aggregation identifier may be determined as: the first rank identifier of the first processing unit (or process) from which the data flow or data packet is sent + (the first node identifier of the first computing node having the first processing unit (or process) % the maximum number of computing nodes at each leaf switch) × the number of network interface controllers per computing node, where % represents a modulus operator. Other means of calculating the aggregation identifier may be applicable provided that a consistent result is obtained. For example, the aggregation identifier may be determined based on a preset mapping relationship between aggregation identifiers and combinations of rank identifier and node identifier, etc.
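For illustration only, the example formula above can be written out as the following minimal Python sketch; the parameter names are descriptive stand-ins, and whether the rank identifier is taken globally or per computing node is not fixed here and would follow the convention used by the rest of the system, provided the result is consistent.

```python
def aggregation_id(rank_id, node_id, nics_per_node, max_nodes_per_leaf):
    # aggregation identifier = rank identifier
    #   + (node identifier % maximum number of computing nodes at each leaf switch)
    #     * number of network interface controllers per computing node
    return rank_id + (node_id % max_nodes_per_leaf) * nics_per_node

# Example with four NICs per node and at most two computing nodes per leaf switch:
print(aggregation_id(rank_id=1, node_id=3, nics_per_node=4, max_nodes_per_leaf=2))  # -> 5
```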
In implementations, the routing management of the second congestion avoidance approach may include assigning an aggregation identifier to each aggregation switch in the communication network 206 associated with the distributed training system 102 in advance. If the first processing unit (or process) and the second processing unit (or process) are linked to or under a same leaf switch (e.g., through respective network interface controllers), the data flow or data packet will pass through that leaf switch without passing through any aggregation switch in the communication network 206. If the first processing unit (or process) and the second processing unit (or process) are not linked to or under a same leaf switch, the data flow or data packet sent by the first processing unit (or process) to the second processing unit (or process) will pass through an aggregation switch having the determined aggregation identifier.
FIG. 14 shows an example scenario of using the second congestion avoidance approach. In this example, all computing nodes include the same number of processing units (or processes) and the same number of network interface controllers, with each network interface controller being associated with the same number of processing units (or processes). Furthermore, the number of network interface controllers linked to a leaf switch is less than the number of aggregation switches in the network. In this example, the number of network interface controllers per computing node is four, and the maximum number of computing nodes at each leaf switch is two. In implementations, the number of computing nodes under a same leaf switch may be a power of two for a node-aware halving-doubling algorithm, while for a conventional halving-doubling algorithm the number of network interface controllers included in computing nodes under a same leaf switch may be a power of two and the number of processing units (or processes) using a same network interface controller may be a power of two.
During an inter-node allreduce phase in a node-aware halving-doubling algorithm in this example, processing units (or processes) of computing nodes (Node 0, Node 2, Node 4, and Node 6) will use aggregation switches with aggregation identifiers (A1, A2, A3, and A4, for example) , and processing units (or processes) of computing nodes (Node 1, Node 3, Node 5, and Node 7) will use aggregation switches with aggregation identifiers (A5, A6, A7, and A8, for example) . Accordingly, no collision exists among data flows between computing nodes, thus avoiding network congestion at any aggregation switch in the network.
In implementations, at each step of the inter-node allreduce phase, a processing unit (or process) may communicate data with a new processing unit (or process). In implementations, synchronization may be performed to ensure that a data flow conducted at a current step by the processing unit (or process) using a network interface controller does not overlap with a data flow conducted at a previous step by a neighboring processing unit (or process) using the same network interface controller, thus avoiding an occurrence of incast and the resulting incast congestion.
Example Methods
FIG. 15 shows a schematic diagram depicting an example topology aware multi-phase method. FIG. 16 shows a schematic diagram depicting a first example network congestion avoidance method. FIG. 17 shows a schematic diagram depicting a second example network congestion avoidance method. FIG. 18 shows a schematic diagram depicting an example parallel method based on hybrid architecture in distributed training. The methods of FIGS. 15-18 may, but need not, be implemented in the environment of FIG. 1, using the computing node of FIG. 2, with the help of the methods and scenarios of FIGS. 3-14. For ease of explanation, methods 1500 – 1800 are described with reference to FIGS. 1-14. However, the methods 1500 – 1800 may alternatively be implemented in other environments and/or using other systems.
The methods 1500 – 1800 are described in the general context of machine-executable instructions. Generally, machine-executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each of the example methods is illustrated as a collection of blocks in a logical flow graph representing a sequence of operations that can be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or alternate methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the recited operations.
Referring back to FIG. 15, at block 1502, a first computing node (e.g., the computing node 104) may perform reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective communication algorithm.
In implementations, prior to performing the reduce-scatter sub-operations, the first computing node may select the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node. In implementations, the first collective communication algorithm may include, but is not limited to, a ring-based algorithm, or a halving-doubling algorithm.
In implementations, performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm may include dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
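For illustration only, the following single-process Python sketch imitates the reduce-scatter sub-operation using a ring schedule: the data of each processing unit is divided into one chunk per unit, chunks circulate, and each unit ends up holding the fully reduced value of one assigned chunk. The ring schedule, summation as the reduction, and evenly divisible chunk sizes are simplifying assumptions; the first collective communication algorithm may equally be a halving-doubling algorithm.

```python
def ring_reduce_scatter(buffers):
    """buffers[i] is the list of numbers held by processing unit i; the length
    of each buffer is assumed to be divisible by the number of units p.
    Returns chunks such that unit r holds the fully reduced chunk (r + 1) % p."""
    p = len(buffers)
    chunk_len = len(buffers[0]) // p
    # Divide each unit's data into p chunks (one chunk per processing unit).
    chunks = [[buf[i * chunk_len:(i + 1) * chunk_len] for i in range(p)] for buf in buffers]
    for step in range(p - 1):
        snapshot = [[list(c) for c in unit] for unit in chunks]  # values at the start of the step
        for rank in range(p):
            src = (rank - 1) % p            # receive from the previous unit in the ring
            idx = (rank - step - 1) % p     # index of the chunk circulating to this unit
            received = snapshot[src][idx]
            # Reduce the received chunk with the local chunk.
            chunks[rank][idx] = [a + b for a, b in zip(chunks[rank][idx], received)]
    return chunks

# Two units, two chunks: unit 0 ends with reduced chunk 1, unit 1 with reduced chunk 0.
print(ring_reduce_scatter([[1, 2], [3, 4]]))  # -> [[[1], [6]], [[4], [4]]]
```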
At block 1504, the first computing node may perform allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a  second collective communication algorithm.
In implementations, prior to performing the allreduce sub-operations, the first computing node may select the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes. In implementations, the second collective communication algorithm may include, but is not limited to, a ring-based algorithm, a halving-doubling algorithm (such as a node-aware halving-doubling algorithm), etc.
In implementations, performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm may include receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
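For illustration only, the following minimal Python sketch shows the pairing step described above, in which each processing unit of the first computing node exchanges its reduce-scatter portion with the corresponding processing unit of the second computing node and reduces the two portions; the exchange callable is a hypothetical stand-in for the actual inter-node transport.

```python
def pairwise_allreduce_step(local_portions, exchange):
    """local_portions[i] is the portion held by processing unit i after the
    intra-node reduce-scatter; exchange(i, data) is assumed to send data to the
    paired processing unit i in the other computing node and return that peer's
    corresponding portion."""
    reduced = []
    for i, portion in enumerate(local_portions):
        remote = exchange(i, portion)                             # receive the peer's portion
        reduced.append([a + b for a, b in zip(portion, remote)])  # reduce with the local portion
    return reduced
```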
At block 1506, the first computing node may perform allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
In implementations, performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm may include receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Referring to FIG. 16, at block 1602, a first computing node (e.g., the computing node 104) or a first process may determine a routing identifier for routing data from the first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch.
In implementations, the first process and the second process may belong to a particular inter-node ring that connects a plurality of different nodes under a particular network topology. By way of example and not limitation, the particular network topology may include a fat-tree topology.
In implementations, the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
In implementations, the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
In implementations, the routing identifier may be set or determined as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
In implementations, the routing identifier may be set or determined to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
At block 1604, the first computing node or the first process may route the data from the first process to the second process according to the routing identifier.
In implementations, routing the data from the first process to the second process according to the routing identifier may include routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier  of the network interface controller associated with the first process.
Referring to FIG. 17, at block 1702, a first computing node (e.g., the computing node 104) or a first process may determine an aggregation identifier for sending a data packet from the first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology.
In implementations, the first computing node may assign different aggregation identifiers for data packets directed to computing nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
In implementations, the first computing node may assign a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship. In implementations, the correspondence relationship may record a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs. In implementations, the particular network topology may include a fat-tree topology.
At block 1704, the first computing node may send a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
In implementations, the first computing node may further send  respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
In implementations, the first computing node may further receive data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Referring to FIG. 18, at block 1802, a first computing node (e.g., the computing node 104) or a processing unit may divide a data chunk assigned to the processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment.
At block 1804, the first computing node or the processing unit may assign the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread.
At block 1806, the first computing node or the processing unit may perform an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
In implementations, performing the intra-node sub-operation on the portion of the first data segment using the first thread may include transmitting the  portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
In implementations, performing the inter-node sub-operation on the portion of the second data segment using the second thread may include transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
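For illustration only, the following minimal Python sketch uses two threads to overlap an intra-node sub-operation on a portion of the first data segment with an inter-node sub-operation on a portion of the second data segment; intra_node_op and inter_node_op are hypothetical callables standing in for the actual intra-node and inter-node transfers.

```python
import threading

def parallel_hybrid_step(data_chunk, intra_node_op, inter_node_op):
    # Divide the data chunk assigned to the processing unit into two data segments.
    mid = len(data_chunk) // 2
    segment_1, segment_2 = data_chunk[:mid], data_chunk[mid:]

    # First thread: intra-node sub-operation on (a portion of) the first segment.
    t1 = threading.Thread(target=intra_node_op, args=(segment_1,))
    # Second thread: inter-node sub-operation on (a portion of) the second segment.
    t2 = threading.Thread(target=inter_node_op, args=(segment_2,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    # In a subsequent step the roles can be swapped, so that the intra-node and
    # inter-node connections are both kept busy throughout the operation.
```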
In implementations, the intra-node sub-operation may include a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation may include an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
In implementations, the intra-node sub-operation may include an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation may include an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
In implementations, the first computing node or the processing unit may perform another inter-node sub-operation on the portion of the first data segment using the first thread, and performing another intra-node sub-operation on the portion of the second data segment using the second thread in parallel.
In implementations, performing the intra-node sub-operation on the  portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
Although the above method blocks are described to be executed in a particular order, in some implementations, some or all of the method blocks can be executed in other orders, or in parallel.
Conclusion
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
The present disclosure can be further understood using the following clauses.
Clause 1: A method implemented by a first computing node, the method comprising: performing reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective  communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
Clause 2: The method of Clause 1, further comprising selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node.
Clause 3: The method of Clause 1, further comprising selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 4: The method of Clause 1, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
Clause 5: The method of Clause 1, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit  of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 6: The method of Clause 1, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
Clause 7: The method of Clause 1, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first  collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 8: One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: performing reduce-scatter sub-operations between a first plurality of processing units in the first computing node according to a first collective communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
Clause 9: The one or more machine readable media of Clause 8, the acts further comprising selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node.
Clause 10: The one or more machine readable media of Clause 8, the acts further comprising selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 11: The one or more machine readable media of Clause 8, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
Clause 12: The one or more machine readable media of Clause 8, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 13: The one or more machine readable media of Clause 8, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units,  the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
Clause 14: The one or more machine readable media of Clause 8, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 15: A first computing node comprising: a first plurality of processing units; and memory storing machine executable instructions that, when executed by the first plurality of processing units, cause the first plurality of processing units to perform acts comprising: performing reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to a first collective communication algorithm; performing allreduce sub-operations between the first plurality of processing units in the first computing node and a second plurality of processing units in a second computing node according to a second collective communication algorithm; and performing allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm.
Clause 16: The first computing node of Clause 15, the acts further comprising: selecting the first collective communication algorithm based at least in part on a type or a bandwidth of intra-node connections between the first plurality of processing units in the first computing node; and selecting the second collective communication algorithm based at least in part on a type or a bandwidth of inter-node connections between the first computing node and other computing nodes, and/or a connection topology of the first computing node and the other computing nodes.
Clause 17: The first computing node of Clause 15, wherein the first collective communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.
Clause 18: The first computing node of Clause 15, wherein performing the reduce-scatter sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: dividing data into a plurality of data chunks; assigning the plurality of data chunks to the first plurality of processing units; receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 19: The first computing node of Clause 15, wherein performing the allreduce sub-operations between the first plurality of processing units in the first computing node and the second plurality of processing units in the second  computing node according to the second collective communication algorithm comprises: receiving, by the first plurality of processing units, respective portions of a reduce-scatter result obtained by the second plurality of processing units in the second computing node according to the second collective communication algorithm, each processing unit of the first plurality of processing units forming a group with a respective processing unit of the second plurality of processing units and receiving a respective portion of the reduce-scatter result from the respective processing unit; and reducing, by the first plurality of processing units, the respective portions of the reduce-scatter result with corresponding local portions of a reduce-scatter result obtained after performing the reduce-scatter sub-operations between the first plurality of processing units.
Clause 20: The first computing node of Clause 15, wherein performing the allgather sub-operations between the first plurality of processing units in the first computing node according to the first collective communication algorithm comprises: receiving a data chunk at a first processing unit of the first plurality of processing units from a second processing unit of the first plurality of processing units according to the first collective communication algorithm; and reducing the received data chunk with a local data chunk at the first processing unit.
Clause 21: A method implemented by a first computing node, the method comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are  linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
Clause 22: The method of Clause 21, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
Clause 23: The method of Clause 21, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
Clause 24: The method of Clause 21, wherein the particular network topology comprises a fat-tree topology.
Clause 25: The method of Clause 21, further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
Clause 26: The method of Clause 21, further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface  controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
Clause 27: The method of Clause 26, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
Clause 28: One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
Clause 29: The one or more machine readable media of Clause 28, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring  topology only, the second computing node being different from the first computing node.
Clause 30: The one or more machine readable media of Clause 28, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
Clause 31: The one or more machine readable media of Clause 28, wherein the particular network topology comprises a fat-tree topology.
Clause 32: The one or more machine readable media of Clause 28, the acts further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
Clause 33: The one or more machine readable media of Clause 28, the acts further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
Clause 34: The one or more machine readable media of Clause 33, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second  process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
Clause 35: A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: determining a routing identifier for routing data from a first process to a second process based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in a same computing node or are linked to a same leaf switch, the first process and the second process belonging to a particular inter-node ring that connects a plurality of different nodes under a particular network topology; and routing the data from the first process to the second process according to the routing identifier.
Clause 36: The first computing node of Clause 35, wherein the network interface controller associated with the first process is configured to send data to or receive data from a second computing node in the ring topology only, the second computing node being different from the first computing node.
Clause 37: The first computing node of Clause 35, wherein the network interface controller associated with the first process is further associated with one or more processes, and wherein all data sent from the first process and the one or more processes are sent through the network interface controller.
Clause 38: The first computing node of Clause 35, the acts further comprising setting the routing identifier as a default identifier in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or are linked to the same leaf switch.
Clause 39: The first computing node of Clause 35, the acts further comprising setting the routing identifier to be equal to an identifier of the network interface controller associated with the first process in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different computing nodes or are linked to different leaf switches.
Clause 40: The first computing node of Clause 39, wherein routing the data from the first process to the second process according to the routing identifier comprises routing the data from the first process to the second process through at least a leaf switch connected with the network interface controller associated with the first process and an aggregation switch having an identifier that has a correspondence relationship with the identifier of the network interface controller associated with the first process.
Clause 41: A method implemented by a first computing node, the method comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and  sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
Clause 42: The method of Clause 41, further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
Clause 43: The method of Clause 41, further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
Clause 44: The method of Clause 43, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
Clause 45: The method of Clause 41, wherein the particular network topology comprises a fat-tree topology.
Clause 46: The method of Clause 41, further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 47: The method of Clause 41, further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 48: One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
Clause 49: The one or more machine readable media of Clause 48, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
Clause 50: The one or more machine readable media of Clause 48, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
Clause 51: The one or more machine readable media of Clause 50, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
Clause 52: The one or more machine readable media of Clause 48, wherein the particular network topology comprises a fat-tree topology.
Clause 53: The one or more machine readable media of Clause 48, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 54: The one or more machine readable media of Clause 48, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 55: A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling  algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
Clause 56: The first computing node of Clause 55, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
Clause 57: The first computing node of Clause 55, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
Clause 58: The first computing node of Clause 57, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
Clause 59: The first computing node of Clause 55, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 60: The first computing node of Clause 55, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
Clause 61: A method implemented by a first computing node, the method comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
Clause 62: The method of Clause 61, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
Clause 63: The method of Clause 61, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the  processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
Clause 64: The method of Clause 61, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 65: The method of Clause 61, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 66: The method of Clause 61, further comprising performing another inter-node sub-operation on the portion of the first data segment using the first thread, and performing another intra-node sub-operation on the portion of the second data segment using the second thread in parallel.
Clause 67: The method of Clause 61, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
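By way of illustration and not limitation, the following Python sketch shows the overlap described in Clauses 61-67: a data chunk is split into two segments, one thread performs an intra-node sub-operation on a portion of the first segment while a second thread performs an inter-node sub-operation on a portion of the second segment, and the roles are swapped in a subsequent phase. The functions intra_node_sub_op and inter_node_sub_op are placeholders (standing in for, e.g., intra-node link copies and network transfers) and are assumptions of this sketch rather than the claimed implementation.

```python
# Minimal sketch (illustrative only): overlapping an intra-node sub-operation
# on one data segment with an inter-node sub-operation on another segment,
# using two threads, then swapping roles in the next phase.
import threading


def intra_node_sub_op(segment, peer_gpu):
    # stand-in for a reduce-scatter / allgather step over an intra-node link
    print(f"intra-node: {len(segment)} elements with local processing unit {peer_gpu}")


def inter_node_sub_op(segment, peer_node):
    # stand-in for an allreduce / allgather step over an inter-node connection
    print(f"inter-node: {len(segment)} elements with node {peer_node}")


def overlapped_step(chunk, peer_gpu, peer_node):
    mid = len(chunk) // 2
    seg_a, seg_b = chunk[:mid], chunk[mid:]

    # Phase 1: first thread works intra-node on seg_a while the second
    # thread works inter-node on seg_b.
    t1 = threading.Thread(target=intra_node_sub_op, args=(seg_a, peer_gpu))
    t2 = threading.Thread(target=inter_node_sub_op, args=(seg_b, peer_node))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    # Phase 2: the threads swap roles (compare Clause 66), so both the
    # intra-node and inter-node connections stay busy in each phase.
    t3 = threading.Thread(target=inter_node_sub_op, args=(seg_a, peer_node))
    t4 = threading.Thread(target=intra_node_sub_op, args=(seg_b, peer_gpu))
    t3.start()
    t4.start()
    t3.join()
    t4.join()


overlapped_step(list(range(8)), peer_gpu=1, peer_node=2)
```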
Clause 68: One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
Clause 69: The one or more machine readable media of Clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
Clause 70: The one or more machine readable media of Clause 68, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
Clause 71: The one or more machine readable media of Clause 68, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 72: The one or more machine readable media of Clause 68, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 73: The one or more machine readable media of Clause 68, the acts further comprising performing another inter-node sub-operation on the portion of the first data segment using the first thread, and performing another intra-node sub-operation on the portion of the second data segment using the second thread in parallel.
Clause 74: The one or more machine readable media of Clause 68, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
Clause 75: A first computing node comprising: one or more processing units; and memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising: dividing a data chunk assigned to a processing unit into a plurality of data segments, the plurality of data segments comprising at least a first data segment and a second data segment; assigning the plurality of data segments to a plurality of threads, the plurality of threads comprising at least a first thread and a second thread; and performing an intra-node sub-operation on a portion of the first data segment using the first thread, in parallel with performing an inter-node sub-operation on a portion of the second data segment using the second thread.
Clause 76: The first computing node of Clause 75, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread comprises transmitting the portion of the first data segment between the processing unit and another processing unit included in the first computing node through an intra-node connection.
Clause 77: The first computing node of Clause 75, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread comprises transmitting the portion of the second data segment between the processing unit and another processing unit included in a second computing node that is different from the first computing node through an inter-node connection.
Clause 78: The first computing node of Clause 75, wherein the intra-node sub-operation comprises a reduce-scatter sub-operation or an allgather sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allreduce sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 79: The first computing node of Clause 75, wherein the intra-node sub-operation comprises an allgather sub-operation or a copy sub-operation performed within the first computing node, and the inter-node sub-operation comprises an allgather sub-operation performed between the first computing node and a second computing node that is different from the first computing node.
Clause 80: The first computing node of Clause 75, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread, in parallel with performing the inter-node sub-operation on the portion of the second data segment using the second thread enables utilizing an intra-node connection for transmitting the portion of the first data segment to another processing unit included in the first computing node and an inter-node connection for transmitting the portion of the second data segment to another processing unit included in a second computing node that is different from the first computing node concurrently.
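By way of illustration and not limitation, the sketch below prints the peer schedule of a generic halving-doubling exchange among eight ranks and classifies each step as intra-node, intra-leaf, or crossing an aggregation switch. The placement of two ranks per node and two nodes per leaf switch, and the use of the generic pairing rule peer = rank XOR 2^step, are assumptions made only for this sketch; the node-aware ordering and the aggregation-identifier assignment are as described in the clauses and claims herein.

```python
# Minimal sketch (assumption-laden): peer schedule of a halving-doubling
# exchange among 8 ranks, annotated with whether each step stays inside a
# node, stays under one leaf switch, or must cross an aggregation switch.

RANKS = 8
RANKS_PER_NODE = 2   # illustrative placement
NODES_PER_LEAF = 2   # illustrative placement


def peer(rank: int, step: int) -> int:
    """In step s of halving-doubling, rank i pairs with i XOR 2**s."""
    return rank ^ (1 << step)


def link_kind(a: int, b: int) -> str:
    node_a, node_b = a // RANKS_PER_NODE, b // RANKS_PER_NODE
    if node_a == node_b:
        return "intra-node"
    leaf_a, leaf_b = node_a // NODES_PER_LEAF, node_b // NODES_PER_LEAF
    return "same leaf" if leaf_a == leaf_b else "via aggregation switch"


for step in range(RANKS.bit_length() - 1):  # log2(8) = 3 steps
    pairs = sorted({tuple(sorted((r, peer(r, step)))) for r in range(RANKS)})
    kinds = {link_kind(a, b) for a, b in pairs}
    print(f"step {step}: pairs {pairs} -> {kinds}")
```

Under these illustrative parameters, only the final step crosses leaf switches, which is the step where the aggregation-identifier assignment of Clauses 55-60 would steer concurrent flows onto different aggregation switches.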

Claims (20)

  1. A method implemented by a first computing node, the method comprising:
    determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and
    sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  2. The method of claim 1, further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  3. The method of claim 1, further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  4. The method of claim 3, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  5. The method of claim 1, wherein the particular network topology comprises a fat-tree topology.
  6. The method of claim 1, further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  7. The method of claim 1, further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  8. One or more machine readable media storing machine readable instructions that, when executed by a first computing node, cause the first computing node to perform acts comprising:
    determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and
    sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  9. The one or more machine readable media of claim 8, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  10. The one or more machine readable media of claim 8, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  11. The one or more machine readable media of claim 10, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  12. The one or more machine readable media of claim 8, wherein the particular network topology comprises a fat-tree topology.
  13. The one or more machine readable media of claim 8, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  14. The one or more machine readable media of claim 8, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  15. A first computing node comprising:
    one or more processing units; and
    memory storing machine executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform acts comprising:
    determining an aggregation identifier for sending a data packet from a first process to a second process according to a node-aware halving-doubling algorithm, the first process and the second process belonging to different nodes that are connected to different leaf switches under a particular network topology; and
    sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.
  16. The first computing node of claim 15, the acts further comprising assigning different aggregation identifiers for data packets directed to nodes that are connected to different leaf switches to enable routing the data packets to the nodes that are connected to the different leaf switches through different aggregation switches.
  17. The first computing node of claim 15, the acts further comprising assigning a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence relationship.
  18. The first computing node of claim 17, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source and destination port pairs.
  19. The first computing node of claim 15, the acts further comprising sending respective data packets from a first plurality of processes included in the first computing node to a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
  20. The first computing node of claim 15, the acts further comprising receiving data packets by a first plurality of processes included in the first computing node from a second plurality of processes included in a second computing node through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers that are assigned to the respective data packets.
PCT/CN2020/082516 2020-03-31 2020-03-31 Network congestion avoidance over halving-doubling collective communication WO2021195988A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/082516 WO2021195988A1 (en) 2020-03-31 2020-03-31 Network congestion avoidance over halving-doubling collective communication
CN202080098260.0A CN115335804A (en) 2020-03-31 2020-03-31 Avoiding network congestion by halving-doubling collective communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/082516 WO2021195988A1 (en) 2020-03-31 2020-03-31 Network congestion avoidance over halving-doubling collective communication

Publications (1)

Publication Number Publication Date
WO2021195988A1 (en) 2021-10-07

Family

ID=77926934

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082516 WO2021195988A1 (en) 2020-03-31 2020-03-31 Network congestion avoidance over halving-doubling collective communication

Country Status (2)

Country Link
CN (1) CN115335804A (en)
WO (1) WO2021195988A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567597B (en) * 2022-02-21 2023-12-19 深圳市亦青藤电子科技有限公司 Congestion control method and device based on deep reinforcement learning in Internet of things

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236511A (en) * 2007-01-31 2008-08-06 International Business Machines Corporation Method and system for optimizing global reduction treatment
US20190102169A1 (en) * 2017-09-29 2019-04-04 Fujitsu Limited Effective determination of processor pairs for transferring data processed in parallel
US20190138302A1 (en) * 2017-11-06 2019-05-09 Fujitsu Limited Information processing system, arithmetic processing circuit, and control method for information processing system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102864A (en) * 2022-06-21 2022-09-23 National University of Defense Technology Allgather method and device for Dragonfly topology
CN115102864B (en) * 2022-06-21 2023-08-29 National University of Defense Technology Allgather method and device for Dragonfly topology

Also Published As

Publication number Publication date
CN115335804A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
US9614762B2 (en) Work migration in a processor
Su et al. Adaptive deadlock-free routing in multicomputers using only one extra virtual channel
Liao et al. DPillar: Dual-port server interconnection network for large scale data centers
WO2021195988A1 (en) Network congestion avoidance over halving-doubling collective communication
WO2021195987A1 (en) Topology aware multi-phase method for collective communication
US11140127B2 (en) Optimising data transmission in a hypercube network
CN110226159B (en) Method for performing database functions on a network switch
US10394738B2 (en) Technologies for scalable hierarchical interconnect topologies
EP3560148B1 (en) Database functions-defined network switch
US11245584B2 (en) Software defined network optimization using quantum computing
Olexandr et al. Routing method based on the excess code for fault tolerant clusters with InfiniBand
WO2021195989A1 (en) Parallel method based on hybrid architecture in distributed training
CN109952809B (en) Quaternary full mesh dimension driven network architecture
Kobus et al. Gossip: Efficient communication primitives for multi-gpu systems
US10637739B2 (en) Network topology system and building method for topologies and routing tables thereof
WO2021195990A1 (en) Network congestion avoidance over ring-based collective communication
Costa et al. Why should we integrate services, servers, and networking in a data center?
Chakaravarthy et al. Mapping strategies for the PERCS architecture
KR20210138105A (en) Nested rings on a toroidal computer network
Rout et al. Performance evaluation of the controller in software-defined networking
US11169956B2 (en) Networked computer with embedded rings field
Pickartz et al. Swift: A transparent and flexible communication layer for pcie-coupled accelerators and (co-) processors
Guo Data Center Networking: Network Topologies and Traffic Management in Large-Scale Data Centers
WO2021232190A1 (en) Forward path planning method in massive data center networks
US20200337114A1 (en) Communication control method and information processing apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20929049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20929049

Country of ref document: EP

Kind code of ref document: A1