WO2024001259A1 - Distributed training method, system and device - Google Patents

Distributed training method, system and device

Info

Publication number
WO2024001259A1
Authority
WO
WIPO (PCT)
Prior art keywords: computing, group, computing nodes, inter, node
Application number
PCT/CN2023/078777
Other languages
English (en)
French (fr)
Inventor
郑潇雨
庞西豹
练韵文
李亿
戴宗宏
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司
Publication of WO2024001259A1


Classifications

    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/16 Multipoint routing
    • H04L45/24 Multipath
    • H04L45/26 Route discovery packet
    • H04L41/12 Discovery or management of network topologies
    • H04L67/1031 Controlling of the operation of servers by a load balancer, e.g. adding or removing servers that serve requests
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/098 Distributed learning, e.g. federated learning

Definitions

  • the present application relates to the field of computing technology, and in particular, to a distributed training method, system and device.
  • Deep learning is a type of machine learning technology based on deep neural network algorithms. Deep learning is mainly used in perception, decision-making and other scenarios in the field of artificial intelligence (AI), such as image and speech recognition, natural language translation, computer games, etc.
  • Distributed training refers to multiple computing nodes (workers) jointly training the same model. Any two computing nodes (i.e., a pair of computing nodes) can be connected through multiple layers of switches, so that intermediate data (such as weight gradients) can be transmitted between the two computing nodes.
  • When a switch of a certain layer transmits data to the upper layer, it can select a switch from the multiple switches of the upper layer based on the load balancing principle and transmit the data to the selected upper-layer switch.
  • When an upper-layer switch receives data from multiple lower-layer switches, its transmission links may become congested, which slows down data transmission.
  • This application provides a distributed training method, system and device for improving data transmission speed.
  • this application provides a distributed training method, which is suitable for a distributed training system including a computing cluster and a core switch.
  • the method is executed by a management node.
  • the management node is an external node that is independent of the computing cluster.
  • the external node is connected to multiple computing nodes in the computing cluster to manage each computing node in the computing cluster.
  • the management node is, for example, a computer, or a module in the computer, such as a plug-in.
  • the management node is a computing node in the computing cluster.
  • the computing node is connected to multiple other computing nodes in the computing cluster. It not only has the ability to manage the other multiple computing nodes in the computing cluster, but also has other computing capabilities.
  • the management node is, for example, a physical server.
  • the physical server includes one or more computing units (or processing units).
  • the computing units are, for example, graphics processing units (GPUs), central processing units (CPUs), neural-network processing units (NPUs), etc.
  • the management node includes multiple functional modules, some of the multiple functional modules are deployed in computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • the distributed training method includes: the management node obtains the network topology, where the network topology includes the connectivity relationship between the core switch and the computing nodes in the computing cluster. Further, the computing cluster includes M groups, and each group contains one or more computing nodes. Subsequently, the management node determines the communication plan between N computing nodes based on the network topology, where the N computing nodes are the computing nodes in the computing cluster used for distributed training of the target model. The communication plan includes multiple inter-group paths. For each inter-group path among the multiple inter-group paths: the inter-group path includes two computing nodes belonging to different groups among the N computing nodes, and the core switch used to connect the two computing nodes; the inter-group path is used to transmit data between the two computing nodes in the inter-group path. The amount of data transmitted by the multiple inter-group paths respectively meets the preset conditions. M and N are both integers greater than 2.
  • the management node determines the communication plan of the N computing nodes in the data aggregation process of distributed training based on the network topology, so that the amount of data transmitted by the multiple inter-group paths included in the communication plan meets the preset conditions. This avoids the problem that, in the inter-group transmission mode, a certain core switch needs to transmit a large amount of data when the N computing nodes perform data aggregation, causing transmission link congestion on that core switch. This helps to improve the data transmission speed, thereby further improving the speed of distributed training.
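  • To make the relationship between the network topology, the groups and the inter-group paths concrete, the following is a minimal Python sketch of how a management node might represent this information. The class and field names (NetworkTopology, InterGroupPath, CommunicationPlan, core_links) are illustrative assumptions and not terminology from this application.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NetworkTopology:
    """Connectivity between core switches and computing nodes (grouped into M groups)."""
    groups: Dict[int, List[str]]         # group id -> computing node ids in that group
    core_links: Dict[str, List[str]]     # core switch id -> node ids reachable through it

@dataclass
class InterGroupPath:
    """Two nodes in different groups plus the core switch that connects them."""
    src: str
    dst: str
    core_switch: str
    data_bytes: int = 0                  # amount of data this path will carry

@dataclass
class CommunicationPlan:
    inter_group_paths: List[InterGroupPath] = field(default_factory=list)

def group_of(topology: NetworkTopology, node: str) -> int:
    return next(g for g, nodes in topology.groups.items() if node in nodes)

def plan_inter_group_path(topology: NetworkTopology, src: str, dst: str,
                          data_bytes: int) -> InterGroupPath:
    """Pick any core switch that reaches both nodes (they must lie in different groups)."""
    assert group_of(topology, src) != group_of(topology, dst)
    core = next(sw for sw, nodes in topology.core_links.items()
                if src in nodes and dst in nodes)
    return InterGroupPath(src, dst, core, data_bytes)

# Example with illustrative values: nodes 1.1 and 2.1 sit in different groups and
# both hang off core switch 1, so an inter-group path between them uses that switch.
topo = NetworkTopology(groups={1: ["1.1"], 2: ["2.1"]},
                       core_links={"core switch 1": ["1.1", "2.1"]})
print(plan_inter_group_path(topo, "1.1", "2.1", data_bytes=1024))
```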
  • the management node determines the communication plan between the N computing nodes based on the network topology, specifically: the management node determines the communication plan between the N computing nodes based on the network topology and the communication algorithm; the communication algorithm is used to aggregate the data obtained by the N computing nodes performing training separately in distributed training, so as to obtain the target model.
  • Communication algorithms include ring algorithm, halving-doubling (HD) algorithm, binary tree algorithm, etc.
  • the management node determines the communication plan between N computing nodes based on the principles of different communication algorithms and combined with the network topology, which helps the N computing nodes perform distributed training more efficiently.
  • each core switch includes one or more traffic ports; the amount of data transmitted by the multiple inter-group paths respectively meeting the preset conditions includes: among the multiple traffic ports included in the multiple inter-group paths, the difference in the traffic of any two traffic ports is less than a threshold, where the traffic of a traffic port is associated with the amount of data transmitted between the two computing nodes in the inter-group path to which the traffic port belongs. In one possible implementation, when each inter-group path includes multi-level core switches, the core switches to which any two traffic ports whose traffic difference is less than the threshold belong are at the same level.
  • the communication plan determined by the management node is used to achieve load balancing of the traffic on the traffic ports of the multiple core switches passed by the multiple inter-group paths, thereby avoiding serious traffic congestion on a certain core switch during data transmission and ensuring that the data transmitted by the inter-group paths is balanced across the entire distributed training.
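  • Continuing the illustrative structures above, the sketch below is one possible reading of the preset condition: the per-port traffic implied by a set of inter-group paths is balanced if the traffic of any two traffic ports differs by less than a threshold. Treating the port facing the destination node as the traffic port, and the threshold value itself, are simplifying assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Tuple

def port_traffic(paths: Iterable[InterGroupPath]) -> Dict[Tuple[str, str], int]:
    """Accumulate traffic per (core switch, traffic port); the destination-facing port
    stands in for the traffic port here, which is an assumption."""
    traffic = defaultdict(int)
    for p in paths:
        traffic[(p.core_switch, p.dst)] += p.data_bytes
    return traffic

def is_balanced(paths: Iterable[InterGroupPath], threshold: int) -> bool:
    """Preset condition: the traffic of any two traffic ports differs by less than threshold."""
    loads = list(port_traffic(paths).values())
    return all(abs(a - b) < threshold for a, b in combinations(loads, 2))
```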
  • for any two of the multiple inter-group paths: the two inter-group paths include different core switches, or the two inter-group paths include the same core switch but pass through different traffic ports of that core switch.
  • the traffic ports passed by the multiple inter-group paths do not overlap, which prevents a certain traffic port of a core switch from needing to transmit data for multiple inter-group paths, thereby avoiding traffic port congestion and helping to improve the data transmission speed.
  • the network topology includes the connectivity relationships of the core switches, the computing cluster, and the access switches; for each inter-group path among the multiple inter-group paths: the inter-group path also includes two access switches respectively corresponding to the two computing nodes, and each computing node in the inter-group path is connected to the core switch through the access switch corresponding to that computing node.
  • a method for realizing the connection between computing nodes and core switches is provided.
  • the communication plan also includes multiple intra-group paths.
  • Each intra-group path includes two computing nodes belonging to the same group among the N computing nodes, and the access switch corresponding to the group.
  • the intra-group path is used to transmit data between the two computing nodes in the intra-group path.
  • the amount of data transmitted between two computing nodes in the intra-group path is greater than the amount of data transmitted between the two computing nodes in the inter-group path.
  • the communication plan determined by the management node includes not only multiple inter-group paths, but also multiple intra-group paths.
  • the data transmission performance of intra-group paths is better than the data transmission performance of inter-group paths.
  • the management node can plan inter-group paths to transmit data with a smaller data volume and intra-group paths to transmit data with a larger data volume, achieving more efficient data transmission, avoiding congestion at the core switch ports in the inter-group paths, and improving the speed of distributed training.
  • the M groups respectively correspond to M access switches; for each of the M access switches: the access switch includes K first ports and K second ports respectively corresponding to the K first ports; the K first ports are respectively connected to K core switches; the K second ports are respectively connected to K ports of the computing nodes in the group corresponding to the access switch; K is an integer greater than 2.
  • the access switch can not only connect any core switch with any computing node in the group corresponding to the access switch, but also connect any two computing nodes in that group, thereby enabling any two computing nodes in the entire computing cluster to be connected to each other and to train the target model in a distributed manner.
  • before determining the communication plan between the N computing nodes based on the network topology, the management node obtains a training task, where the training task includes the total number of computing nodes N and the communication algorithm.
  • the management node then determines, from the multiple computing nodes in the idle state in the computing cluster, the N computing nodes and the communication plan between the N computing nodes based on the network topology, the total number of computing nodes N and the communication algorithm.
  • the user issues a training task to the management node, and includes the parameters required by the user in the training task, that is, the total number of computing nodes N and the communication algorithm. In this way, the user's needs for distributed training can be better met.
  • the management node determines the N computing nodes and the communication plan between the N computing nodes from the multiple computing nodes in the idle state in the computing cluster based on the network topology, the total number of computing nodes N and the communication algorithm, specifically as follows: the management node determines N computing nodes from the multiple idle computing nodes in the computing cluster based on the network topology and the total number of computing nodes N; pairs the computing nodes that belong to the same group among the N computing nodes, and, when there are multiple computing nodes that have not yet been paired, pairs those remaining computing nodes to obtain N/2 node pairs; determines, according to the communication algorithm, the multiple rounds of communication and the N/2 node pairs, the communication plans of the N computing nodes in the multiple rounds of communication respectively, where, for the communication plan in any round of communication, the greater the amount of data transmitted by the two computing nodes in the communication plan, the smaller the number of inter-group paths included in the communication plan; and, if it is determined that in the i-th round of communication the communication plan of the N computing nodes includes multiple inter-group paths whose transmitted data amounts do not meet the preset conditions, adjusts the communication plan of the N computing nodes in the i-th round of communication, where i is a positive integer.
  • the management node first selects N computing nodes from the computing cluster, and then performs communication planning for the N computing nodes, which helps to reduce the amount of calculation in the communication planning process. Further, the management node first pairs the N computing nodes, and then determines the communication plan of the N computing nodes in each round of communication based on the node pairs obtained by the pairing and the multiple rounds of communication of the communication algorithm. This helps ensure that the amount of data transmitted by the multiple inter-group paths in each round of communication meets the preset conditions, further improving the efficiency of data transmission in each round of communication.
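  • The pairing step described above (pair computing nodes inside the same group first, then pair the leftover nodes across groups to reach N/2 node pairs) can be sketched as follows; the function name and the greedy strategy are illustrative assumptions rather than the exact procedure of this application.

```python
from typing import Dict, List, Tuple

def pair_nodes(selected: Dict[int, List[str]]) -> List[Tuple[str, str]]:
    """selected: group id -> chosen computing nodes in that group (N nodes in total, N even).
    Returns N/2 node pairs, preferring pairs inside the same group."""
    pairs, leftovers = [], []
    for group_nodes in selected.values():
        nodes = list(group_nodes)
        while len(nodes) >= 2:                 # pair within the group first
            pairs.append((nodes.pop(), nodes.pop()))
        leftovers.extend(nodes)                # at most one unpaired node per group
    while len(leftovers) >= 2:                 # pair the remaining nodes across groups
        pairs.append((leftovers.pop(), leftovers.pop()))
    return pairs

# Example: the two nodes of group 1 pair with each other; the single nodes of
# groups 2 and 3 are left over and are paired across groups.
print(pair_nodes({1: ["1.1", "1.2"], 2: ["2.1"], 3: ["3.1"]}))
# [('1.2', '1.1'), ('3.1', '2.1')]
```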
  • the multiple inter-group paths include a first inter-group path, and the first inter-group path includes a first computing node, a second computing node, and a first core switch.
  • after determining the communication plan between the N computing nodes, the management node also sends first information to the first computing node and the second computing node respectively according to the communication plan; the first information indicates the first inter-group path, which is used by the first computing node to send first data to the second computing node.
  • the first computing node and the second computing node may respectively transmit the first data through the first inter-group path according to the first information.
  • the multiple intra-group paths include a first intra-group path, and the first intra-group path includes a first computing node, a third computing node, and a first access switch; after determining the communication plan between the N computing nodes, the management node also sends second information to the first computing node and the third computing node respectively according to the communication plan; the second information indicates the first intra-group path for the first computing node to send second data to the third computing node.
  • the first computing node and the third computing node may respectively transmit the second data through the first intra-group path according to the second information.
  • this application provides a distributed training system.
  • the distributed training system includes: K core switches and a computing cluster, wherein the computing cluster includes M groups, and each group includes one or more computing nodes;
  • K core switches are used to connect computing nodes located in different groups in M groups.
  • the distributed training system includes management nodes.
  • the management node is an external node independent of the computing cluster.
  • the management node is respectively connected to multiple computing nodes in the computing cluster to manage each computing node in the computing cluster.
  • the management node is, for example, a computer, or a module in the computer, such as a plug-in.
  • the management node is a computing node in the computing cluster.
  • the computing node is connected to multiple other computing nodes in the computing cluster. It not only has the ability to manage the other multiple computing nodes in the computing cluster, but also has other computing capabilities.
  • the computing node is, for example, a physical server.
  • the physical server includes one or more computing units (or processing units), such as GPU, CPU, NPU, etc.
  • the management node includes multiple functional modules, some of the multiple functional modules are deployed in computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • the management node is used to obtain the network topology and determine the communication plan between N computing nodes based on the network topology.
  • the network topology includes the connectivity relationship between K core switches and computing nodes in the computing cluster, where the N computing nodes are Computing nodes in the computing cluster used for distributed training of the target model;
  • the communication plan includes multiple inter-group paths.
  • the inter-group path includes two computing nodes belonging to different groups among the N computing nodes, and the core switch among the K core switches that is used to connect the two computing nodes; the inter-group path is used to transmit data between the two computing nodes in the inter-group path;
  • the amount of data transmitted by multiple inter-group paths respectively meets the preset conditions
  • K, M and N are all integers greater than 2.
  • when the management node determines the communication plan between the N computing nodes based on the network topology, it is specifically used to: determine the communication plan between the N computing nodes based on the network topology and the communication algorithm; the communication algorithm is used to aggregate the data obtained by the N computing nodes performing training separately in distributed training, so as to obtain the target model.
  • each core switch includes one or more traffic ports; the amount of data transmitted by the multiple inter-group paths respectively meeting the preset conditions includes: among the multiple traffic ports included in the multiple inter-group paths, the difference in the traffic of any two traffic ports is less than a threshold, where the traffic of a traffic port is associated with the amount of data transmitted between the two computing nodes in the inter-group path to which the traffic port belongs.
  • when each inter-group path includes multi-level core switches, the core switches to which any two traffic ports whose traffic difference is less than the threshold belong are at the same level.
  • the distributed training system also includes: M access switches respectively corresponding to the M groups; any one of the M access switches is used to connect the computing nodes in the group corresponding to that access switch.
  • the communication plan also includes multiple intra-group paths.
  • Each intra-group path includes two computing nodes belonging to the same group among the N computing nodes, and the access switch corresponding to that group among the M access switches; the intra-group path is used to transmit data between the two computing nodes in the intra-group path.
  • the amount of data transmitted between two computing nodes in the intra-group path is greater than the amount of data transmitted between the two computing nodes in the inter-group path.
  • the multiple inter-group paths include a first inter-group path, and the first inter-group path includes a first computing node, a second computing node, and a first core switch;
  • the management node is also used to: send, according to the communication plan, first information to the first computing node and the second computing node respectively, the first information indicating the first inter-group path for the first computing node to send the first data to the second computing node; the first computing node is used to send the first data to the first core switch according to the first information;
  • the first core switch is configured to receive the first data from the first computing node and forward the first data to the second computing node; the second computing node is used to receive the first data from the first core switch according to the first information.
  • the first inter-group path also includes a first access switch corresponding to the first computing node and a second access switch corresponding to the second computing node.
  • the first computing node is specifically configured to send the first data to the first access switch according to the first information;
  • the first access switch is configured to receive the first data from the first computing node and send the first data to the first core switch;
  • the first core switch is specifically configured to receive the first data from the first access switch and forward the first data to the second access switch;
  • the second access switch is configured to receive the first data from the first core switch and send the first data to the second computing node;
  • the second computing node is specifically configured to receive the first data from the second access switch according to the first information.
  • the multiple intra-group paths include a first intra-group path, and the first intra-group path includes a first computing node, a third computing node, and a first access switch;
  • the management node is also used to: according to the communication plan, send second information to the first computing node and the third computing node respectively, the second information indicating the first intra-group path for the first computing node to send the second data to the third computing node; correspondingly, the first computing node is used to send the second data to the first access switch according to the second information; the first access switch is used to forward the second data to the third computing node; the third computing node is used to receive the second data from the first access switch according to the second information.
  • this application provides a distributed training device, which is specifically a management node.
  • the management node is an external node that is independent of the computing cluster.
  • the external node is connected to multiple computing nodes in the computing cluster to manage each computing node in the computing cluster.
  • the management node is, for example, a computer, or a module in the computer, such as a plug-in.
  • the management node is a computing node in the computing cluster.
  • the computing node is connected to multiple other computing nodes in the computing cluster. It not only has the ability to manage the other multiple computing nodes in the computing cluster, but also has other computing capabilities.
  • the computing node is, for example, a physical server.
  • the physical server includes one or more computing units (or processing units), such as GPU, CPU, NPU, etc.
  • the management node includes multiple functional modules, some of the multiple functional modules are deployed in computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • The distributed training device includes:
  • the acquisition module is used to obtain the network topology.
  • the network topology includes the connectivity relationship between the core switch and the computing nodes in the computing cluster.
  • the computing cluster includes M groups, and each group includes one or more computing nodes;
  • the processing module is used to determine the communication plan between N computing nodes according to the network topology; the N computing nodes are the computing nodes in the computing cluster used for distributed training of the target model; the communication plan includes multiple inter-group paths, and for each inter-group path among the multiple inter-group paths: the inter-group path includes two computing nodes belonging to different groups among the N computing nodes, and the core switch used to connect the two computing nodes; the inter-group path is used to transmit data between the two computing nodes in the inter-group path; the amount of data transmitted by the multiple inter-group paths respectively meets the preset conditions;
  • Both M and N are integers greater than 2.
  • when the processing module determines the communication plan between the N computing nodes based on the network topology, it is specifically used to: determine the communication plan between the N computing nodes based on the network topology and the communication algorithm;
  • the communication algorithm is used to aggregate the data obtained by N computing nodes performing training separately in distributed training to obtain the target model.
  • each core switch includes one or more traffic ports; the amount of data transmitted by the multiple inter-group paths respectively meeting the preset conditions includes: among the multiple traffic ports included in the multiple inter-group paths, the difference in the traffic of any two traffic ports is less than a threshold, where the traffic of a traffic port is associated with the amount of data transmitted between the two computing nodes in the inter-group path to which the traffic port belongs.
  • when each inter-group path includes multi-level core switches, the core switches to which any two traffic ports whose traffic difference is less than the threshold belong are at the same level.
  • for any two of the multiple inter-group paths: the two inter-group paths include different core switches, or the two inter-group paths include the same core switch but pass through different traffic ports of that core switch.
  • the network topology includes the connectivity relationships of the core switches, the computing cluster, and the access switches; for each inter-group path among the multiple inter-group paths: the inter-group path also includes two access switches respectively corresponding to the two computing nodes, and each computing node in the inter-group path is connected to the core switch through the access switch corresponding to that computing node.
  • the communication plan also includes multiple intra-group paths.
  • Each intra-group path includes two computing nodes belonging to the same group among the N computing nodes, and the access switch corresponding to the group.
  • the intra-group path is used to transmit data between the two computing nodes in the intra-group path.
  • the amount of data transmitted between two computing nodes in the intra-group path is greater than the amount of data transmitted between the two computing nodes in the inter-group path.
  • the M groups respectively correspond to M access switches; for each of the M access switches: the access switch includes K first ports and K second ports respectively corresponding to the K first ports; the K first ports are respectively connected to K core switches; the K second ports are respectively connected to K ports of the computing nodes in the group corresponding to the access switch; K is an integer greater than 2.
  • the acquisition module is also used to: acquire a training task, which includes the total number of computing nodes N and the communication algorithm; when the processing module determines the communication plan between the N computing nodes based on the network topology, it is specifically used to: determine, from the multiple computing nodes in the idle state in the computing cluster, the N computing nodes and the communication plan between the N computing nodes based on the network topology, the total number of computing nodes N and the communication algorithm.
  • when the processing module determines the N computing nodes and the communication plan between the N computing nodes from the multiple computing nodes in the idle state in the computing cluster based on the network topology, the total number of computing nodes N and the communication algorithm, it is specifically used to: determine N computing nodes from the multiple idle computing nodes in the computing cluster based on the network topology and the total number of computing nodes N; pair the computing nodes that belong to the same group among the N computing nodes, and, when there are multiple computing nodes that have not yet been paired, pair those remaining computing nodes to obtain N/2 node pairs; determine, according to the communication algorithm, the multiple rounds of communication and the N/2 node pairs, the communication plans of the N computing nodes in the multiple rounds of communication; wherein, for the communication plan in any round of communication, the greater the amount of data transmitted by the two computing nodes in the communication plan, the smaller the number of inter-group paths included in the communication plan;
  • if it is determined that in the i-th round of communication the communication plan of the N computing nodes includes multiple inter-group paths and the amount of data transmitted by the multiple inter-group paths does not meet the preset conditions, the communication plan of the N computing nodes in the i-th round of communication is adjusted, where i is a positive integer.
  • the multiple inter-group paths include a first inter-group path, and the first inter-group path includes a first computing node, a second computing node, and a first core switch; the device further includes a sending module; the sending module is configured to: send first information to the first computing node and the second computing node respectively; wherein the first information indicates the first inter-group path for the first computing node to send the first data to the second computing node.
  • the multiple intra-group paths include a first intra-group path, and the first intra-group path includes a first computing node, a third computing node, and a first access switch; the device further includes a sending module; the sending module is configured to: send second information to the first computing node and the third computing node respectively; wherein the second information indicates the first intra-group path for the first computing node to send the second data to the third computing node.
  • embodiments of the present application provide a computing device, including a processor connected to a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the computing device executes the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium.
  • Computer programs or instructions are stored in the computer-readable storage medium; when the computer programs or instructions are executed, the method in the above-mentioned first aspect or any possible implementation of the first aspect is implemented.
  • embodiments of the present application provide a computer program product which, when read and executed by a computer, causes the computer to execute the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • Figure 1 is a schematic structural diagram of a neural network
  • Figure 2 is a schematic diagram of a stochastic gradient descent method
  • Figure 3 is a schematic diagram of data aggregation based on HD algorithm
  • Figure 4 is a schematic diagram of a distributed training system provided by this application.
  • Figure 5a is a schematic diagram of the interface connection relationship in a distributed training system provided by this application.
  • Figure 5b is a schematic diagram of the interface connection relationship in yet another distributed training system provided by this application.
  • FIG. 6 is an architectural schematic diagram of yet another distributed training system provided by this application.
  • Figure 7 is a schematic flow chart of a distributed training method provided by this application.
  • Figure 8 is a schematic flowchart of a management node determining a communication plan provided by this application.
  • Figure 9 is a communication diagram based on the HD algorithm provided by this application as an example.
  • FIG. 10 is a schematic structural diagram of a management node provided by this application.
  • Figure 11 is a schematic structural diagram of a distributed training device provided by this application.
  • Figure 12 is a schematic structural diagram of yet another distributed training device provided by this application.
  • A neural network is a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system of animals, especially the brain). A neural network is formed by a large number of interconnected neurons that perform calculations.
  • Figure 1 is a schematic structural diagram of a neural network.
  • the neural network may include m layers connected end to end, where m is an integer greater than or equal to 2.
  • the first layer of the neural network can be expressed as a function f_0, where the input of f_0 is x, the output is y_0, and the weight is w_0;
  • the second layer of the neural network can be expressed as a function f_1, where the input of f_1 is y_0, the output is y_1, the weight is w_1, and so on.
  • any input in the data set (expressed, for example, as x_j, with corresponding expected output, or label, l_j) can be fed into the neural network shown in Figure 1 to obtain the actual output of the neural network, expressed for example as y_j^{m-1}.
  • the goal of model training is to solve for w_0, ..., w_{m-1} so that, under the loss function L, y_j^{m-1} and l_j are closest.
  • one method of solving these weights is stochastic gradient descent (SGD), illustrated in Figure 2.
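  • In the notation above, the training objective and a single SGD update step can be written as follows; this is the standard textbook formulation, added for readability rather than quoted from the application (B denotes a mini-batch and η the learning rate):

```latex
% Training objective: choose the weights that make actual and expected outputs closest under L
\min_{w_0,\dots,w_{m-1}} \sum_{j} L\left(y_j^{m-1},\, l_j\right),
\qquad y_j^{m-1} = f_{m-1}\!\left(\cdots f_1\!\left(f_0(x_j;\, w_0);\, w_1\right)\cdots;\, w_{m-1}\right)

% One SGD step on a mini-batch B with learning rate \eta, for each weight w_i
w_i \leftarrow w_i \;-\; \eta \,\frac{1}{|B|} \sum_{j \in B} \frac{\partial L\left(y_j^{m-1},\, l_j\right)}{\partial w_i}
```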
  • because the amount of training data required for model training is very large, or the model itself requires a large amount of computation, multiple computing nodes included in a computing cluster can be used to train the model together in order to train it more efficiently and quickly. This method of training the model can be called distributed model training, or distributed training for short.
  • the computing node may include one or more computing units, such as GPU, CPU, and NPU.
  • the data set of training data is divided into multiple data subsets corresponding to the multiple computing nodes, where the size of a data subset is, for example, a batch size or a mini-batch size.
  • the multiple computing nodes input their corresponding data subsets into their local neural networks to obtain the actual outputs of their respective neural networks, and then determine the weight gradient corresponding to the (m-1)-th layer of their respective neural networks based on the actual output, the expected output and the loss function. Subsequently, the multiple computing nodes perform data aggregation and perform the next round of iteration based on the aggregated intermediate data.
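  • The aggregation at the end of each data-parallel iteration can be as simple as averaging the locally computed weight gradients; the sketch below illustrates this with plain Python lists (the function name and the averaging choice are assumptions, since other reductions such as summation are also possible).

```python
from typing import List

def aggregate_gradients(per_node_grads: List[List[float]]) -> List[float]:
    """Element-wise average of the weight gradients computed independently by each node."""
    n = len(per_node_grads)
    return [sum(g) / n for g in zip(*per_node_grads)]

# Each of the 4 nodes trained on its own data subset and produced a local gradient;
# after aggregation every node continues the next iteration from the same averaged gradient.
local_grads = [[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0], [1.0, 4.0]]
print(aggregate_gradients(local_grads))   # [1.0, 1.0]
```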
  • a distributed training method that can divide a data set is called a data parallel training method.
  • Distributed training methods also include model parallel training methods. Specifically, the model is segmented to obtain multiple sub-models, and the multiple sub-models are run by their corresponding computing nodes. In each training iteration of the model parallel training method, the multiple computing nodes also perform similar data aggregation as described above to obtain input for the next round of model training.
  • In each iteration of the distributed training process, the multiple computing nodes need to aggregate the intermediate data and perform the next round of iteration based on the aggregated intermediate data. After multiple rounds of iterations, the final model (referred to as the target model) is obtained.
  • the intermediate data may include one or more of the features (or activations), gradients and model parameters obtained by the computing nodes performing their respective model training.
  • the features are, for example, the characteristics of the training data learned by the model
  • the model parameters are, for example, the parameter w of the function f in the neural network, etc.
  • the gradient is, for example, the difference Δw_j of w_j generated in back propagation, etc.
  • the intermediate data can be referred to as data for short.
  • multiple computing nodes can complete data aggregation through aggregation communication.
  • the aggregation algorithms (or communication algorithms) used in aggregation communication are, for example, the ring algorithm, the halving-doubling (HD) algorithm, the binary tree algorithm, etc.
  • For example, 4 computing nodes are used for distributed training, where the 4 computing nodes are respectively represented as computing node a to computing node d.
  • the four computing nodes perform data aggregation through the HD algorithm.
  • Each computing node divides its data into 4 parts. Specifically, computing node a includes data a1 to a4, computing node b includes data b1 to b4, computing node c includes data c1 to c4, and computing node d includes data d1 to d4.
  • FIG. 3 shows an example of data aggregation by four computing nodes using the HD algorithm in this application.
  • the HD algorithm includes two parts: reduce-scatter and allgather.
  • the reduce-scatter of the HD algorithm includes the following steps 1 and 2:
  • step 1
  • node pairs composed of four computing nodes, namely (computing node a and computing node b), (computing node c and computing node d). Among them, the two computing nodes in the node pair exchange data with each other.
  • computing node a and computing node b exchange data.
  • computing node a sends data a1 and a2 to computing node b
  • computing node b sends data b3 and b4 to computing node a.
  • computing node a includes data: a1, a2, a3+b3, a4+b4
  • computing node b includes data: a1+b1, a2+b2, b3, b4.
  • (Computing node c and computing node d) exchange data in a similar manner to (computing node a and computing node b). For details, see step 1 in Figure 3.
  • step 2
  • node pairs composed of four computing nodes, namely (computing node a and computing node c), (computing node b and computing node d). Among them, the two computing nodes in the node pair exchange data with each other.
  • computing node a and computing node c exchange data.
  • computing node a sends data a3+b3 to computing node c
  • computing node c sends data c4+d4 to computing node a.
  • computing node a includes data: a1, a2, a3+b3, a4+b4+c4+d4
  • computing node c includes data: c1, c2, a3+b3+c3+d3, c4+d4.
  • (Computing node b and computing node d) exchange data in a similar manner to (computing node a and computing node c). For details, see step 2 in Figure 3.
  • the allgather of the HD algorithm includes the following steps 3 and 4:
  • step 3
  • node pairs composed of four computing nodes, namely (computing node a and computing node c), (computing node b and computing node d). Among them, the two computing nodes in the node pair exchange data with each other.
  • computing node a and computing node c exchange data.
  • computing node a sends data a4+b4+c4+d4 to computing node c
  • computing node c sends data a3+b3+c3+d3 to computing node a.
  • computing node a includes data: a1, a2, a3+b3+c3+d3, a4+b4+c4+d4
  • computing node c includes data: c1, c2, a3+b3+c3+d3, a4+b4+c4+d4.
  • the way (computing node b and computing node d) exchange data is similar to (computing node a and computing node c). For details, see step 3 in Figure 3.
  • step 4
  • node pairs composed of four computing nodes, namely (computing node a and computing node b), (computing node c and computing node d). Among them, the two computing nodes in the node pair exchange data with each other.
  • computing node a and computing node b exchange data. Specifically, computing node a sends data a3+b3+c3+d3 and a4+b4+c4+d4 to computing node b, and computing node b sends data a1+b1+c1+d1 and a2+b2+c2+d2 to computing node a.
  • computing node a includes data: a1+b1+c1+d1, a2+b2+c2+d2, a3+b3+c3+d3, a4+b4+c4+d4;
  • computing node b includes data: a1+b1+c1+d1, a2+b2+c2+d2, a3+b3+c3+d3, a4+b4+c4+d4.
  • Computing node c and computing node d exchange data in a similar manner to (computing node a and computing node b). For details, see step 4 in Figure 3.
  • each computing node among computing node a, computing node b, computing node c and computing node d obtains a1+b1+c1+d1, a2+b2+c2+d2, a3+b3+c3+d3, a4+b4+c4+d4.
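  • The four steps above generalize to any power-of-two number of computing nodes: reduce-scatter halves each node's chunk range per step while adding the peer's partial sums, and allgather replays the same pairings in reverse while exchanging fully reduced chunks. The self-contained sketch below simulates this and reproduces the result of the a-to-d example; the "lower rank keeps the upper half" convention matches the walk-through above, but the function and variable names are illustrative.

```python
def hd_allreduce(node_data):
    """Simulate halving-doubling allreduce. node_data: list of N lists, node i holds N chunks."""
    n = len(node_data)
    assert n & (n - 1) == 0 and n > 1, "HD assumes a power-of-two number of nodes"
    data = [list(chunks) for chunks in node_data]     # working copy, input left untouched
    lo, hi = [0] * n, [n] * n                         # chunk range each node is responsible for

    # Reduce-scatter: each step halves the range and adds the peer's partial sums.
    step = 1
    while step < n:
        new_bounds = {}
        for rank in range(n):
            peer = rank ^ step
            mid = (lo[rank] + hi[rank]) // 2
            # The lower rank of the pair keeps the upper half, the higher rank the lower half.
            keep_lo, keep_hi = (mid, hi[rank]) if rank < peer else (lo[rank], mid)
            for c in range(keep_lo, keep_hi):
                data[rank][c] += data[peer][c]        # peer never modifies these chunks this step
            new_bounds[rank] = (keep_lo, keep_hi)
        for rank, bounds in new_bounds.items():
            lo[rank], hi[rank] = bounds
        step *= 2

    # Allgather: mirror the steps in reverse order, exchanging fully reduced chunk ranges.
    step = n // 2
    while step >= 1:
        new_bounds = {}
        for rank in range(n):
            peer = rank ^ step
            for c in range(lo[peer], hi[peer]):       # copy the peer's reduced chunks
                data[rank][c] = data[peer][c]
            new_bounds[rank] = (min(lo[rank], lo[peer]), max(hi[rank], hi[peer]))
        for rank, bounds in new_bounds.items():
            lo[rank], hi[rank] = bounds
        step //= 2
    return data

# 4 nodes, 4 chunks each, mirroring a1..a4 / b1..b4 / c1..c4 / d1..d4.
nodes = [[(i + 1) * 10 + j + 1 for j in range(4)] for i in range(4)]
result = hd_allreduce(nodes)
expected = [sum(nodes[i][j] for i in range(4)) for j in range(4)]
assert all(chunks == expected for chunks in result)
print(result[0])   # every node ends with the same fully aggregated chunks: [104, 108, 112, 116]
```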
  • In each step of the above HD algorithm, all computing nodes are paired to obtain multiple node pairs (that is, each step corresponds to one pairing of the computing nodes), and the two computing nodes in each node pair exchange data.
  • computing node a to computing node d are arranged sequentially in the actual deployment, and the distance between any two adjacent computing nodes is a fixed value; that is, the distances between computing node a and computing node b, between computing node b and computing node c, and between computing node c and computing node d are all the same fixed value.
  • computing node a is the farthest from computing node d
  • computing node a is closest to computing node b, etc.
  • the pairing of computing nodes in each step can be determined based on the distance of the four computing nodes.
  • In reduce-scatter, the distance between the two paired computing nodes gradually increases from step to step, while the amount of data transmitted gradually decreases.
  • in step 1, computing node a is paired with computing node b.
  • in step 2, computing node a is paired with computing node c.
  • the distance between computing node a and computing node b is half of the distance between computing node a and computing node c.
  • the amount of data transmitted between computing node a and computing node b is twice the amount of data transmitted between computing node a and computing node c.
  • In allgather, the distance between the two paired computing nodes gradually decreases from step to step, while the amount of data transferred gradually increases.
  • in step 3, computing node a and computing node c are paired.
  • in step 4, computing node a and computing node b are paired.
  • the distance between computing node a and computing node c is twice the distance between computing node a and computing node b, and the amount of data transmitted between computing node a and computing node c is half of the amount of data transmitted between computing node a and computing node b.
  • steps 1 and 2 in reduce-scatter are opposite to steps 3 and 4 in allgather.
  • the node pairs in step 1 can be the same as the node pairs in step 4, and the node pairs in step 2 can be the same as the node pairs in step 3.
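  • Equivalently, numbering the computing nodes 0 to N-1, the peer of a node in reduce-scatter step s is obtained by flipping bit s-1 of its rank, and allgather walks the same pairing schedule backwards. A small sketch (illustrative, not quoted from the application):

```python
def hd_schedule(n):
    """Return, for each reduce-scatter step, the node pairs; allgather reuses them in reverse."""
    steps = []
    step = 1
    while step < n:
        steps.append([(rank, rank ^ step) for rank in range(n) if rank < rank ^ step])
        step *= 2
    return steps

print(hd_schedule(4))
# [[(0, 1), (2, 3)], [(0, 2), (1, 3)]]
# step 1: (a,b),(c,d); step 2: (a,c),(b,d); steps 3 and 4 reuse these pairs in reverse order
```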
  • computing node a needs to be connected to computing node b and computing node c respectively, and computing node b needs to be connected to computing node a and computing node d respectively.
  • computing node a may also need to be connected to other computing nodes.
  • computing node a also needs to be connected to computing node d.
  • This application provides a distributed training system, which includes K core switches and a computing cluster.
  • the computing cluster includes M groups, each group includes one or more computing nodes, and the number of computing nodes in each group may be the same or different.
  • a computing node can be considered a physical server, and a computing node includes one or more computing units (or processing units), such as CPU, NPU, or GPU.
  • Both M and K are integers greater than 2.
  • K core switches are used to connect computing nodes located in different groups in M groups. That is, two computing nodes included in any two groups among the M groups can be connected through one or more core switches among the K core switches.
  • the core switch is, for example, a spine switch.
  • Figure 4 exemplarily shows a schematic diagram of a distributed training system.
  • K core switches are respectively marked as core switch 1 to core switch K
  • M groups are respectively marked as group 1 to group M.
  • Each group includes k computing nodes, taking group 1 as an example, the k computing nodes in group 1 are respectively recorded as computing node 1.1 to computing node 1.k.
  • the labels of computing nodes in other groups can be seen in Figure 4.
  • the computing node 1.1 located in group 1 and the computing node 2.1 located in group 2 can be connected through the core switch 1. That is, the computing node 1.1 can transmit data with the computing node 2.1 through the core switch 1.
  • the distributed training system also includes M access switches corresponding to the M groups respectively.
  • the access switch is used to connect the computing nodes in its corresponding group and the core switch that the computing node needs to connect to.
  • the M access switches are respectively denoted as access switch 1 to access switch M.
  • access switch 1 is used to connect computing node 1.1 and core switch 1, or access switch 1 is used to connect Computing node 1.2 and core switch 2, etc.
  • access switch 2 is used to connect computing node 2.1 and core switch 1, or access switch 2 is used to connect computing node 2.2 and core switch 2, etc.
  • the access switch is, for example, a high-performance top-of-rack (tor) switch.
  • the access switch is connected upward to the core switch and downward to the computing node.
  • the computing nodes are connected upwards with access switches, and the core switches are connected downwards with access switches.
  • any one of the M access switches is also connected downward to the multiple computing nodes in its corresponding group, thereby enabling the access switch to connect any two computing nodes among the multiple computing nodes in its corresponding group.
  • group 1 includes computing nodes 1.1 to 1.k, and any two computing nodes among computing nodes 1.1 to 1.k can be connected through access switch 1;
  • group 2 includes computing node 2.1 to computing node 2.k, and any two computing nodes among computing node 2.1 to computing node 2.k can be connected through access switch 2, etc.
  • the M access switches are all connected upward to the same core switch, so that any two access switches among the M access switches can be connected through the core switch. Combined with the example in Figure 4, all M access switches are connected upward to the core switch 1, so any two access switches among the M access switches can be connected through the core switch 1.
  • Data can be transmitted between any two computing nodes in the distributed training system. For details, see Example 1 and Example 2 below.
  • Example 1: Two computing nodes connected to the same access switch can transmit data through the access switch. Based on the example in Figure 4, both computing node 1.1 and computing node 1.2 are connected to access switch 1. The path for computing node 1.1 to send data to computing node 1.2 is: computing node 1.1 → access switch 1 → computing node 1.2. In this application, "→" may indicate the direction of data transmission.
  • Example 2: Two computing nodes connected to different access switches can transmit data through their respective access switches and the core switch jointly connected to the two access switches.
  • computing node 1.1 is connected to access switch 1
  • computing node 2.1 is connected to access switch 2
  • both access switch 1 and access switch 2 are connected to core switch 1.
  • The path for computing node 1.1 to send data to computing node 2.1 is: computing node 1.1 → access switch 1 → core switch 1 → access switch 2 → computing node 2.1.
  • two computing nodes located in the same group can perform intra-group communication through the access switch corresponding to the group.
  • The path traveled by intra-group communication can be called an intra-group path, and this transmission method can be called the intra-group transmission method.
  • two computing nodes located in different groups can communicate between groups through the access switches and core switches corresponding to the different groups.
  • the path passed by inter-group communication can be called an inter-group path, and this transmission method can be called the inter-group transmission method.
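  • In other words, whether a pair of computing nodes uses an intra-group path or an inter-group path is determined by whether they hang off the same access switch. The sketch below reproduces the two example paths above; the node-naming convention ("group.index") and the assumption that inter-group paths link nodes sharing the same in-group index through the core switch with that index follow the Figure 4 examples and are illustrative.

```python
def path_between(src: str, dst: str) -> str:
    """Assumes Figure 4 naming: node '<group>.<index>', access switch = group number,
    and nodes sharing the same index reach each other through core switch <index>."""
    src_group, src_index = src.split(".")
    dst_group, dst_index = dst.split(".")
    if src_group == dst_group:                          # intra-group transmission
        return f"node {src} -> access switch {src_group} -> node {dst}"
    assert src_index == dst_index, "inter-group paths here link nodes with the same index"
    return (f"node {src} -> access switch {src_group} -> core switch {src_index} "
            f"-> access switch {dst_group} -> node {dst}")

print(path_between("1.1", "1.2"))   # node 1.1 -> access switch 1 -> node 1.2
print(path_between("1.1", "2.1"))   # node 1.1 -> access switch 1 -> core switch 1 -> access switch 2 -> node 2.1
```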
  • the access switch includes a first port used for upward connection to the core switch, and a second port used for downward connection to the computing node.
  • the compute node includes a third port for upward connection to the access switch.
  • the core switch includes a fourth port for downward connectivity to the access switch.
  • each access switch includes K first ports, and the K first ports are respectively connected upward to a fourth port of each of the K core switches; each core switch includes M fourth ports, and the M fourth ports are respectively connected downward to a first port of each of the M access switches.
  • each access switch also includes K second ports, and the K second ports are connected downward to the computing nodes in the corresponding group of the access switch.
  • the second ports are connected to the third ports of the computing nodes. For example, each computing node includes 1 third port, the access switch includes 4 second ports, and the access switch is connected downward to the 4 computing nodes in its corresponding group; for another example, each computing node includes 8 third ports, the access switch includes 32 second ports, and the access switch is connected downward to the 4 computing nodes in its corresponding group.
  • Method 1: As shown in Figure 5a, there are 4 core switches, 32 access switches, and 4 computing nodes in each group, that is, K and k are both equal to 4, and M is equal to 32.
  • each core switch includes 32 fourth ports (denoted as fourth port 1 to fourth port 32);
  • each access switch includes 4 first ports (denoted as first port 1 to first port 4). ) and 4 second ports (denoted as second port 1 to second port 4);
  • each computing node includes 1 third port.
  • the port connection relationships included in the distributed training system can be seen in Figure 5a.
  • the four first ports of access switch 1 are connected upward to the fourth port 1 of core switch 1, the fourth port 1 of core switch 2, the fourth port 1 of core switch 3, and the fourth port 1 of core switch 4, respectively.
  • the four second ports are respectively connected downward to four computing nodes, namely computing node 1.1 to computing node 1.4.
  • the 32 fourth ports of core switch 1 are respectively connected downward to the first port 1 of access switch 1, the first port 1 of access switch 2, ..., the first port 1 of access switch 31, and the first port 1 of access switch 32.
  • Method 2: As shown in Figure 5b, there are 32 core switches, 32 access switches, and 4 computing nodes in each group, that is, K equals 32, M equals 32, and k equals 4.
  • each core switch includes 32 fourth ports (denoted as fourth port 1 to fourth port 32);
  • each access switch includes 32 first ports (denoted as first port 1 to first port 32). ) and 32 second ports (denoted as second port 1 to second port 32);
  • each computing node includes 8 third ports.
  • the port connection relationships included in the distributed training system can be seen in Figure 5b.
  • the 32 first ports of access switch 1 are connected upward to the fourth port 1 of core switch 1, the fourth port 1 of core switch 2, the fourth port 1 of core switch 3,... , the fourth port 1 of the core switch 31, and the fourth port 1 of the core switch 32.
  • the 32 second ports are respectively connected downward to four computing nodes, namely computing node 1.1 to computing node 1.4.
  • the 32 fourth ports of core switch 1 are respectively connected downward to the first port 1 of access switch 1, the first port 1 of access switch 2, ..., the first port 1 of access switch 31, and the first port 1 of access switch 32.
  • the K first ports and the K second ports are bound, or in other words, a one-to-one mapping relationship between the K first ports and the K second ports is set inside the access switch, so that, in the access switch, data input from a certain first port among the multiple first ports is output from the second port corresponding to that first port among the multiple second ports.
  • the first port 1 to the first port 4 correspond to the second port 1 to the second port 4 respectively.
  • when access switch 1 receives data through the first port 1, it can output the data through the second port 1.
  • when access switch 1 receives data through the first port 2, it can output the data through the second port 2. In this way, the access switch is prevented from, based on the load balancing principle, outputting data through an uncertain second port among the multiple second ports to an uncertain core switch.
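  • The effect of binding the first ports to the second ports one to one is that forwarding inside the access switch becomes a fixed table lookup instead of a load-balanced port selection. A minimal sketch of such a forwarding table (the port numbering follows Figure 5a; the function and variable names are illustrative):

```python
K = 4   # number of first/second port pairs per access switch, as in Figure 5a

# Fixed one-to-one mapping: data arriving on first port i leaves on second port i, and vice versa.
first_to_second = {i: i for i in range(1, K + 1)}
second_to_first = {v: k for k, v in first_to_second.items()}

def forward(ingress: str, port: int) -> str:
    """Deterministic forwarding inside the access switch: no load-balanced port selection."""
    if ingress == "first":    # data coming down from core switch i goes out towards a computing node
        return f"second port {first_to_second[port]}"
    return f"first port {second_to_first[port]}"   # data coming up from a computing node goes to core switch i

print(forward("first", 1))    # second port 1
print(forward("second", 2))   # first port 2
```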
  • the distributed training system shown in Figure 4 or Figure 5a or Figure 5b includes a core layer and an access layer, wherein the core layer includes core switches 1 to core switches K, and the access layer includes access switch 1 to access switch M.
  • the present application may also include multiple core layers, which are located above the access layer and used to achieve connectivity between any two access switches in the access layer. Among any two adjacent core layers of the plurality of core layers, one or more core switches in the previous core layer are used to connect any two core switches in the next core layer.
  • FIG 6 is an architectural schematic diagram of yet another distributed training system provided by the present application.
  • the distributed training system includes two core layers.
  • the two core layers can be respectively recorded as a first core layer and a second core layer.
  • the second core layer is located above the first core layer.
  • the first core layer is located above the access layer.
  • the second core layer includes one or more core switches (Figure 6 shows two core switches, denoted as core switch A and core switch B respectively).
  • the first core layer includes K core switches (still represented as core switch 1 to core switch K in Figure 6).
  • the access layer includes M access switches (still represented as access switch 1 to access switch M in Figure 6), and access switch 1 to access switch M respectively correspond to group 1 to group M. Each group still includes one or more compute nodes.
  • the one or more core switches in the second core layer are used to realize connectivity between any two core switches in the first core layer;
  • the K core switches in the first core layer are used to realize connectivity between any two access switches in the access layer; and the M access switches in the access layer are used to realize connectivity between the computing nodes in their respective groups.
  • the distributed training system also includes management nodes.
  • the management node is a node independent of the computing cluster. This node is connected to multiple computing nodes in the computing cluster to manage each computing node in the computing cluster.
  • the management node is, for example, a computer, or a module installed on the computer, such as a plug-in.
  • the management node is a computing node in the computing cluster.
  • the computing node is connected to multiple other computing nodes in the computing cluster; it not only has the ability to manage the other computing nodes in the computing cluster, but also has the computing capability of a computing node.
  • the management node is, for example, a physical server, which includes one or more computing units (or processing units), such as a CPU, NPU or GPU.
  • the management node includes multiple functional modules, some of the multiple functional modules are deployed in computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • the management node is used to select N computing nodes for distributed training from the computing cluster, and then generate a communication plan based on the N computing nodes.
  • the management node is also used to instruct the communication plan to the N computing nodes, so that the N computing nodes execute the aggregation algorithm during the distributed training process to obtain aggregated data.
  • Step 701 The management node obtains the network topology.
  • the network topology includes the connectivity relationship between core switches and computing nodes in the computing cluster.
  • the network topology obtained by the management node includes:
  • Topology 1: computing node 1.1, computing node 2.1, ..., computing node 32.1 are all connected to core switch 1;
  • Topology 2: computing node 1.2, computing node 2.2, ..., computing node 32.2 are all connected to core switch 2, and so on.
  • the network topology also includes the connectivity relationships between the access switch and the core switch and the computing nodes in the computing cluster.
  • in this case, the network topology obtained by the management node is as follows:
  • Topology 1 further includes:
  • Topology 1-1: computing node 1.1 is connected to core switch 1 through access switch 1;
  • Topology 1-2: computing node 2.1 is connected to core switch 1 through access switch 2;
  • Topology 1-3: computing node 3.1 is connected to core switch 1 through access switch 3, and so on.
  • Topology 2 further includes:
  • Topology 2-1: computing node 1.2 is connected to core switch 2 through access switch 1;
  • Topology 2-2: computing node 2.2 is connected to core switch 2 through access switch 2;
  • Topology 2-3: computing node 3.2 is connected to core switch 2 through access switch 3, and so on.
  • the network topology not only includes the connectivity relationship between the access switches and the computing nodes in the computing cluster, and the connectivity relationship between the access switches and the core switches in the first core layer, but also includes the connectivity relationship between the core switches in the first core layer and the core switches in the second core layer.
  • the network topology obtained by the management node not only includes the above-mentioned topology 1 and topology 2, but also includes the following topology A and topology B:
  • Topology A: core switch 1, core switch 2, ..., core switch K are all connected to core switch A;
  • Topology B: core switch 1, core switch 2, ..., core switch K are all connected to core switch B.
  • the above is only an exemplary form of the network topology, and the network topology obtained by the management node can also be in other forms, which is not limited by this application.
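Purely as a non-authoritative sketch, the network topology obtained in step 701 could be held as a mapping from each core switch to the computing nodes reachable through it, annotated with the access switch on each path; every identifier below is an assumed name, not one prescribed by the patent.

```python
# Illustrative sketch of one possible form of the network topology obtained in step 701.
network_topology = {
    "core_switch_1": [
        {"node": "node_1.1", "via_access_switch": "access_switch_1"},
        {"node": "node_2.1", "via_access_switch": "access_switch_2"},
        {"node": "node_3.1", "via_access_switch": "access_switch_3"},
        # ..., up to node_32.1 via access_switch_32
    ],
    "core_switch_2": [
        {"node": "node_1.2", "via_access_switch": "access_switch_1"},
        {"node": "node_2.2", "via_access_switch": "access_switch_2"},
        # ...
    ],
}

def connected_nodes(core_switch: str) -> list[str]:
    """All computing nodes connected (downward) to the given core switch."""
    return [entry["node"] for entry in network_topology.get(core_switch, [])]

print(connected_nodes("core_switch_1"))  # ['node_1.1', 'node_2.1', 'node_3.1']
```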
  • Step 702 The management node determines the communication plan between the N computing nodes based on the network topology.
  • the N computing nodes are used to jointly train a certain model (called the target model) in a distributed system.
  • the communication plan includes X inter-group paths (denoted as inter-group path 1 to inter-group path X), where X is an integer greater than 2.
  • each inter-group path includes two computing nodes belonging to different groups among the N computing nodes, and a core switch used to connect the two computing nodes.
  • inter-group path 1 includes computing node 1.1, core switch 1, and computing node 2.1;
  • inter-group path 2 includes computing node 2.2, core switch 2, and computing node 32.2.
  • Each of the X inter-group paths can be used to transmit data between two computing nodes in the inter-group path.
  • inter-group path 1 includes computing node 1.1 and computing node 2.1, and inter-group path 1 is used to transmit data between computing node 1.1 and computing node 2.1;
  • inter-group path 2 includes computing node 2.2 and computing node 32.2, inter-group path 2 is used to transmit data between computing node 2.2 and computing node 32.2.
  • based on the network topology, the management node determines the communication plan such that the amounts of data transmitted over the X inter-group paths meet the preset condition.
  • for any one inter-group path among the X inter-group paths: when data traverses the core switch included in that inter-group path, it specifically passes through one input port and one output port of the core switch.
  • the output port of the core switch that the inter-group path passes through is used as a traffic port, and the data flow (or traffic) of the traffic port is used to measure whether the amount of data transmitted over the inter-group path meets the preset condition.
  • the data traffic of the traffic port is related to the amount of data transmitted between the two computing nodes in the inter-group path.
  • the X inter-group paths include Y traffic ports in total, where Y is an integer greater than 2.
  • if no two of the X traffic ports corresponding to the X inter-group paths are the same, then X equals Y.
  • if the same traffic port exists among the X traffic ports corresponding to the X inter-group paths, that is, two or more of the X inter-group paths correspond to the same traffic port, then X is greater than Y.
  • that the amounts of data transmitted over the X inter-group paths meet the preset condition specifically means: among the Y traffic ports, the difference in data traffic of any two traffic ports is less than the threshold.
  • the X inter-group paths are specifically inter-group path 1 to inter-group path 10, that is, X is equal to 10.
  • Inter-group path 1 to inter-group path 10 respectively correspond to traffic port 1 to traffic port 10, that is, Y is equal to 10, where the difference in data traffic of any two traffic ports from traffic port 1 to traffic port 10 is less than the threshold.
  • in another case, inter-group path 1 to inter-group path 6 correspond to traffic port 1 to traffic port 6 respectively, inter-group path 7 and inter-group path 8 correspond to the same traffic port 7, and inter-group path 9 and inter-group path 10 correspond to the same traffic port 8, that is, Y is equal to 8; here the difference in data traffic of any two traffic ports from traffic port 1 to traffic port 8 is less than the threshold.
  • Inter-group path 1 includes computing node 1.1, core switch 1 and computing node 2.1.
  • the core switch 1 receives the data of the computing node 1.1 through the fourth port 1 of the core switch 1, and outputs the data to the computing node 2.1 through the fourth port 2 of the core switch 1.
  • the fourth port 2 of the core switch 1 is the traffic port of the core switch 1 (recorded as traffic port 1), where the data flow of the traffic port 1 is related to the amount of data exchanged between the computing node 1.1 and the computing node 2.1.
  • Inter-group path 2 includes computing node 2.2, core switch 2 and computing node 32.2.
  • the core switch 2 receives the data of the computing node 2.2 through the fourth port 2 of the core switch 2, and outputs the data to the computing node 32.2 through the fourth port 32 of the core switch 2.
  • the fourth port 32 of the core switch 2 is the traffic port of the core switch 2 (recorded as traffic port 2), where the data flow of the traffic port 2 is related to the amount of data exchanged between the computing node 2.2 and the computing node 32.2.
  • that inter-group path 1 and inter-group path 2 meet the preset condition specifically means that the difference between the data traffic of traffic port 1 and the data traffic of traffic port 2 is less than the threshold.
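To make the preset condition concrete, the following minimal sketch assumes that the data flow of each traffic port is available as a single number per iteration; checking that any two traffic ports differ by less than the threshold then reduces to comparing the maximum and minimum flows. The port names and values are hypothetical.

```python
# Illustrative check of the preset condition: among the Y traffic ports used by the
# X inter-group paths, the data-flow difference of any two ports is below a threshold.
def meets_preset_condition(traffic_per_port: dict[str, float], threshold: float) -> bool:
    flows = list(traffic_per_port.values())
    # max - min < threshold is equivalent to "every pairwise difference < threshold"
    return (max(flows) - min(flows)) < threshold

# Hypothetical flows (e.g. bytes per iteration) on traffic port 1 and traffic port 2.
flows = {"core1.port2": 64.0, "core2.port32": 66.0}
print(meets_preset_condition(flows, threshold=8.0))  # True: difference 2.0 < 8.0
```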
  • when the distributed training system includes multiple core layers, the preset condition specifically means that, within the same core layer, the difference in data traffic of any two of the Y traffic ports is less than the threshold.
  • the thresholds corresponding to different core layers are the same or different.
  • for example, the inter-group paths are specifically inter-group path 1 to inter-group path 5, that is, X is equal to 5.
  • inter-group path 1 to inter-group path 3 correspond to traffic port 11 to traffic port 13 of core switch 1 respectively;
  • inter-group path 4 and inter-group path 5 correspond to the same traffic port 21 of core switch 2;
  • inter-group path 1 to inter-group path 5 respectively correspond to traffic port A1 to traffic port A5 of core switch A.
  • if the first core layer corresponds to threshold 1 and the second core layer corresponds to threshold 2, then the difference in data traffic of any two traffic ports among traffic port 11 to traffic port 13 and traffic port 21 is less than threshold 1, and the difference in data traffic of any two traffic ports from traffic port A1 to traffic port A5 is less than threshold 2.
  • the management node has the ability to determine the communication plan between N computing nodes.
  • the goal may be to ensure that any two of the X inter-group paths include different core switches; in other words, the X inter-group paths correspond to X different core switches respectively, so that the X different core switches each transmit the data of their own inter-group path, avoiding traffic congestion problems.
  • when the management node determines the communication plan between the N computing nodes and finds that two of the X inter-group paths include the same core switch, it determines that the traffic ports of that core switch in the two inter-group paths are different. For example, if the management node determines that inter-group path 1 and inter-group path 2 both pass through core switch 1, it can further determine that inter-group path 1 passes through traffic port 11 of core switch 1 and inter-group path 2 passes through traffic port 12 of core switch 1. In this way, even if a core switch needs to transmit the data of multiple inter-group paths at the same time, it can transmit the two pieces of data through two different traffic ports, which also avoids traffic congestion problems.
  • the network topology also includes the connectivity relationship between the access switch, the core switch, and the computing nodes in the computing cluster. That is, the network topology specifically includes the connectivity relationship between the core switch, the computing nodes in the computing cluster, and the access switch.
  • the inter-group path determined by the management node based on the network topology also includes the access switch corresponding to the group to which the two computing nodes belong. The access switch is used to connect the core switch and the computing nodes under the access switch.
  • inter-group path 1 includes computing node 1.1, access switch 1, core switch 1, access switch 2 and computing node 2.1.
  • inter-group path 1 can also be expressed as: computing node 1.1 ↔ access switch 1 ↔ core switch 1 ↔ access switch 2 ↔ computing node 2.1, where "↔" indicates two-way transmission; for example, "computing node 1.1 ↔ access switch 1" means that computing node 1.1 can transmit data to access switch 1, and access switch 1 can also transmit data to computing node 1.1.
  • the access switch 1 is used to connect the computing node 1.1 and the core switch 1;
  • access switch 2 is used to connect computing node 2.1 and core switch 1.
  • when the management node determines the communication plan between the N computing nodes, it can determine not only X inter-group paths but also Z intra-group paths, where Z is an integer greater than 2. For any intra-group path, the intra-group path includes two computing nodes belonging to the same group among the N computing nodes, and an access switch used to connect the two computing nodes (in other words, the access switch corresponding to that group). Taking the above example of topology 1, the management node determines that intra-group path 1 includes computing node 1.1, access switch 1 and computing node 1.2, or represents intra-group path 1 as: computing node 1.1 ↔ access switch 1 ↔ computing node 1.2, where "↔" indicates two-way transmission.
  • when determining the communication plan, the management node can arrange for the amount of data transmitted over an intra-group path to be greater than the amount of data transmitted over an inter-group path.
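As a small, non-normative sketch, an inter-group path and an intra-group path could be recorded as the ordered hops they traverse, which also makes the preference for placing larger data volumes on intra-group paths easy to express; the hop names below are assumptions.

```python
# Illustrative representation of the two kinds of paths in the communication plan.
from dataclasses import dataclass

@dataclass
class Path:
    hops: tuple[str, ...]      # ordered hops; transmission is bidirectional
    data_amount: float = 0.0   # amount of data planned on this path (arbitrary units)

inter_group_path_1 = Path(("node_1.1", "access_switch_1", "core_switch_1",
                           "access_switch_2", "node_2.1"))
intra_group_path_1 = Path(("node_1.1", "access_switch_1", "node_1.2"))

def is_intra_group(path: Path) -> bool:
    # An intra-group path never traverses a core switch.
    return not any(hop.startswith("core_switch") for hop in path.hops)

# The planner can then prefer putting the larger data volumes on intra-group paths.
assert is_intra_group(intra_group_path_1) and not is_intra_group(inter_group_path_1)
```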
  • the management node determines the communication plan between the N computing nodes based on the network topology and communication algorithm.
  • the communication algorithm is used to aggregate the data obtained by the N computing nodes performing model training in each iteration during the distributed training process, so that the N computing nodes perform the next round of model based on the aggregated data. Train to get the final target model.
  • Communication algorithms include ring algorithm, HD algorithm, binary tree algorithm, etc.
  • FIG. 8 is a schematic flowchart of a management node determining a communication plan provided by this application as an example:
  • Step 801 The management node obtains a training task, which includes the communication algorithm and the total number of computing nodes N.
  • in the front-end interface, when a user prepares to use a computing cluster to train a certain target model, the user can enter the total number of computing nodes N and the communication algorithm required for distributed training.
  • the front-end interface generates a training task based on user input and sends the training task to the management node.
  • the training task also includes the resource type of the computing nodes, the parameters of the training task, the task priority, etc., where the resource type includes one or more of GPU, NPU, and CPU; the parameters of the training task are, for example, iteration termination conditions (such as the number of iterations, gradient conditions, etc.); the task priority indicates the priority of the current training task, and the higher the priority, the more important the training task.
  • the management node needs to prioritize the selection of computing nodes for training tasks with high priority.
  • Step 802 The management node determines N computing nodes and communication plans between the N computing nodes from multiple computing nodes in the idle state in the computing cluster based on the network topology, the total number of computing nodes N and the communication algorithm.
  • a computing cluster includes occupied computing nodes and multiple computing nodes that are idle.
  • the management node obtains which computing nodes are in the idle state in the current computing cluster, then selects N computing nodes from these idle computing nodes based on the network topology, the total number of computing nodes N and the communication algorithm, and then determines the communication plan among the N computing nodes.
  • the management node first selects N computing nodes, and then performs communication planning on the selected computing nodes, reducing the amount of calculation in the communication planning process.
  • when the management node selects N computing nodes from the multiple computing nodes that are idle in the current computing cluster, it can select based on the affinity principle, that is, it tries to select computing nodes in the same group, so as to increase the proportion of intra-group transmission (intra-group paths) in each iteration and correspondingly reduce the proportion of inter-group transmission (inter-group paths) in each iteration, thereby avoiding traffic congestion on core switch ports caused by too much inter-group transmission.
  • the management node may also perform the following steps a to c to reasonably plan the communication method (ie, communication planning) of the N computing nodes in the communication algorithm.
  • in step a, the management node pairs computing nodes belonging to the same group among the N computing nodes; when multiple computing nodes remain unpaired, the management node pairs those remaining computing nodes, so as to obtain N/2 node pairs.
  • in other words, the management node first pairs the N computing nodes, trying to pair two computing nodes in the same group; if, after the computing nodes in the same groups have been paired, there are still computing nodes located in different groups that have not yet been paired, those computing nodes are paired across groups, so as to obtain N/2 node pairs.
  • for example, 16 computing nodes are selected, namely computing nodes 1.1 to 1.5 in group 1, computing nodes 2.1 to 2.3 in group 2, computing nodes 3.1, 3.2, 3.5 and 3.6 in group 3, and computing nodes 4.1 to 4.4 in group 4.
  • the management node can first pair the computing nodes in group 1 to obtain: (computing node 1.1, computing node 1.2), (computing node 1.3, computing node 1.4); pair the computing nodes in group 2 to obtain: ( Computing node 2.1, computing node 2.2); pairing of computing nodes in group 3 results in: (computing node 3.1, computing node 3.2), (computing node 3.5, computing node 3.6); pairing of computing nodes in group 4 results in: (computing node 4.1, computing node 4.2), (computing node 4.3, computing node 4.4). Further, there are remaining computing nodes 1.5 and 2.3 that have not yet been paired, and the management node pairs the two computing nodes to obtain (computing node 1.5, computing node 2.3).
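A minimal sketch of step a, assuming computing nodes are named "group.index" as in the example above; the pairing logic (same group first, leftovers across groups) follows the description, while the function and variable names are illustrative.

```python
# Illustrative sketch of step a: pair nodes inside the same group first,
# then pair whatever is left across groups, yielding N/2 node pairs.
from collections import defaultdict

def pair_nodes(nodes: list[str]) -> list[tuple[str, str]]:
    by_group = defaultdict(list)
    for node in nodes:                      # node names look like "1.1", "3.5", ...
        by_group[node.split(".")[0]].append(node)

    pairs, leftovers = [], []
    for members in by_group.values():
        while len(members) >= 2:
            pairs.append((members.pop(0), members.pop(0)))
        leftovers.extend(members)           # at most one unpaired node per group
    while len(leftovers) >= 2:              # pair the remaining nodes across groups
        pairs.append((leftovers.pop(0), leftovers.pop(0)))
    return pairs

nodes = ["1.1", "1.2", "1.3", "1.4", "1.5", "2.1", "2.2", "2.3",
         "3.1", "3.2", "3.5", "3.6", "4.1", "4.2", "4.3", "4.4"]
print(pair_nodes(nodes))   # 8 node pairs, e.g. ('1.1', '1.2'), ..., ('1.5', '2.3')
```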
  • Step b The management node determines the communication plans of the N computing nodes in the multi-round communication based on the multi-round communication of the communication algorithm and the N/2 node pairs.
  • for two computing nodes in the same group, data is transmitted through the access switch without passing through a core switch, so the possibility of traffic congestion is small. Therefore, when planning communication, the management node completes the steps that need to transmit larger amounts of data with relatively more intra-group communication (or, equivalently, relatively less inter-group communication), so as to avoid traffic congestion problems.
  • for example, the amount of data that needs to be transmitted in step 1 of reduce-scatter is greater than the amount that needs to be transmitted in step 2, so the number of inter-group paths included in step 1 is smaller than the number of inter-group paths included in step 2.
  • if the N/2 node pairs obtained by the management node are as in the example in step a above, then reduce-scatter in the HD algorithm includes 4 steps, respectively represented as S1 to S4, and allgather in the HD algorithm includes 4 steps, respectively represented as S5 to S8; that is, the HD algorithm has a total of 8 rounds of communication.
  • for S1 to S4 in reduce-scatter and S5 to S8 in allgather, please refer to the description in the relevant embodiment of Figure 3 above.
  • the management node can determine the communication plans of the 16 computing nodes in the eight rounds of communication based on the HD algorithm.
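For orientation only, with 16 computing nodes the HD algorithm has log2(16) = 4 reduce-scatter rounds (S1 to S4) and 4 allgather rounds (S5 to S8), i.e. 8 rounds in total; the tiny sketch below merely computes that count and assumes a power-of-two node count.

```python
import math

def hd_round_count(num_nodes: int) -> tuple[int, int]:
    """Rounds of reduce-scatter and allgather in the halving-doubling algorithm
    (assumes the node count is a power of two)."""
    steps = int(math.log2(num_nodes))
    return steps, steps          # e.g. (4, 4) for 16 nodes -> S1..S4 and S5..S8

print(hd_round_count(16))        # (4, 4): 8 communication rounds in total
```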
  • FIG. 9 illustrates the communication relationships based on the HD algorithm.
  • the vertices and edges respectively represent steps in the HD algorithm.
  • the vertices correspond to node pairs, and the management node determines the communication plan corresponding to each step based on the node pair corresponding to the vertex.
  • the management node can determine the association between the vertex and node pairs in the following way:
  • the management node places any node pair on a certain vertex of the cube, for example, (computing node 1.1, computing node 1.2) is placed at the first vertex of the cube.
  • (computing node 1.1, computing node 1.2) is represented as (1.1, 1.2); the others are similar and will not be described again.
  • the first vertex represents S1 in reduce-scatter
  • the three edges connected to the first vertex represent S2, S3 and S4 in reduce-scatter respectively
  • the first vertex also represents S8 in allgather, and the three edges connected to the first vertex represent S7, S6 and S5 in allgather respectively.
  • reduce-scatter is used as an example.
  • since the amount of data transmitted in S2 is larger than the amount transmitted in S3 or S4, priority is given to determining the node pair for the second vertex, which lies on the edge corresponding to S2;
  • priority is also given to selecting a node pair whose computing nodes are in the same group as the computing nodes of the node pair on the first vertex; for example, (1.3, 1.4) is selected and placed on the second vertex.
  • the remaining two edges connected to the first vertex represent S3 and S4 respectively
  • the remaining two edges connected to the second vertex represent S3 and S4 respectively.
  • since the amount of data transmitted in S3 is larger than that transmitted in S4, priority is then given to determining the node pairs for the vertices on the edges corresponding to S3. For example, the node pair for the S3 edge of the first vertex is selected first, and a node pair whose computing nodes are in the same group as the computing nodes of the node pair on the first vertex is still preferred;
  • for example, (1.5, 2.3) is selected and placed on the third vertex; then the management node selects the node pair for the S3 edge of the second vertex, and so on, until the 8 vertices of the cube are all assigned node pairs, obtaining the correspondence shown in (b) of Figure 9.
  • the first vertex corresponding to S1 corresponds to (1.1, 1.2), that is, the computing node 1.1 and the computing node 1.2 communicate in S1.
  • one of the edges corresponding to S2 connects the two vertices (1.1, 1.2) and (1.3, 1.4).
  • the computing nodes at corresponding positions of the two vertices are computing node 1.1 and computing node 1.3, and computing node 1.2 and computing node 1.4.
  • the computing node 1.1 and the computing node 1.3 communicate in S2; the computing node 1.2 and the computing node 1.4 communicate in S2.
  • one of the edges corresponding to S4 connects the two vertices (1.1, 1.2) and (3.1, 3.2).
  • the computing nodes at corresponding positions of the two vertices are computing node 1.1 and computing node 3.1, and computing node 1.2 and computing node 3.2.
  • computing node 1.1 communicates with computing node 3.1 in S4
  • computing node 1.2 communicates with computing node 3.2 in S4.
  • another edge corresponding to S4 connects the two vertices (1.5, 2.3) and (4.1, 4.2).
  • the computing nodes at corresponding positions of the two vertices are computing node 1.5 and computing node 4.1, and computing node 2.3 and computing node 4.2.
  • computing node 1.5 communicates with computing node 4.1 in S4
  • computing node 2.3 communicates with computing node 4.2 in S4.
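The vertex-assignment idea can be sketched abstractly: with the 8 node pairs placed on the corners of a 3-dimensional cube, the pairs on two corners that differ in a given bit communicate in the corresponding step, and within those pairs the correspondingly positioned computing nodes talk to each other. The assignment below is a hypothetical example for illustration, not the exact layout of Figure 9.

```python
# Illustrative sketch: given an assignment of node pairs to the corners of a 3-D cube
# (corner index 0..7 in binary), derive who talks to whom in the inter-pair steps.
vertex_pairs = {
    0b000: ("1.1", "1.2"), 0b001: ("1.3", "1.4"),
    0b010: ("1.5", "2.3"), 0b011: ("2.1", "2.2"),
    0b100: ("3.1", "3.2"), 0b101: ("3.5", "3.6"),
    0b110: ("4.1", "4.2"), 0b111: ("4.3", "4.4"),
}

def partners_for_step(bit: int) -> list[tuple[str, str]]:
    """Communication partners when corners differing in `bit` exchange data
    (bit 0 ~ S2, bit 1 ~ S3, bit 2 ~ S4 in the text's numbering)."""
    links = []
    for v, pair in vertex_pairs.items():
        w = v ^ (1 << bit)
        if v < w:  # visit each edge of the cube only once
            other = vertex_pairs[w]
            links.extend([(pair[0], other[0]), (pair[1], other[1])])
    return links

print(partners_for_step(2))  # e.g. [('1.1', '3.1'), ('1.2', '3.2'), ...]
```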
  • Step c If the management node determines that, in the i-th round of communication among the multiple rounds, the communication plan of the N computing nodes includes multiple inter-group paths and the amounts of data transmitted over those inter-group paths do not meet the preset condition, the management node adjusts the communication plan of the N computing nodes in the i-th round of communication, where i is a positive integer.
  • when computing node 4.1 sends data to computing node 1.5, the path it traverses is computing node 4.1 → access switch 4 → core switch 1 → access switch 1 → computing node 1.5, which specifically passes through the fourth port 1 of core switch 1;
  • when computing node 3.1 sends data to computing node 1.1, the path it traverses is computing node 3.1 → access switch 3 → core switch 1 → access switch 1 → computing node 1.1, which specifically passes through the fourth port 1 of core switch 1.
  • the two inter-group paths both pass through the fourth port 1 of core switch 1; that is, traffic congestion occurs on the fourth port 1 of core switch 1, and the amounts of data transmitted over the two inter-group paths do not meet the preset condition.
  • the management node can adjust the communication plan of the N computing nodes in this step so that the amount of data transmitted by the multiple inter-group paths meets the preset conditions. For example, adjust the node pairs in step a, such as exchanging the order of the node pair (computing node 4.1, computing node 4.2) and the node pair (computing node 4.3, computing node 4.4).
  • the corresponding relationship after the exchange is shown in (c) in Figure 9. Further, in S4:
  • computing node 4.3 sends data to computing node 1.5 along the path computing node 4.3 → access switch 4 → core switch 3 → access switch 1 → computing node 1.5, which specifically passes through the fourth port 1 of core switch 3;
  • computing node 3.5 sends data to computing node 1.3 along the path computing node 3.5 → access switch 3 → core switch 5 → access switch 1 → computing node 1.3, which specifically passes through the fourth port 1 of core switch 5;
  • computing node 3.6 sends data to computing node 1.4 along the path computing node 3.6 → access switch 3 → core switch 6 → access switch 1 → computing node 1.4, which specifically passes through the fourth port 1 of core switch 6;
  • computing node 3.1 sends data to computing node 1.1 along the path computing node 3.1 → access switch 3 → core switch 1 → access switch 1 → computing node 1.1, which specifically passes through the fourth port 1 of core switch 1;
  • computing node 3.2 sends data to computing node 1.2 along the path computing node 3.2 → access switch 3 → core switch 2 → access switch 1 → computing node 1.2, which specifically passes through the fourth port 1 of core switch 2.
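A rough sketch of the check behind step c, under the assumption that each inter-group transfer of a round can be mapped to the core-switch output (traffic) port it would use given the topology described above; if two transfers of the same round land on the same port, the planner adjusts the node pairs (for example by swapping two pairs) and re-checks. All helper names are illustrative.

```python
# Illustrative sketch of step c: detect inter-group transfers of one communication
# round that collide on the same core-switch traffic port, then adjust the plan.
from collections import Counter

def traffic_port(src: str, dst: str) -> str:
    """Hypothetical mapping from an inter-group transfer to the core-switch
    output port it uses (core switch chosen by the source's in-group index,
    output port chosen by the destination's group/access switch)."""
    core = src.split(".")[1]                 # e.g. node "4.1" -> core switch 1
    return f"core{core}.port{dst.split('.')[0]}"

def congested(transfers: list[tuple[str, str]]) -> bool:
    ports = Counter(traffic_port(s, d) for s, d in transfers
                    if s.split(".")[0] != d.split(".")[0])   # inter-group only
    return any(count > 1 for count in ports.values())

round_s4 = [("4.1", "1.5"), ("3.1", "1.1")]   # both would use core switch 1, port 1
print(congested(round_s4))                    # True -> adjust node pairs and re-plan
```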
  • the communication plans of S8, S7, S6 and S5 in allgather are the same as the communication plans of S1, S2, S3 and S4 in reduce-scatter respectively.
  • computing node 3.1 communicates with computing node 3.5,
  • computing node 3.2 communicates with computing node 3.6,
  • computing node 1.1 communicates with computing node 1.5,
  • computing node 1.2 communicates with computing node 2.3,
  • and computing node 1.4 communicates with computing node 3.6, and so on.
  • since the communication plans of S8, S7, S6 and S5 are the same as those of S1, S2, S3 and S4 respectively, the communication plans of S8, S7, S6 and S5 are not shown in (c) of Figure 9.
  • the management node also needs to determine what data needs to be transmitted between the two computing nodes in each path.
  • the computing node 1.1 communicates with the computing node 1.3. Specifically, the computing node 1.1 sends half of its intermediate data to the computing node 1.3.
  • the communication plan determined by the management node not only includes the intra-group path "computing node 1.1 ↔ access switch 1 ↔ computing node 1.3" but also includes indication information of the data to be sent from computing node 1.1 to computing node 1.3 (such as half of the intermediate data).
  • the management node can also determine planning information based on the communication plan.
  • the planning information includes path information corresponding to the multiple inter-group paths, where the path information corresponding to an inter-group path indicates that the inter-group path is used by the two computing nodes in that inter-group path to transmit data to each other.
  • the management node sends the planning information to N computing nodes respectively.
  • each of the N computing nodes determines, based on the received planning information, what data needs to be sent to which computing node, and/or what data needs to be received from which computing node.
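For illustration only, the planning information could be distributed as per-path records from which each computing node derives what it must send and what it will receive; the field names and payload descriptions below are assumptions, not the patent's format.

```python
# Illustrative sketch: each computing node filters the planning information it
# receives from the management node to work out its own sends and receives.
planning_info = [
    {"path": ("node_1.1", "core_switch_1", "node_2.1"),
     "send": {"node_1.1": "half of its intermediate data"}},
    {"path": ("node_2.2", "core_switch_2", "node_32.2"),
     "send": {"node_2.2": "half of its intermediate data"}},
]

def my_transfers(me: str, info: list[dict]) -> tuple[list, list]:
    sends, receives = [], []
    for record in info:
        if me not in record["path"]:
            continue
        for sender, payload in record["send"].items():
            endpoints = (record["path"][0], record["path"][-1])
            peer = [n for n in endpoints if n != sender][0]
            (sends if sender == me else receives).append((peer, payload))
    return sends, receives

print(my_transfers("node_1.1", planning_info))
# -> ([('node_2.1', 'half of its intermediate data')], [])
```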
  • this application may also include steps 703 to 705:
  • the first inter-group path includes a first computing node, a second computing node and a first core switch.
  • the first inter-group path also includes a first access switch corresponding to the group to which the first computing node belongs, and a second access switch corresponding to the group to which the second computing node belongs.
  • Step 703 The management node sends the first information to the first computing node and the second computing node respectively.
  • the planning information determined by the management node includes path information corresponding to the first inter-group path (denoted as first information), where the first information indicates that the first inter-group path is used for mutual transmission between the first computing node and the second computing node. data.
  • the first information includes the first inter-group path, or includes the first computing node and the second computing node; the first information also includes indication information of data to be sent by the first computing node to the second computing node, and/or indication information of data to be sent by the second computing node to the first computing node.
  • the management node sends planning information to the first computing node and the second computing node respectively.
  • the first computing node and the second computing node respectively receive the planning information from the management node and obtain the first information from the planning information.
  • the management node directly sends the first information to the first computing node and the second computing node, and accordingly, the first computing node and the second computing node receive the first information from the management node.
  • Step 704 The first computing node determines the data to be sent to the second computing node (recorded as first data) based on the first information, and sends the first data to the second computing node.
  • the second computing node receives the data from the first computing node and determines, based on the first information, that the received data is the first data from the first computing node. Subsequently, the second computing node updates the first data locally.
  • Step 705 The second computing node determines the data to be sent to the first computing node (recorded as first data) based on the first information, and sends the first data to the first computing node.
  • the first computing node receives the data from the second computing node and determines, based on the first information, that the received data is the first data from the second computing node. Subsequently, the first computing node updates the first data locally.
  • when the first computing node sends data to the second computing node:
  • the first computing node sends the first data to the first access switch
  • the first access switch sends the first data to the first core switch.
  • the first core switch sends the first data to the second access switch;
  • the second access switch sends the first data to the second computing node.
  • when the first computing node sends the first data, it directly transmits the first data to the first access switch to which it is connected.
  • the first ports and the second ports in the first access switch are bound to each other; therefore, after receiving the first data, the first access switch directly outputs the first data through the port bound to the port on which the first data was received, and thus outputs the first data to the core switch connected to that output port.
  • the core switch and the second access switch also determine to transmit the first data to the second computing node based on the existing connection relationship or internal binding relationship.
  • the computing nodes, core switches, and access switches involved in the inter-group path all transmit data according to the existing paths, ensuring the orderliness of data transmission and avoiding traffic congestion on the core switch ports during inter-group communication.
  • this description also applies to the situation where the second computing node sends data to the first computing node, and will not be described again.
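The hop-by-hop relay of the first data along the first inter-group path can be summarized in a trivial sketch; the hop list and the relay helper are assumptions used only to make the forwarding order explicit.

```python
# Illustrative hop-by-hop relay of the first data along the first inter-group path.
def relay(payload: str, hops: list[str]) -> list[str]:
    """Return one log line per hop-to-hop transfer, in forwarding order."""
    return [f"{src} -> {dst}: {payload}" for src, dst in zip(hops, hops[1:])]

first_inter_group_path = ["first computing node", "first access switch",
                          "first core switch", "second access switch",
                          "second computing node"]
for line in relay("first data", first_inter_group_path):
    print(line)
```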
  • this application may also include steps 706 to 708:
  • the first intra-group path includes a first computing node, a third computing node, and a first access switch.
  • Step 706 The management node sends the second information to the first computing node and the third computing node respectively.
  • the planning information determined by the management node includes path information corresponding to the path in the first group (denoted as second information), where the second information indicates that the path in the first group is used for mutual transmission between the first computing node and the third computing node. data.
  • the second information includes the first intra-group path, or includes the first computing node and the third computing node; the second information also includes indication information of data to be sent by the first computing node to the third computing node, and/or indication information of data to be sent by the third computing node to the first computing node.
  • the management node sends planning information to the first computing node and the third computing node respectively.
  • the first computing node and the third computing node respectively receive the planning information from the management node and obtain the second information from the planning information.
  • the management node directly sends the second information to the first computing node and the third computing node respectively, and accordingly, the first computing node and the third computing node receive the second information from the management node.
  • Step 707 The first computing node determines the data to be sent to the third computing node (recorded as second data) based on the second information, and sends the second data to the third computing node.
  • the third computing node receives the data from the first computing node, determines based on the second information that the received data is the second data from the first computing node, and updates the second data locally.
  • Step 708 The third computing node determines the data to be sent to the first computing node (denoted as second data) based on the second information, and sends the second data to the first computing node.
  • the first computing node receives the data from the third computing node and determines, based on the second information, that the received data is the second data from the third computing node. Subsequently, the first computing node updates the second data locally.
  • when the first computing node sends data to the third computing node:
  • the first computing node sends the second data to the access switch to which the first computing node belongs (ie, the first access switch).
  • the first access switch is connected to the third computing node, and the first access switch sends the second data to the third computing node.
  • Figure 7 is divided into a planning phase and a training phase, where the planning phase includes steps 701 to 703 and step 706, and the training phase includes steps 704, 705, 707 and 708; steps 704 and 705 are steps for data transmission between two computing nodes located in different groups, and steps 707 and 708 are steps for data transmission between two computing nodes located in the same group.
  • when the management node is one of the N computing nodes, the management node sends the planning information to the other N-1 computing nodes respectively. In this way, each of the N computing nodes determines, according to the planning information, what data needs to be sent to which computing node, and/or what data needs to be received from which computing node. For a specific implementation, please refer to steps 703 to 708 above.
  • FIG 10 is a schematic structural diagram of a management node provided by this application as an example.
  • the management node includes a task management module 1001, a resource management module 1002 and a training task module 1003.
  • the task management module 1001 obtains the training task, and applies to the resource management module 1002 for resources according to the communication algorithm and the total number of computing nodes N in the training task, that is, applying for computing nodes for distributed training. Specifically, the task management module 1001 sends a task resource application to the resource management module 1002.
  • the task resource application includes the communication algorithm and the total number of computing nodes N.
  • the functions of the task management module 1001 can be referred to the description in step 801 above.
  • the resource management module 1002 receives the task resource application from the task management module 1001 and, according to the communication algorithm and the total number of computing nodes N in the task resource application, selects N computing nodes for distributed training from the multiple computing nodes in the idle state in the computing cluster. For a specific implementation, please refer to the description in step 802 of how the management node selects N computing nodes from the multiple computing nodes that are idle in the current computing cluster.
  • the resource management module 1002 receives multiple task resource applications.
  • Each task resource application includes the priority of its corresponding training task.
  • the resource management module 1002 determines, based on the priorities of the multiple training tasks, which training task to apply resources for first.
  • the resource management module 1002 may return the identifiers of the N computing nodes currently applied for to the task management module 1001.
  • the task management module 1001 then instructs the training task module 1003 to start the training task on the N computing nodes respectively.
  • the training task module 1003 can also obtain the network topology of the N computing nodes, or obtain the network topology of the computing cluster, and determine the communication plan of the N computing nodes based on the obtained network topology. For a specific implementation, please refer to the description in step 802 of how the management node determines the communication plan. Further, the training task module 1003 also determines the planning information according to the communication plan, and sends the planning information to the N computing nodes respectively.
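The interaction between the three modules can be summarized in a small, non-normative sketch: the task management module applies to the resource management module for N nodes, and the training task module then plans communication on the returned nodes. The class and method names are assumptions and the bodies are placeholders only.

```python
# Illustrative, simplified flow between the management node's modules.
class ResourceManagementModule:
    def __init__(self, idle_nodes):
        self.idle_nodes = list(idle_nodes)

    def apply(self, n: int) -> list[str]:
        allocated, self.idle_nodes = self.idle_nodes[:n], self.idle_nodes[n:]
        return allocated                      # identifiers of the N selected nodes

class TrainingTaskModule:
    def plan(self, nodes, topology) -> dict:
        return {"nodes": nodes, "paths": []}  # communication planning would go here

class TaskManagementModule:
    def __init__(self, rm, tt):
        self.rm, self.tt = rm, tt

    def handle_task(self, n: int, topology) -> dict:
        nodes = self.rm.apply(n)              # task resource application
        return self.tt.plan(nodes, topology)  # start training / plan communication

mgmt = TaskManagementModule(ResourceManagementModule([f"1.{i}" for i in range(1, 9)]),
                            TrainingTaskModule())
print(mgmt.handle_task(4, topology={}))       # {'nodes': ['1.1', ..., '1.4'], 'paths': []}
```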
  • the storage module 1004 shown in FIG. 10 is used to store computer program instructions. When the module in the management node executes the computer program instructions in the storage module 1004, it can perform actions corresponding to the module.
  • the communication module 1005 shown in Figure 10 is used for communication between any two modules in the management node. For example, the task management module 1001 sends a task resource application to the resource management module 1002 through the communication module 1005.
  • FIGS 11 and 12 are schematic structural diagrams of a possible distributed training device provided by this application.
  • These distributed training devices may be management nodes in the above method embodiments, used to implement the functions of the management nodes in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
  • the distributed training device 1100 includes an acquisition module 1101 and a processing module 1102.
  • the acquisition module 1101 is used to obtain the network topology.
  • the network topology includes the connectivity relationship between the core switch and the computing nodes in the computing cluster.
  • the computing cluster includes M groups, and each group includes one or more computing nodes; the processing module 1102 is used to determine the communication plan between N computing nodes according to the network topology; the N computing nodes are the computing nodes in the computing cluster used for distributed training of the target model; the communication plan includes multiple inter-group paths, and for each of the multiple inter-group paths: the inter-group path includes two computing nodes belonging to different groups among the N computing nodes, and the core switch used to connect the two computing nodes.
  • the inter-group path is used to transmit the data between the two computing nodes in the inter-group path; the amounts of data transmitted over the multiple inter-group paths respectively meet the preset condition; M and N are both integers greater than 2.
  • when the processing module 1102 determines the communication plan between the N computing nodes according to the network topology, it is specifically configured to determine the communication plan between the N computing nodes according to the network topology and a communication algorithm, where the communication algorithm is used in distributed training to aggregate the data obtained by the N computing nodes respectively performing training, so as to obtain the target model.
  • the acquisition module 1101 is also used to acquire a training task, which includes the total number of computing nodes N and the communication algorithm; when determining the communication plan between the N computing nodes according to the network topology, the processing module 1102 is specifically used to determine, based on the network topology, the total number of computing nodes N and the communication algorithm, the N computing nodes and the communication plan between the N computing nodes from the multiple computing nodes in the idle state in the computing cluster.
  • when the processing module 1102 determines, based on the network topology, the total number of computing nodes N and the communication algorithm, the N computing nodes and the communication plan between the N computing nodes from the multiple computing nodes in the idle state in the computing cluster, it is specifically used to:
  • determine the N computing nodes; pair computing nodes belonging to the same group among the N computing nodes and, when multiple computing nodes remain unpaired, pair those remaining computing nodes, so as to obtain N/2 node pairs;
  • determine, according to the multiple rounds of communication of the communication algorithm and the N/2 node pairs, the communication plans of the N computing nodes in the multiple rounds of communication, where, for the communication plan in any round of communication, the greater the amount of data transmitted by two computing nodes in that communication plan, the smaller the number of inter-group paths included in that communication plan; and, if it is determined that in the i-th round of communication the communication plan of the N computing nodes includes multiple inter-group paths and the amounts of data transmitted over those inter-group paths do not meet the preset condition, adjust the communication plan of the N computing nodes in the i-th round of communication, where i is a positive integer.
  • the multiple inter-group paths include a first inter-group path, and the first inter-group path includes a first computing node, a second computing node, and a first core switch;
  • the distributed training device 1100 also includes Sending module 1103;
  • the sending module 1103 is configured to send the first information to the first computing node and the second computing node respectively, wherein the first information indicates that the first inter-group path is used for the first computing node to send first data to the second computing node.
  • the communication plan further includes multiple intra-group paths; the multiple intra-group paths include a first intra-group path, and the first intra-group path includes a first computing node, a third computing node, and a first access switch;
  • the distributed training device 1100 also includes a sending module 1103; the sending module 1103 is configured to send second information to the first computing node and the third computing node respectively, wherein the second information indicates that the first intra-group path is used for the first computing node to send second data to the third computing node.
  • some functional modules may be deployed in the computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • for example, the acquisition module 1101 and the sending module 1103 are deployed in the computing nodes of the computing cluster, and the processing module 1102 is deployed in an external node independent of the computing cluster; or the acquisition module 1101 is deployed in the computing nodes of the computing cluster, and the processing module 1102 and the sending module 1103 are deployed in an external node independent of the computing cluster; or other arrangements are used, which this application will not enumerate one by one.
  • the acquisition module 1101, the processing module 1102 and the sending module 1103 can all be implemented by software, or can be implemented by hardware.
  • the processing module 1102 is taken as an example to introduce the implementation of the processing module 1102.
  • the implementation of the acquisition module 1101 and the sending module 1103 can refer to the implementation of the processing module 1102.
  • the processing module 1102 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container.
  • the above computing instance may be one or more.
  • processing module 1102 may include code running on multiple hosts/virtual machines/containers.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same region (region) or in different regions.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same availability zone (AZ) or in different AZs.
  • each AZ includes one data center or multiple geographically close data centers; usually, a region can include multiple AZs.
  • the multiple hosts/VMs/containers used to run the code can be distributed in the same virtual private cloud (VPC), or across multiple VPCs.
  • for communication between two VPCs in the same region, or cross-region communication between VPCs in different regions, a communication gateway needs to be set up in each VPC, and the interconnection between the VPCs is realized through the communication gateways.
  • the processing module 1102 may include at least one computing device, such as a server.
  • the processing module 1102 may also be a device implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like.
  • the above-mentioned PLD can be implemented by a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL), or any combination thereof.
  • Multiple computing devices included in the processing module 1102 may be distributed in the same region or in different regions. Multiple computing devices included in the processing module 1102 may be distributed in the same AZ or in different AZs. Similarly, multiple computing devices included in the processing module 1102 may be distributed in the same VPC or in multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the processing module 1102 can be used to perform any step in the method of Figure 7 or Figure 8
  • the acquisition module 1101 can be used to perform any step in the method of Figure 7 or Figure 8
  • the sending module 1103 can be used to perform any step in the method of Figure 7 or Figure 8.
  • the steps that the processing module 1102, the acquisition module 1101 and the sending module 1103 are respectively responsible for implementing can be specified as needed, so that the processing module 1102, the acquisition module 1101 and the sending module 1103 respectively implement different steps in the method of Figure 7 or Figure 8, thereby realizing all functions of the distributed training device 1100.
  • the functions of the acquisition module 1101 and the sending module 1103 are included in the functions of the communication module 1005 shown in Figure 10, that is, the communication module 1005 has the functions of the acquisition module 1101 and the sending module 1103; the processing module 1102 has the functions of the task management module 1001, the resource management module 1002 and the training task module 1003 shown in Figure 10.
  • the descriptions of Figure 10 and Figure 11 can be mutually referenced or cited.
  • some functional modules among the task management module 1001, the resource management module 1002, the training task module 1003, the storage module 1004, and the communication module 1005 are deployed in the computing nodes of the computing cluster, and the remaining functional modules are deployed in external nodes independent of the computing cluster.
  • Figure 12 shows a distributed training device 1200 provided by an embodiment of the present application.
  • the distributed training device shown in Figure 12 can be an implementation of a hardware circuit of the device shown in Figure 11.
  • the device can be adapted to the flow chart shown above to perform the functions of the management node in the above method embodiment.
  • FIG. 12 only shows the main components of the distributed training device 1200.
  • the distributed training device 1200 includes: a bus 102, a processor 104, a memory 106 and a communication interface 108.
  • the processor 104, the memory 106 and the communication interface 108 communicate through the bus 102.
  • the distributed training device 1200 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the distributed training device 1200.
  • the bus 102 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 12, but it does not mean that there is only one bus or one type of bus.
  • the bus 102 may include a path for communicating information between various components of the distributed training device 1200 (e.g., the memory 106, the processor 104, the communication interface 108).
  • the processor 104 may include a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP). any one or more of them.
  • Memory 106 may include volatile memory, such as random access memory (RAM).
  • the memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 106 stores executable program code, and the processor 104 executes the executable program code to respectively implement the functions of the aforementioned acquisition module 1101, processing module 1102 or sending module 1103, thereby implementing the distributed training method. That is, the memory 106 stores instructions for executing the above-mentioned distributed training method.
  • the communication interface 108 uses transceiver modules such as, but not limited to, network interface cards and transceivers to implement communication between the distributed training device 1200 and other devices or communication networks.
  • the memory 106 has the function of the storage module 1004 shown in Figure 10
  • the processor 104 has the functions of the task management module 1001, the resource management module 1002 and the training task module 1003 shown in Figure 10;
  • the bus 102 and the communication interface 108 have the function of the communication module 1005 shown in Figure 10.
  • Figures 10, 11 and 12 can be referred to or quoted from each other.
  • the computer-readable storage medium may be any available medium that can be stored by a computing device, or a data storage device, such as a data center, containing one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to perform the method in the relevant embodiment of FIG. 7 or FIG. 8 .
  • embodiments of the present application provide a computer program product.
  • when a computing device reads and executes the computer program product, the computing device implements the method in the above-mentioned embodiment of FIG. 7 or FIG. 8.


Abstract

一种分布式训练方法、系统及装置,用于解决现有技术中交换机传输链路拥塞,导致传输数据较慢的问题。方法包括:管理节点获取网络拓扑,其中,网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,随后,管理节点根据网络拓扑,确定N个计算节点之间的通信规划;其中,N个计算节点是计算集群中用于分布式训练目标模型的计算节点;通信规划包括多条组间路径,对于多条组间路径中的每条组间路径:组间路径包括N个计算节点中、属于不同分组的两个计算节点,以及用于连通两个计算节点的核心交换机,组间路径用于传输组间路径中两个计算节点之间的数据;多条组间路径分别传输的数据量符合预设条件;M和N均为大于2的整数。

Description

一种分布式训练方法、系统及装置
相关申请的交叉引用
本申请要求在2022年06月29日提交中国专利局、申请号为202210756779.4、申请名称为“一种分布式训练方法、系统及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算技术领域,尤其涉及一种分布式训练方法、系统及装置。
背景技术
深度学习(deep learning)是一类基于深层次神经网络算法的机器学习技术,深度学习主要应用于人工智能(artificial intelligence,AI)领域的感知、决策等场景,例如,图像和语音识别、自然语言翻译、计算机博弈等。
分布式训练指的是多个计算节点(worker)联合训练同一个模型。任两个计算节点(即一对计算节点)可通过多层交换机连通,以使得该两个计算节点之间相互传输中间数据(如权重梯度)。其中,某层交换机在向上一层交换机传输数据时,可根据负载均衡原理从上一层的多个交换机中选择一个交换机,并将数据传输给该选择出的上一层交换机。但是该上一层的交换机若接收到其下一层的多个交换机的数据,则该上一层的交换机可能存在传输链路拥塞,如此,将会导致传输数据较慢的问题。
发明内容
本申请提供一种分布式训练方法、系统及装置,用于提高数据传输速度。
第一方面,本申请提供一种分布式训练方法,适用于包括计算集群和核心交换机的分布式训练系统中,该方法由管理节点执行。
管理节点是独立于计算集群的外部节点,该外部节点与计算集群中的多个计算节点分别连接,以用于管理计算集群中的各个计算节点。在一个具体实现中,管理节点比如是计算机,或者计算机中的模块,比如插件。
又或者,管理节点是计算集群中的计算节点,该计算节点与计算集群中其他的多个计算节点分别连接,不仅具备管理计算集群中的该其他的多个计算节点的能力,还具备其他计算节点的计算能力。在一个具体实现中,管理节点比如是物理服务器,物理服务器中包括一个或多个计算单元(或称为处理单元),计算单元比如是图形处理器(graphics processing unit,GPU)、中央处理器(central processing unit,CPU)、神经网络加速器(neural-network processing unit,NPU)等。
又或者,管理节点中包括多个功能模块,多个功能模块中的部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。
分布式训练方法中包括:管理节点获取网络拓扑,其中,网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,进一步的,计算集群中包括M个分组,每个分组中包 括一个或多个计算节点。随后,管理节点根据网络拓扑,确定N个计算节点之间的通信规划;其中,N个计算节点是计算集群中用于分布式训练目标模型的计算节点;通信规划包括多条组间路径,对于多条组间路径中的每条组间路径:组间路径包括N个计算节点中、属于不同分组的两个计算节点,以及用于连通两个计算节点的核心交换机,组间路径用于传输组间路径中两个计算节点之间的数据;多条组间路径分别传输的数据量符合预设条件;M和N均为大于2的整数。
上述技术方案中,管理节点根据网络拓扑,确定N个计算节点在分布式训练的数据聚合过程中的通信规划,以实现通信规划包括的多条组间路径分别传输的数据量符合预设条件,从而避免该N个计算节点在进行数据聚合时,出现某个核心交换机在组间传输方式中需要传输较多的数据量,导致核心交换机出现传输链路拥塞的问题,如此,有助于提高数据传输速度,从而进一步提高分布式训练的速度。
在一种可能的实现方式中,管理节点根据网络拓扑,确定N个计算节点之间的通信规划,具体是:管理节点根据网络拓扑和通信算法,确定N个计算节点之间的通信规划;其中,通信算法用于在分布式训练中聚合N个计算节点分别执行训练得到的数据,以得到目标模型。通信算法比如是ring(环)算法、halving-doubling(减半-加倍,HD)算法、binary tree(二叉树)算法等。
上述技术方案中,管理节点基于不同通信算法的原理,结合网络拓扑,确定N个计算节点之间的通信规划,有助于实现N个计算节点更高效的执行分布式训练。
在一种可能的实现方式中,多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;多条组间路径分别传输的数据量符合预设条件,包括:多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。在一种可能的实现方式中,在每条组间路径包括多级核心交换机时,差值小于阈值的任两个流量端口所属的核心交换机属于同一级。
上述技术方案中,管理节点确定的通信规划用于实现多条组间路径所经过的多个核心交换机的流量端口中流量的负载均衡,从而避免某个核心交换机在数据传输时存在较为严重的流量拥堵,保证整个分布式训练中各条组间路径所传输数据的均衡。
在一种可能的实现方式中,对于多条组间路径中的任两条组间路径:两条组间路径分别包含有不同的核心交换机,或者,两条组间路径包含相同的核心交换机,且核心交换机在两条组间路径中的流量端口不同。如此,实现多条组间路径所经过的流量端口均不重叠,避免某个核心交换机的某个流量端口需要传输多条组间路径中的数据,进而避免出现流量端口的堵塞,有助于提高数据传输速度。
在一种可能的实现方式中,网络拓扑包括核心交换机、计算集群,以及接入交换机的连通关系;对于多条组间路径中的每条组间路径:组间路径中还包括两个计算节点分别对应的两个接入交换机,组间路径中每个计算节点通过计算节点对应的接入交换机与核心交换机连通。如上,提供一种计算节点与核心交换机连通的实现方式。
在一种可能的实现方式中,通信规划中还包括多条组内路径,每条组内路径中包括N个计算节点中、属于同一个分组的两个计算节点,以及分组对应的接入交换机,组内路径用于传输组内路径中两个计算节点之间的数据。在一种可能的实现方式中,组内路径中两个计算节点之间传输数据的数据量,大于组间路径中两个计算节点之间传输数据的数据量。
上述技术方案中,管理节点确定的通信规划中,不仅包括多条组间路径,还包括多条组内路径,组内路径的数据传输性能优于组间路径的数据传输性能,如此,管理节点可规划组间路径用于传输数据量较少的数据,组内路径用于传输数据量较多的数据,以实现较为高效的数据传输且避免组间路径中核心交换机端口的拥塞,提高分布式训练的速度。
在一种可能的实现方式中,M个分组分别对应于M个接入交换机;针对M个接入交换机中每个接入交换机:接入交换机包括K个第一端口、K个第一端口分别对应的K个第二端口;K个第一端口分别与K个核心交换机连接;K个第二端口分别与接入交换机对应的分组中计算节点的K个端口连接;K为大于2的整数。
如此,接入交换机不仅能够连通任一个核心交换机和该接入交换机对应分组中的任一个计算节点,还能够连通该接入交换机对应分组中的任两个计算节点,从而实现整个计算集群中任两个计算节点可以相互连通,并分布式训练目标模型。
在一种可能的实现方式中,管理节点在根据网络拓扑,确定N个计算节点之间的通信规划时,具体是,管理节点获取训练任务,其中,该训练任务包括计算节点总数N和通信算法;管理节点再根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和该N个计算节点之间的通信规划。上述技术方案中,用户向管理节点下发训练任务,并在训练任务中包括用户所需的参数,即计算节点总数N和通信算法,如此,能够更好地满足用户对分布式训练的需求。
在一种可能的实现方式中,管理节点在根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和该N个计算节点之间的通信规划时,具体是,管理节点根据网络拓扑和计算节点总数N,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点;将N个计算节点中、属于同一个分组的两个计算节点配对,以及在剩余尚未配对的多个计算节点时,将尚未配对的多个计算节点配对,以得到的N/2个节点对;根据通信算法的多轮通信和N/2个节点对,确定N个计算节点分别在多轮通信中的通信规划;对于任一轮通信中的通信规划,通信规划中两个计算节点所传输的数据量越大,通信规划包括的组间路径数越小;若确定在多轮通信中的第i轮通信中,N个计算节点的通信规划中包括多条组间路径,且多条组间路径分别传输的数据量不符合预设条件,则调整第i轮通信中N个计算节点的通信规划,i为正整数。
上述技术方案中,管理节点先从计算集群中选择N个计算节点,再对N个计算节点进行通信规划,如此,有助于降低通信规划过程中的计算量。进一步的,管理节点先对N个计算节点进行配对,然后根据配对之后的多个节点对以及通信算法的多轮通信,确定该N个计算节点在每轮通信中的通信规划,如此,有助于实现每轮通信中的多条组间路径分别传输的数据量符合预设条件,进一步提高每轮通信中数据传输的效率。
在一种可能的实现方式中,多条组间路径中包括第一组间路径,第一组间路径包括第一计算节点、第二计算节点和第一核心交换机。管理节点在确定N个计算节点之间的通信规划之后,还根据通信规划,分别向第一计算节点和第二计算节点发送第一信息;其中,第一信息指示第一组间路径用于第一计算节点向第二计算节点发送第一数据。相应的,第一计算节点和第二计算节点可分别根据该第一信息,通过第一组间路径传输第一数据。
在一种可能的实现方式中,多条组内路径中包括第一组内路径,第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;管理节点在确定N个计算节点之间的通信规划之后,还根据通信规划,分别向第一计算节点和第三计算节点发送第二信息;其中, 第二信息指示第一组内路径用于第一计算节点向第三计算节点发送第二数据。相应的,第一计算节点和第三计算节点可分别根据该第二信息,通过第一组内路径传输第一数据。
第二方面,本申请提供一种分布式训练系统,该分布式训练系统中包括:K个核心交换机和计算集群,其中,计算集群中包括M个分组,每个分组中包括一个或多个计算节点;
K个核心交换机,用于连通M个分组中位于不同分组的计算节点。
进一步的,分布式训练系统中包括管理节点。
管理节点是独立于计算集群的外部节点,该管理节点与计算集群中的多个计算节点分别连接,以用于管理计算集群中的各个计算节点。在一个具体实现中,管理节点比如是计算机,或者计算机中的模块,比如插件。
又或者,管理节点是计算集群中的计算节点,该计算节点与计算集群中其他的多个计算节点分别连接,不仅具备管理计算集群中的该其他的多个计算节点的能力,还具备其他计算节点的计算能力。在一个具体实现中,管理节点比如是物理服务器,物理服务器中包括一个或多个计算单元(或称为处理单元),计算单元比如是GPU、CPU、NPU等。
又或者,管理节点中包括多个功能模块,多个功能模块中的部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。
管理节点,用于获取网络拓扑,根据网络拓扑,确定N个计算节点之间的通信规划,网络拓扑包括K个核心交换机和计算集群中的计算节点的连通关系,其中,该N个计算节点是计算集群中用于分布式训练目标模型的计算节点;
其中,通信规划包括多条组间路径,对于多条组间路径中的每条组间路径:组间路径包括N个计算节点中、属于不同分组的两个计算节点,以及K个核心交换机中用于连通两个计算节点的核心交换机,组间路径用于传输组间路径中两个计算节点之间的数据;
多条组间路径分别传输的数据量符合预设条件;
K、M和N均为大于2的整数。
在一种可能的实现方式中,管理节点在根据网络拓扑,确定N个计算节点之间的通信规划时,具体用于:根据网络拓扑和通信算法,确定N个计算节点之间的通信规划;通信算法用于在分布式训练中聚合N个计算节点分别执行训练得到的数据,以得到目标模型。
在一种可能的实现方式中,多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;多条组间路径分别传输的数据量符合预设条件,包括:多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。
在一种可能的实现方式中,在每条组间路径包括多级核心交换机时,差值小于阈值的任两个流量端口所属的核心交换机属于同一级。
在一种可能的实现方式中,分布式训练系统中还包括:分别与M个分组对应的M个接入交换机;M个接入交换机中任一个接入交换机用于连通接入交换机对应分组中的计算节点和K个核心交换机;网络拓扑包括K个核心交换机、M个接入交换机和计算集群中的计算节点的连通关系;对于多条组间路径中的每条组间路径:组间路径中还包括两个计算节点所属分组分别对应的两个接入交换机。
在一种可能的实现方式中,通信规划中还包括多条组内路径,每条组内路径中包括N个计算节点中、属于同一个分组的两个计算节点,以及M个接入交换机中该分组对应的接 入交换机,组内路径用于传输组内路径中两个计算节点之间的数据。
在一种可能的实现方式中,组内路径中两个计算节点之间传输数据的数据量,大于组间路径中两个计算节点之间传输数据的数据量。
在一种可能的实现方式中,多条组间路径中包括第一组间路径,第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;管理节点还用于:根据通信规划,分别向第一计算节点和第二计算节点发送第一信息,第一信息指示第一组间路径用于第一计算节点向第二计算节点发送第一数据;第一计算节点,用于根据第一信息,向第一核心交换机发送第一数据;第一核心交换机,用于接收来自第一计算节点的第一数据,将第一数据转发至第二计算节点;第二计算节点,用于根据第一信息,接收来自第一核心交换机的第一数据。
在一种可能的实现方式中,第一组间路径中还包括第一节点对应的第一接入交换机,和第二节点对应的第二接入交换机。其中,第一计算节点,具体用于根据第一信息,向第一接入交换机发送第一数据;第一接入交换机用于接收来自第一计算节点的第一数据,向第一核心交换机发送第一数据;第一核心交换机,具体用于接收来自第一接入交换机的第一数据,将第一数据转发至第二接入交换机;第二接入交换机用于接收来自第一核心交换机的第一数据,向第二计算节点发送第一数据;第二计算节点,具体用于根据第一信息,接收第二接入交换机的第一数据。
在一种可能的实现方式中,多条组内路径中包括第一组内路径,第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;管理节点还用于:根据通信规划,分别向第一计算节点和第三计算节点发送第二信息,第二信息指示第一组内路径用于第一计算节点向第三计算节点发送第二数据;相应的,第一计算节点,用于根据第二信息,向第一接入交换机发送第二数据;第一接入交换机,用于将第二数据转发至第三计算节点;第三计算节点,用于根据第一信息,接收来自第一接入交换机的第二数据。
第三方面,本申请提供一种分布式训练装置,该装置具体是管理节点。
管理节点是独立于计算集群的外部节点,该外部节点与计算集群中的多个计算节点分别连接,以用于管理计算集群中的各个计算节点。在一个具体实现中,管理节点比如是计算机,或者计算机中的模块,比如插件。
又或者,管理节点是计算集群中的计算节点,该计算节点与计算集群中其他的多个计算节点分别连接,不仅具备管理计算集群中的该其他的多个计算节点的能力,还具备其他计算节点的计算能力。在一个具体实现中,管理节点比如是物理服务器,物理服务器中包括一个或多个计算单元(或称为处理单元),计算单元比如是GPU、CPU、NPU等。
又或者,管理节点中包括多个功能模块,多个功能模块中的部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。
分布式训练装置包括:
获取模块,用于获取网络拓扑,网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,计算集群中包括M个分组,每个分组中包括一个或多个计算节点;
处理模块,用于根据网络拓扑,确定N个计算节点之间的通信规划;其中,N个计算节点是计算集群中用于分布式训练目标模型的计算节点;通信规划包括多条组间路径,对于多条组间路径中的每条组间路径:组间路径包括N个计算节点中、属于不同分组的两个 计算节点,以及用于连通两个计算节点的核心交换机,组间路径用于传输组间路径中两个计算节点之间的数据;多条组间路径分别传输的数据量符合预设条件;
M和N均为大于2的整数。
在一种可能的实现方式中,处理模块在根据网络拓扑,确定N个计算节点之间的通信规划时,具体用于:根据网络拓扑和通信算法,确定N个计算节点之间的通信规划;通信算法用于在分布式训练中聚合N个计算节点分别执行训练得到的数据,以得到目标模型。
在一种可能的实现方式中,多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;多条组间路径分别传输的数据量符合预设条件,包括:多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。
在一种可能的实现方式中,在每条组间路径包括多级核心交换机时,差值小于阈值的任两个流量端口所属的核心交换机属于同一级。
在一种可能的实现方式中,对于多条组间路径中的任两条组间路径:两条组间路径分别包括有不同的核心交换机,或者,两条组间路径包含相同的核心交换机,且核心交换机在两条组间路径中的流量端口不同。
在一种可能的实现方式中,网络拓扑包括核心交换机、计算集群,以及接入交换机的连通关系;对于多条组间路径中的每条组间路径:组间路径中还包括两个计算节点分别对应的两个接入交换机,组间路径中每个计算节点通过该计算节点对应的接入交换机与核心交换机连通。
在一种可能的实现方式中,通信规划中还包括多条组内路径,每条组内路径中包括N个计算节点中、属于同一个分组的两个计算节点,以及分组对应的接入交换机,组内路径用于传输组内路径中两个计算节点之间的数据。
在一种可能的实现方式中,组内路径中两个计算节点之间传输数据的数据量,大于组间路径中两个计算节点之间传输数据的数据量。
在一种可能的实现方式中,M个分组分别对应于M个接入交换机;针对M个接入交换机中每个接入交换机:接入交换机包括K个第一端口、K个第一端口分别对应的K个第二端口;K个第一端口分别与K个核心交换机连接;K个第二端口分别与接入交换机对应的分组中计算节点的K个端口连接;K为大于2的整数。
在一种可能的实现方式中,获取模块还用于:获取训练任务,训练任务包括计算节点总数N和通信算法;处理模块在根据网络拓扑,确定N个计算节点之间的通信规划时,具体用于:根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和该N个计算节点之间的通信规划。
在一种可能的实现方式中,处理模块在根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和N个计算节点之间的通信规划时,具体用于:根据网络拓扑和计算节点总数N,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点;将N个计算节点中、属于同一个分组的两个计算节点配对,以及在剩余尚未配对的多个计算节点时,将尚未配对的多个计算节点配对,以得到的N/2个节点对;根据通信算法的多轮通信和N/2个节点对,确定N个计算节点分别在多轮通信中的通信规划;其中,对于任一轮通信中的通信规划,通信规划中两个计算节点所传输的数据量越大,通信规划中包括的组间路径数越小;若确定在多轮通信中的第i轮 通信中,N个计算节点的通信规划中包括多条组间路径,且多条组间路径分别传输的数据量不符合预设条件,则调整第i轮通信中N个计算节点的通信规划,i为正整数。
在一种可能的实现方式中,多条组间路径中包括第一组间路径,第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;装置还包括发送模块;发送模块用于:分别向第一计算节点和第二计算节点发送第一信息;其中,第一信息指示第一组间路径用于第一计算节点向第二计算节点发送第一数据。
在一种可能的实现方式中,多条组内路径中包括第一组内路径,第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;装置还包括发送模块;发送模块用于:分别向第一计算节点和第三计算节点发送第二信息;其中,第二信息指示第一组内路径用于第一计算节点向第三计算节点发送第二数据。
第四方面,本申请实施例提供一种计算设备,包括处理器,处理器与存储器相连,存储器用于存储计算机程序,处理器用于执行存储器中存储的计算机程序,以使得计算设备执行上述第一方面或第一方面的任一种可能的实现方式中的方法。
第五方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序或指令,当计算机程序或指令被计算设备执行时,实现上述第一方面或第一方面的任一种可能的实现方式中的方法。
第六方面,本申请实施例提供一种计算机程序产品,当计算机读取并执行计算机程序产品时,使得计算机执行上述第一方面或第一方面的任一种可能的实现方式中的方法。
上述第二方面至第六方面中任一方面可以达到的技术效果还可以参照上述第一方面中有益效果的描述,此处不再重复赘述。
附图说明
图1为一种神经网络的结构示意图;
图2为一种随机梯度下降方法的示意图;
图3为一种基于HD算法进行数据聚合的示意图;
图4为本申请示例性提供的一种分布式训练系统的示意图;
图5a为本申请示例性提供的一种分布式训练系统中接口连接关系的示意图;
图5b为本申请示例性提供的再一种分布式训练系统中接口连接关系的示意图;
图6为本申请示例性提供的又一种分布式训练系统的架构示意图;
图7为本申请示例性提供的一种分布式训练方法的流程示意图;
图8为本申请示例性提供的一种管理节点确定通信规划的流程示意图;
图9为本申请示例性提供的一种基于HD算法的通信关系图;
图10为本申请示例性提供的一种管理节点的结构示意图;
图11为本申请示例性提供的一种分布式训练装置的结构示意图;
图12为本申请示例性提供的再一种分布式训练装置的结构示意图。
具体实施方式
为了更好的解释本申请实施例,先对本申请中的相关术语或技术解释:
一、神经网络
神经网络（neural networks，NN）是一种模仿生物神经网络（动物的中枢神经系统，特别是大脑）的结构和功能的数学模型或计算模型。神经网络由大量的神经元联结进行计算。一个神经网络可以包括多种不同功能的神经网络层，每层可表达为函数y=f_w(x)，其中，f为函数的功能，w为权重，x为输入，y为输出。
图1为一种神经网络的结构示意图，该神经网络中可包括有首尾相连的m层，m为大于或等于2的整数。神经网络的第1层可表达为函数f_0，f_0的输入是x，输出是y_0，权重是w_0；神经网络的第2层可表达为函数f_1，f_1的输入是y_0，输出是y_1，权重是w_1等。
二、模型训练
假设存在数据集合{(x_0,l_0),…,(x_(n-1),l_(n-1))}，其中x_0,…,x_(n-1)是n个输入，而对应的l_0,…,l_(n-1)分别是这n个输入的期望输出，通常也称为标签（label）。每个(x_j,l_j)称为一个样本数据，j取遍[0,n-1]中的整数。
将该数据集合中的任一个输入（可表示为x_j）输入至如图1的神经网络中，可得到神经网络的实际输出，比如表示为y_j^(m-1)。根据神经网络的实际输出y_j^(m-1)、期望输出l_j，以及损失函数L算出损失（loss）。
模型训练的目标是，求解w_0,…,w_(m-1)，以使得在损失函数L下，y_j^(m-1)和l_j最为接近。其中，求解过程可参见图2示例性示出的随机梯度下降（stochastic gradient descent，SGD）方法，先通过loss和y_(m-1)确定第m-1层的梯度Δy_(m-1)；根据Δy_(m-1)和w_(m-1)确定第m-1层梯度Δw_(m-1)；再通过Δy_(m-1)和y_(m-2)确定第m-2层的梯度Δy_(m-2)；根据Δy_(m-2)和w_(m-2)确定第m-2层梯度Δw_(m-2)；以此类推，得到每一层Δy和Δw，即得到Δy_0、Δw_0、……、Δy_(m-1)、Δw_(m-1)。
三、分布式训练
在模型训练中,由于模型训练所需训练数据的数据量过大,或者模型本身的计算量较大,为了更高效地、快速地训练模型,可通过计算集群包括的多个计算节点来共同训练模型,该模型训练的方式可称为是分布式模型训练或分布式训练。
其中,计算节点可包括一个或多个计算单元,计算单元比如是GPU、CPU、NPU。
具体的,将训练数据的数据集合切分为多个计算节点分别对应的多个数据子集,其中数据子集的尺寸比如是批尺寸(batch size),或者迷你批尺寸(mini batch size)。在分布式训练的每一轮迭代中,多个计算节点将各自对应的数据子集输入至本地的神经网络中,以得到各自神经网络的实际输出,进而根据各自神经网络的实际输出、期望输出和损失函数,确定各自神经网络的第m-1层对应的权重梯度。随后,该多个计算节点进行数据聚合,并根据聚合之后的中间数据进行下一轮的迭代,
可将对数据集合进行切分的分布式训练方式,称为是数据并行的训练方式。分布式训练方式还包括模型并行的训练方式,具体的,对模型切分得到多个子模型,该多个子模型分别由各自对应的计算节点运行。在模型并行的训练方式的每轮训练迭代中,该多个计算节点同样进行上述类似的数据聚合,以得到下一轮模型训练中的输入。
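为便于理解上述数据并行训练中“各计算节点先在本地数据子集上得到权重梯度、再聚合”的过程，下面给出一个示意性的Python草图，其中的函数名与数值均为本文档之外的假设，仅为说明用途，并非本申请限定的实现：
```python
# 示意性草图（假设）：用纯Python模拟数据并行训练中一轮迭代的梯度聚合。
# 每个计算节点先在本地数据子集上得到本地权重梯度（此处用示意数值代替真实的反向传播结果），
# 再将所有节点的梯度逐元素累加并取平均，作为各节点下一轮迭代共同使用的聚合结果。

def aggregate_gradients(local_grads_per_node):
    """local_grads_per_node：每个计算节点的本地梯度列表（各列表长度相同）。"""
    num_nodes = len(local_grads_per_node)
    aggregated = [0.0] * len(local_grads_per_node[0])
    for grads in local_grads_per_node:
        for i, g in enumerate(grads):
            aggregated[i] += g                    # 逐元素累加各节点的梯度
    return [g / num_nodes for g in aggregated]    # 取平均后由各节点更新本地模型副本

if __name__ == "__main__":
    local_grads = [                               # 4个计算节点各自得到的示意梯度
        [0.1, 0.2, 0.3],
        [0.3, 0.1, 0.2],
        [0.2, 0.3, 0.1],
        [0.2, 0.2, 0.2],
    ]
    print(aggregate_gradients(local_grads))       # 各元素约为0.2
```
实际的分布式训练中，上述聚合通常通过下文介绍的聚合通信算法（如ring、HD算法）由多个计算节点相互交换数据完成，而非由单点集中计算。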
四、聚合通信(collective communication)
在分布式训练过程的每一轮迭代中,多个计算节点需要将各自执行模型训练得到的中 间数据进行聚合,根据聚合之后的中间数据进行下一轮的迭代。经过多轮迭代以得到最终的模型(记为目标模型)。
中间数据可包括计算节点执行各自模型训练得到的特征(feature或activation)、梯度和模型参数中的一项或多项。其中,特征比如是经模型学习到的训练数据的特征,模型参数比如是神经网络中函数f的参数w等,梯度比如是后向传播中产生的wj的差值Δwj等。如下为方便描述,可将中间数据均简称为数据。
具体的,多个计算节点可通过聚合通信的方式,来完成数据聚合。其中,聚合通信所采用的聚合算法(或称为通信算法)比如是ring(环)算法、halving-doubling(减半-加倍,HD)算法、binary tree(二叉树)算法等。
如下以HD算法为例说明:
比如4个计算节点用于进行分布式训练,其中,4个计算节点分别表示为计算节点a至计算节点d。在一轮迭代中,该4个计算节点通过HD算法进行数据聚合。各计算节点将各自的数据划分为4份。具体的,计算节点a包括数据a1至a4,计算节点b包括数据b1至b4,计算节点c包括数据c1至c4,计算节点d包括数据d1至d4。
如图3为本申请示出的4个计算节点通过HD算法进行数据聚合的例子,HD算法包括reduce-scatter和allgather两个部分。
HD算法的reducescatter中包括如下步骤1和步骤2:
在步骤1中:
4个计算节点组成的2个节点对,分别是(计算节点a和计算节点b)、(计算节点c和计算节点d)。其中,节点对中两个计算节点相互交换数据。
以(计算节点a和计算节点b)为例,计算节点a与计算节点b交换数据,具体是,计算节点a向计算节点b发送数据a1和a2,计算节点b向计算节点a发送数据b3和b4。相应的,计算节点a中包括数据:a1、a2、a3+b3、a4+b4;计算节点b中包括数据:a1+b1、a2+b2、b3、b4。(计算节点c和计算节点d)交换数据的方式,与(计算节点a和计算节点b)类似,具体可参见图3中步骤1。
在步骤2中:
4个计算节点组成的2个节点对,分别是(计算节点a和计算节点c)、(计算节点b和计算节点d)。其中,节点对中两个计算节点相互交换数据。
以(计算节点a和计算节点c)为例,计算节点a与计算节点c交换数据,具体是,计算节点a向计算节点c发送数据a3+b3,计算节点c向计算节点a发送数据c4+d4。相应的,计算节点a中包括数据:a1、a2、a3+b3、a4+b4+c4+d4;计算节点c中包括数据:c1、c2、a3+b3+c3+d3、c4+d4。(计算节点b和计算节点d)交换数据的方式,与(计算节点a和计算节点c)类似,具体可参见图3中步骤2。
HD算法的allgather中包括如下步骤3和步骤4:
在步骤3中:
4个计算节点组成的2个节点对,分别是(计算节点a和计算节点c)、(计算节点b和计算节点d)。其中,节点对中两个计算节点相互交换数据。
以(计算节点a和计算节点c)为例,计算节点a与计算节点c交换数据,具体是,计算节点a向计算节点c发送数据a4+b4+c4+d4,计算节点c向计算节点a发送数据 a3+b3+c3+d3。相应的,计算节点a中包括数据:a1、a2、a3+b3+c3+d3、a4+b4+c4+d4;计算节点c中包括数据:c1、c2、a3+b3+c3+d3、a4+b4+c4+d4。(计算节点b和计算节点d)交换数据的方式,与(计算节点a和计算节点c)类似,具体可参见图3中步骤3。
在步骤4中:
4个计算节点组成的2个节点对,分别是(计算节点a和计算节点b)、(计算节点c和计算节点d)。其中,节点对中两个计算节点相互交换数据。
以(计算节点a和计算节点b)为例,计算节点a与计算节点b交换数据,具体是,计算节点a向计算节点b发送数据a3+b3+c3+d3、a4+b4+c4+d4,计算节点b向计算节点a发送数据a1+b1+c1+d1、a2+b2+c2+d2。相应的,计算节点a中包括数据:a1+b1+c1+d1、a2+b2+c2+d2、a3+b3+c3+d3、a4+b4+c4+d4;计算节点b中包括数据:a1+b1+c1+d1、a2+b2+c2+d2、a3+b3+c3+d3、a4+b4+c4+d4。(计算节点c和计算节点d)交换数据的方式,与(计算节点a和计算节点b)类似,具体可参见图3中步骤4。
如此,计算节点a、计算节点b、计算节点c和计算节点d中每个计算节点都获取到a1+b1+c1+d1、a2+b2+c2+d2、a3+b3+c3+d3、a4+b4+c4+d4。
在上述HD算法的每个步骤中,对所有计算节点进行配对得到多个节点对(即每个步骤对应于一次计算节点的配对),每个节点对中的两个计算节点交换数据。进一步的,可假设计算节点a至计算节点d在实际部署中为依次排列,且任两个相邻计算节点之间的距离为固定值,即计算节点a与计算节点b,计算节点b与计算节点c,计算节点c与计算节点d之间的距离均为该固定值。也可以理解,计算节点a距离计算节点d最远,计算节点a距离计算节点b最近等。可根据该4个计算节点的距离确定每个步骤中计算节点的配对。
具体的,在reduce-scatter中,设置配对的两个计算节点之间的距离逐渐增加,而传输的数据量逐渐减少。比如在步骤1中,计算节点a和计算节点b配对,在步骤2中,计算节点a和计算节点c配对,计算节点a和计算节点b之间的距离是计算节点a和计算节点c之间的距离的一半,计算节点a和计算节点b之间传输的数据量是计算节点a和计算节点c之间传输的数据量的两倍。在allgather中,设置配对的两个计算节点之间的距离逐渐减少,而传输的数据量逐渐增大。比如在步骤3中,计算节点a和计算节点c配对,在步骤4中,计算节点a和计算节点b配对,计算节点a和计算节点c之间的距离是计算节点a和计算节点b之间的距离的两倍,计算节点a和计算节点c之间传输的数据量是计算节点a和计算节点b之间传输的数据量的一半。可以理解的是,reduce-scatter中步骤1和步骤2,与allgather中步骤3和步骤4相反,步骤1中的节点对可与步骤4中节点对相同,步骤2中的节点对可与步骤3中节点对相同。
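上述“配对距离逐步加倍（或减半）、传输数据量相应减半（或加倍）”的规律，可以用节点序号的异或运算近似描述。下面给出一个示意性的Python草图（假设节点数为2的幂，函数名与输出格式均为说明用途的假设，并非本申请限定的配对方式）：
```python
# 示意性草图（假设）：按HD（halving-doubling）算法推导每个步骤中各节点的配对对象，
# 以及该步骤中每个节点需要发送的数据比例。节点用0..N-1的序号表示，N为2的幂。
import math

def hd_schedule(num_nodes):
    steps = int(math.log2(num_nodes))
    schedule = []
    # reduce-scatter：配对距离逐步加倍，每步交换的数据量逐步减半
    for s in range(steps):
        pairs = sorted({tuple(sorted((r, r ^ (1 << s)))) for r in range(num_nodes)})
        schedule.append(("reduce-scatter", pairs, 1.0 / (2 ** (s + 1))))
    # allgather：与reduce-scatter相反，配对距离逐步减半，数据量逐步加倍
    for s in reversed(range(steps)):
        pairs = sorted({tuple(sorted((r, r ^ (1 << s)))) for r in range(num_nodes)})
        schedule.append(("allgather", pairs, 1.0 / (2 ** (s + 1))))
    return schedule

if __name__ == "__main__":
    # 对应图3中4个计算节点（a、b、c、d记为0、1、2、3）的4个步骤
    for phase, pairs, fraction in hd_schedule(4):
        print(phase, pairs, f"每个节点发送自身数据量的{fraction:.0%}")
```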
需要指出的是,上述节点对中的两个计算节点需要交换数据,即二者需要连通。结合上述图3中的例子,计算节点a需要分别与计算节点b、计算节点c连通,计算节点b需要分别与计算节点a、计算节点d连通。而在其他的通信算法中,计算节点a还可能需要与其他计算节点连通,比如在ring算法中,计算节点a还需要与计算节点d连通等。
此外,上述以4个计算节点为例说明。当然还可以理解,在其他的通信算法中,还可包括更多的计算节点,比如8个计算节点、16个计算节点等。
本申请提供的一种分布式训练系统,该分布式训练系统包括K个核心交换机和计算集群。其中,计算集群中包括M个分组,每个分组中包括一个或多个计算节点,每个分组中 计算节点的个数相同或不同。计算节点可认为是物理服务器,计算节点中包括一个或多个计算单元(或称为处理单元),比如CPU、NPU或GPU等。M和K均为大于2的整数。
具体的,K个核心交换机用于连通M个分组中位于不同分组中的计算节点。也即,M个分组中的任两个分组分别包括的两个计算节点,能够通过该K个核心交换机中的某一个或多个核心交换机连通。核心交换机比如是脊(spine)交换机。
图4示例性示出的一种分布式训练系统的示意图,K个核心交换机分别记为核心交换机1至核心交换机K,M个分组分别记为分组1至分组M,每个分组中包括有k个计算节点,以分组1为例,分组1中k个计算节点分别记为计算节点1.1至计算节点1.k,其他分组中计算节点的标号可参见图4所示。
示例性的，位于分组1中的计算节点1.1与位于分组2中的计算节点2.1可通过核心交换机1连通，也即，计算节点1.1可通过核心交换机1与计算节点2.1传输数据。
可选的,分布式训练系统中还包括与M个分组分别对应的M个接入交换机。以M个接入交换机中的任一个接入交换机为例说明,接入交换机用于连通其对应分组中的计算节点与该计算节点所需连通的核心交换机。结合图4中例子,M个接入交换机分别记为接入交换机1至接入交换机M,其中,接入交换机1用于连通计算节点1.1与核心交换机1,或者,接入交换机1用于连通计算节点1.2与核心交换机2等;接入交换机2用于连通计算节点2.1与核心交换机1,或者,接入交换机2用于连通计算节点2.2与核心交换机2等。
其中,接入交换机比如是高性能架顶式(top-of-rack,tor)交换机。
本申请中,可认为接入交换机向上连接有核心交换机,向下连接有计算节点。相应的,计算节点向上连接有接入交换机,核心交换机向下连接有接入交换机。
进一步的，该M个接入交换机中的任一个接入交换机还向下连接其对应分组中的多个计算节点，从而实现该接入交换机连通其对应分组中多个计算节点中的任两个计算节点。仍参见图4所示，分组1中包括计算节点1.1至计算节点1.k，计算节点1.1至计算节点1.k中任两个计算节点能够通过接入交换机1连通；分组2中包括计算节点2.1至计算节点2.k，计算节点2.1至计算节点2.k中任两个计算节点能够通过接入交换机2连通等。
M个接入交换机均向上连接同一个核心交换机,从而该M个接入交换机中任两个接入交换机能够通过该核心交换机连通。结合图4中例子,M个接入交换机均向上连接有核心交换机1,从而M个接入交换机中的任两个接入交换机能够通过核心交换机1连通。
分布式训练系统中任两个计算节点之间能够传输数据,具体参见下述示例1和示例2。
示例1,接入至同一个接入交换机的两个计算节点,可通过该接入交换机传输数据。结合图4中例子,计算节点1.1和计算节点1.2均接入至接入交换机1中,计算节点1.1向计算节点1.2发送数据的路径为:计算节点1.1→接入交换机1→计算节点1.2。本申请中,“→”可表示数据的传输方向。
示例2,接入至不同接入交换机的两个计算节点,可通过各自接入的接入交换机,以及该两个接入交换机共同接入的核心交换机传输数据。结合图4中例子,计算节点1.1接入至接入交换机1,计算节点2.1接入至接入交换机2,且接入交换机1和接入交换机2均接入至核心交换机1,计算节点1.1向计算节点2.1发送数据的路径为:计算节点1.1→接入交换机1→核心交换机1→接入交换机2→计算节点2.1。
本申请中,位于同一个分组的两个计算节点可通过该分组对应的接入交换机进行组内 通信,组内通信所经过的路径可称为是组内路径,该传输方式可称为是组内传输方式。相应的,位于不同分组的两个计算节点可通过该不同分组分别对应的接入交换机,以及核心交换机进行组间通信,组间通信所经过的路径可称为是组间路径,该传输方式可称为是组间传输方式。
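结合上述示例1和示例2，组内路径与组间路径的构成可以用如下示意性的Python草图表达（其中按节点名称解析分组号、以及默认的核心交换机取值，均为说明用途的假设，实际由管理节点根据网络拓扑确定）：
```python
# 示意性草图（假设的命名方式）：根据两个计算节点是否属于同一分组，
# 构造组内路径或组间路径。节点名形如"计算节点2.1"，其中"2"为分组号。

def build_path(node_a, node_b, core_switch="核心交换机1"):
    group_a = node_a.split("节点")[1].split(".")[0]
    group_b = node_b.split("节点")[1].split(".")[0]
    if group_a == group_b:
        # 组内路径：仅经过该分组对应的接入交换机
        return [node_a, f"接入交换机{group_a}", node_b]
    # 组间路径：经过两端各自的接入交换机以及一个核心交换机
    return [node_a, f"接入交换机{group_a}", core_switch, f"接入交换机{group_b}", node_b]

if __name__ == "__main__":
    print(" → ".join(build_path("计算节点1.1", "计算节点1.2")))   # 组内路径
    print(" → ".join(build_path("计算节点1.1", "计算节点2.1")))   # 组间路径
```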
进一步的,接入交换机包括用于向上连接核心交换机的第一端口,以及用于向下连接计算节点的第二端口。计算节点包括用于向上连接接入交换机的第三端口。核心交换机包括用于向下连接接入交换机的第四端口。
示例性的,在接入交换机和核心交换机的连接关系中:每个接入交换机中包括有K个第一端口,该K个第一端口分别向上连接于K个核心交换机中各核心交换机的一个第四端口。每个核心交换机包括有M个第四端口,该M个第四端口分别向下连接于M个接入交换机中各接入交换机的一个第一端口。
示例性的,在接入交换机和计算节点的连接关系中:每个接入交换机中还包括有K个第二端口,该K个第二端口向下连接于该接入交换机对应分组中的计算节点的第三端口。比如,每个计算节点包括有一个第三端口,接入交换机中包括有4个第二端口,接入交换机向下连接有该接入交换机对应分组中的4个计算节点;再比如,每个计算节点包括有8个第三端口,接入交换机中包括有32个第二端口,接入交换机向下连接有该接入交换机对应分组中的4个计算节点等。
如下提供两个具体实现中核心交换机、接入交换机和计算节点的端口连接方式:
方式1,图5a中的核心交换机为4个,接入交换机为32个,每个分组中的计算节点也为4个,即K、k均等于4,M等于32。相应的,每个核心交换机包括32个第四端口(记为第四端口1至第四端口32);每个接入交换机包括4个第一端口(记为第一端口1至第一端口4)和4个第二端口(记为第二端口1至第二端口4);每个计算节点中包括1个第三端口。分布式训练系统中包含的端口连接关系可参见图5a所示。
以接入交换机1为例,接入交换机1的4个第一端口向上分别连接核心交换机1的第四端口1,核心交换机2的第四端口1,核心交换机3的第四端口1,以及核心交换机4的第四端口1。4个第二端口向下分别连接4个计算节点,即计算节点1.1至计算节点1.4。
再以核心交换机1为例,核心交换机1的32个第四端口向下分别连接接入交换机1的第一端口1,接入交换机2的第一端口1,……,接入交换机31的第一端口1,以及接入交换机32的第一端口1。
方式2,图5b中的核心交换机为32个,接入交换机为32个,每个分组中的计算节点为4个,即K等于32,M等于32,k等于4。相应的,每个核心交换机包括32个第四端口(记为第四端口1至第四端口32);每个接入交换机包括32个第一端口(记为第一端口1至第一端口32)和32个第二端口(记为第二端口1至第二端口32);每个计算节点中包括8个第三端口。分布式训练系统中包含的端口连接关系可参见图5b所示。
以接入交换机1为例,接入交换机1的32个第一端口向上分别连接核心交换机1的第四端口1,核心交换机2的第四端口1,核心交换机3的第四端口1,……,核心交换机31的第四端口1,以及核心交换机32的第四端口1。32个第二端口向下分别连接4个计算节点,即计算节点1.1至计算节点1.4。
再以核心交换机1为例,核心交换机1的32个第四端口向下分别连接接入交换机1的第一端口1,接入交换机2的第一端口1,……,接入交换机31的第一端口1,以及接 入交换机32的第一端口1。
进一步的,在接入交换机内部,该K个第一端口和K个第二端口绑定,或者说,在接入交换机的内部,设置K个第一端口和K个第二端口的一一映射关系,从而实现在接入交换机中,从多个第一端口中的某个第一端口输入的数据,从多个第二端口中、该第一端口对应的第二端口输出。仍结合图5a中例子,接入交换机1中,第一端口1至第一端口4分别与第二端口1至第二端口4对应,当接入交换机1通过第一端口1接收数据时,可将该数据由第二端口1输出,当接入交换机1通过第一端口2接收数据时,可将该数据由第二端口2输出。如此,避免接入交换机基于负载均衡原理,将数据通过多个第二端口中某个不确定的第二端口输入至某个不确定的核心交换机中。
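下面用一个示意性的Python草图说明上述第一端口与第二端口的一一绑定关系（类名、端口编号方式均为说明用途的假设，并非本申请限定的实现）：
```python
# 示意性草图（假设）：接入交换机内部第一端口与第二端口的一一绑定关系，
# 使得从第一端口i收到的数据固定由第二端口i输出（反向同理），
# 而不是基于负载均衡在多个端口中任意选择出端口。

class AccessSwitch:
    def __init__(self, num_ports):
        # 绑定关系：第一端口i <-> 第二端口i
        self.first_to_second = {i: i for i in range(1, num_ports + 1)}
        self.second_to_first = {i: i for i in range(1, num_ports + 1)}

    def forward_downlink(self, first_port):
        """核心交换机侧（第一端口）进入的数据，由绑定的第二端口发往计算节点。"""
        return self.first_to_second[first_port]

    def forward_uplink(self, second_port):
        """计算节点侧（第二端口）进入的数据，由绑定的第一端口发往对应核心交换机。"""
        return self.second_to_first[second_port]

if __name__ == "__main__":
    sw = AccessSwitch(num_ports=4)
    print(sw.forward_downlink(1))   # 1：第一端口1收到的数据由第二端口1输出
    print(sw.forward_uplink(2))     # 2：第二端口2收到的数据由第一端口2输出
```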
还可认为,图4或图5a或图5b示出的分布式训练系统包括一个核心层和一个接入层,其中,该一个核心层中包括核心交换机1至核心交换机K,该一个接入层中包括接入交换机1至接入交换机M。此外,本申请还可包括多个核心层,该多个核心层位于接入层之上,用于实现接入层中任两个接入交换机连通。其中,该多个核心层的任两个相邻核心层中,上一核心层中的一个或多个核心交换机,用于下一核心层中的任两个核心交换机连通。
也可以理解,多个核心层用于实现计算集群中任两个计算节点之间的连通。
图6为本申请示例性提供的再一种分布式训练系统的架构示意图,该分布式训练系统中包括两个核心层,该两个核心层可分别记为第一核心层和第二核心层。其中,第二核心层位于第一核心层之上,第一核心层位于接入层之上。
第二核心层中包括一个或多个核心交换机(图6示出两个核心交换机,分别表示为核心交换机A和核心交换机B)。第一核心层中包括K个核心交换机(图6中仍表示为核心交换机1至核心交换机K)。接入层中包括M个接入交换机(图6中仍表示为接入交换机1至接入交换机M),且接入交换机1至接入交换机M分别对应于分组1至分组M,每个分组中仍包括一个或多个计算节点。
对于第二核心层来说,该第二核心层中的一个或多个核心交换机用于实现第一核心层中任两个核心交换机之间连通;对于第一核心层来说,该第一核心层中的K个核心交换机用于实现接入层中任两个接入交换机之间连通;对于接入层来说,该接入层中的M个接入交换机用于实现各自分组中任两个计算节点之间连通。具体连通方式可参见关于图4或图5a或图5b实施例中的论述,不再赘述。
进一步的,分布式训练系统中还包括管理节点。
在图4至图6的任一个图中,管理节点是独立于计算集群的一个节点,该节点与计算集群中的多个计算节点分别连接,以用于管理计算集群中的各个计算节点。在一个具体实现中,管理节点比如是计算机,或者是安装在计算机上的模块,比如插件。
又或者,管理节点是计算集群中的计算节点,该计算节点与计算集群中其他的多个计算节点分别连接,不仅具备管理计算集群中的该其他的多个计算节点的能力,还具备其他计算节点的计算能力。在一个具体实现中,管理节点比如是物理服务器,其中包括一个或多个计算单元(或称为处理单元),比如CPU、NPU或GPU等。
又或者,管理节点中包括多个功能模块,多个功能模块中的部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。
具体的,管理节点用于从计算机群中选择出N个用于进行分布式训练的计算节点,进而根据该N个计算节点,生成通信规划。管理节点还用于将通信规划指示给该N个计算节点,以使得该N个计算节点在分布式训练过程中执行聚合算法,以得到聚合之后数据。
参照图7示例性示出的一种分布式训练方法的流程示意图说明。
步骤701,管理节点获取网络拓扑。
其中,网络拓扑包括核心交换机和计算集群中的计算节点的连通关系。结合图5a中例子,管理节点获取的网络拓扑比如包括:
拓扑1:计算节点1.1、计算节点2.1、……、计算节点32.1均与核心交换机1连通;
拓扑2:计算节点1.2、计算节点2.2、……、计算节点32.2均与核心交换机2连通等。
可选的,网络拓扑中还包括接入交换机分别与核心交换机和计算集群中的计算节点的连通关系。结合图5a中例子,管理节点获取的网络拓扑中:
拓扑1进一步包括:
拓扑1-1,计算节点1.1通过接入交换机1与核心交换机1连通;
拓扑1-2,计算节点2.1通过接入交换机2与核心交换机1连通;
拓扑1-3,计算节点3.1通过接入交换机3与核心交换机1连通等。
拓扑2进一步包括:
拓扑2-1,计算节点1.2通过接入交换机1与核心交换机2连通;
拓扑2-2,计算节点2.2通过接入交换机2与核心交换机2连通;
拓扑2-3,计算节点3.2通过接入交换机3与核心交换机2连通等。
可选的,在分布式训练系统包括多个核心层(比如图6中的第一核心层和第二核心层)时,网络拓扑中不仅包括接入交换机和计算集群中的计算节点的连通关系,以及接入交换机与第一核心层中核心交换机的连通关系,还包括第一核心层中核心交换机与第二核心层中核心交换机的连通关系。结合图6中例子,管理节点获取的网络拓扑中不仅包括上述拓扑1和拓扑2等,还包括如下拓扑A和拓扑B:
拓扑A:核心交换机1、核心交换机2、……、核心交换机K均与核心交换机A连通;
拓扑B:核心交换机1、核心交换机2、……、核心交换机K均与核心交换机B连通。
当然,上面仅是示例性示出网络拓扑的形式,管理节点获取到的网络拓扑还可以是其他形式,本申请不限定。
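作为一种可能的表示方式，网络拓扑可以用嵌套字典等数据结构描述。下面给出一个与上述拓扑1、拓扑2对应的示意性Python草图（键名与函数名均为说明用途的假设，并非本申请限定的形式）：
```python
# 示意性草图（假设的数据结构）：用嵌套字典描述"核心交换机-接入交换机-计算节点"的连通关系，
# 对应正文中拓扑1、拓扑2的部分内容。

network_topology = {
    "核心交换机1": {
        "接入交换机1": ["计算节点1.1"],
        "接入交换机2": ["计算节点2.1"],
        "接入交换机3": ["计算节点3.1"],
    },
    "核心交换机2": {
        "接入交换机1": ["计算节点1.2"],
        "接入交换机2": ["计算节点2.2"],
        "接入交换机3": ["计算节点3.2"],
    },
}

def connected_via(core_switch, node):
    """返回计算节点node经由哪个接入交换机与core_switch连通；无连通关系时返回None。"""
    for access_switch, nodes in network_topology.get(core_switch, {}).items():
        if node in nodes:
            return access_switch
    return None

if __name__ == "__main__":
    print(connected_via("核心交换机1", "计算节点2.1"))   # 接入交换机2
```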
步骤702,管理节点根据网络拓扑,确定N个计算节点之间的通信规划。
该N个计算节点用于在分布式系统中,共同训练某个模型(称为目标模型)。
其中,通信规划包括X条组间路径(记为组间路径1至组间路径X),其中X为大于2的整数。进一步的,每条组间路径中包括N个计算节点中、属于不同分组的两个计算节点,以及用于连通该两个计算节点的核心交换机。结合拓扑1举例,组间路径1中包括计算节点1.1、核心交换机1和计算节点2.1;结合拓扑2举例,组间路径2中包括计算节点2.2、核心交换机2和计算节点32.2。
X条组间路径中每条组间路径可用于传输该组间路径中两个计算节点之间的数据。比如,组间路径1中包括计算节点1.1和计算节点2.1,组间路径1用于传输计算节点1.1和计算节点2.1之间的数据;再比如,组间路径2中包括计算节点2.2和计算节点32.2,组间路径2用于传输计算节点2.2和计算节点32.2之间的数据。
为避免核心交换机的端口出现流量拥塞,管理节点根据网络拓扑,确定的X条组间路径中传输的数据量需要符合预设条件。
针对X条组间路径中一条组间路径来说：该组间路径在经过该组间路径所包括的核心交换机时，具体经过的是，核心交换机的一个输入端口和一个输出端口。在一个可能方式中，将组间路径所经过的核心交换机的输出端口作为流量端口，该流量端口的数据流量（或称为流量）用于衡量该组间路径中传输的数据量是否符合预设条件。其中，该流量端口的数据流量与该组间路径中两个计算节点之间传输数据的数据量关联。
相应的,X条组间路径分别包括Y个流量端口,其中,Y为大于2的整数。一个示例中,X条组间路径分别对应的X个流量端口中不存在相同的流量端口,也即X等于Y。再一个示例中,X条组间路径分别对应的X个流量端口中存在相同的流量端口,即X条组间路径中有两条或两条以上的组间路径对应于同一个流量端口,也即X大于Y。
X条组间路径分别传输的数据量符合预设条件,具体是,Y个流量端口中,任两个流量端口的数据流量的差值小于阈值。
举例来说,X条组间路径具体是组间路径1至组间路径10,即X等于10。组间路径1至组间路径10分别对应于流量端口1至流量端口10,即Y等于10,其中,流量端口1至流量端口10中任两个流量端口的数据流量的差值小于阈值。或者,组间路径1至组间路径6分别对应于流量端口1至流量端口6,组间路径7和组间路径8对应于同一个流量端口7,组间路径9和组间路径10对应于同一个流量端口8,即Y等于8,其中,流量端口1至流量端口8中任两个流量端口的数据流量的差值小于阈值。
结合图5a中例子解释:组间路径1中包括计算节点1.1、核心交换机1和计算节点2.1。核心交换机1通过核心交换机1的第四端口1接收计算节点1.1的数据,并通过核心交换机1的第四端口2输出至计算节点2.1。核心交换机1的第四端口2即为核心交换机1的流量端口(记为流量端口1),其中,流量端口1的数据流量与计算节点1.1、计算节点2.1二者之间交换的数据量关联。组间路径2中包括计算节点2.2、核心交换机2和计算节点32.2。核心交换机2通过核心交换机2的第四端口2接收计算节点2.2的数据,并通过核心交换机2的第四端口32输出至计算节点32.2。核心交换机2的第四端口32即为核心交换机2的流量端口(记为流量端口2),其中,流量端口2的数据流量与计算节点2.2、计算节点32.2二者之间交换的数据量关联。组间路径1和组间路径2符合预设条件,具体是,流量端口1的数据流量与流量端口2的数据流量的差值小于阈值。
需要说明的是,在分布式训练系统包括多个核心层时,具体是,针对同一个核心层来说,该Y个流量端口中,任两个流量端口的数据流量的差值小于阈值。不同核心层对应的阈值相同或不同。
结合图6中例子举例,X条组间路径具体是组间路径1至组间路径5,即X等于5,在第一核心层中,组间路径1至组间路径3分别对应于核心交换机1的流量端口11至流量端口13,组间路径4、组间路径5对应于核心交换机2的同一个流量端口21;在第二核心层中,组间路径1至组间路径5分别对应于核心交换机A的流量端口A1至流量端口A5。进一步的,第一核心层对应于阈值1,第二核心层对应于阈值2,那么,流量端口11至流量端口13、流量端口21中的任两个流量端口的数据流量的差值小于阈值1,流量端口A1至流量端口A5中的任两个流量端口的数据流量的差值小于阈值2。
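上述“同一核心层内任两个流量端口的流量差值均小于该层对应阈值”的预设条件，可以用如下示意性的Python草图进行校验（函数名、示例数值均为说明用途的假设）：
```python
# 示意性草图（假设）：校验"多条组间路径分别传输的数据量符合预设条件"，
# 即对每一核心层，该层各流量端口之间任两个端口的流量差值均小于该层的阈值。

def meets_preset_condition(port_traffic_by_layer, thresholds):
    """port_traffic_by_layer: {核心层名: {流量端口名: 数据流量}}；thresholds: {核心层名: 阈值}。"""
    for layer, port_traffic in port_traffic_by_layer.items():
        volumes = list(port_traffic.values())
        # 任两个端口差值小于阈值，等价于最大流量与最小流量之差小于阈值
        if volumes and max(volumes) - min(volumes) >= thresholds[layer]:
            return False
    return True

if __name__ == "__main__":
    traffic = {
        "第一核心层": {"流量端口11": 8, "流量端口12": 9, "流量端口13": 8, "流量端口21": 10},
        "第二核心层": {"流量端口A1": 4, "流量端口A2": 4, "流量端口A3": 5},
    }
    print(meets_preset_condition(traffic, {"第一核心层": 4, "第二核心层": 2}))   # True
```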
为了更好的达到上述预设条件,管理节点在确定N个计算节点之间的通信规划时,具 体可以是,确定X条组间路径中的任两条组间路径分别包含有不同的核心交换机。也可以理解,X条组间路径分别对应于X个核心交换机,从而X个不同核心交换机分别传输X条组间路径对应的数据,避免出现某个核心交换机需要同时传输多条组间路径对应的数据,从而导致流量拥塞的问题。
管理节点在确定N个计算节点之间的通信规划时,具体还可以是,在确定X条组间路径中存在某两条组间路径包含有相同核心交换机的情况下,确定该相同核心交换机在该两条组间路径中的流量端口不同,比如管理节点确定组间路径1和组间路径2均经过核心交换机1,则管理节点可进一步确定组间路径1经过核心交换机1的流量端口11,组间路径2经过核心交换机1的流量端口12。如此,即使存在某个核心交换机需要同时传输多条组间路径对应的数据,但该核心交换机可通过两个不同的流量端口来传输该两个数据,同样可避免出现流量拥塞的问题。
可选的,网络拓扑还包括接入交换机分别与核心交换机、计算集群中计算节点的连通关系,也即,网络拓扑中具体包括核心交换机、计算集群中计算节点,以及接入交换机的连通关系。管理节点根据网络拓扑,确定的组间路径中还包括该两个计算节点所属分组对应的接入交换机,其中接入交换机即用于连通核心交换机和该接入交换机下的计算节点。
仍结合拓扑1举例，组间路径1中包括计算节点1.1、接入交换机1、核心交换机1、接入交换机2和计算节点2.1，也可以将组间路径1表示为：计算节点1.1↔接入交换机1↔核心交换机1↔接入交换机2↔计算节点2.1，其中，“↔”表示双向传输，比如“计算节点1.1↔接入交换机1”表示，计算节点1.1能够向接入交换机1传数据，接入交换机1也能够向计算节点1.1传数据。其中，接入交换机1用于连通计算节点1.1和核心交换机1；接入交换机2用于连通计算节点2.1和核心交换机1。
此外，管理节点在确定N个计算节点之间的通信规划时，不仅能够确定X个组间路径，还能确定Z个组内路径，Z为大于2的整数。其中，对于任一条组内路径来说，该组内路径中包括N个计算节点中、属于同一个分组的两个计算节点，以及用于连通该两个计算节点的接入交换机（或者说，该分组对应的接入交换机）。结合上述拓扑1举例，管理节点确定组内路径1中包括计算节点1.1、接入交换机1和计算节点1.2，或者，将组内路径1表示为：计算节点1.1↔接入交换机1↔计算节点1.2，其中，“↔”表示双向传输。
进一步的,由于组内路径所经过的交换机层数小于组间路径所经过的交换机层数,相应的,经组内路径传输数据的速度高于经组间路径传输数据的速度;且交换机内部设置有各输入端口至各输出端口的流量路线,并不会存在多条组内路径的数据流量冲突。如此,管理节点在确定通信规划时,可确定组内路径传输的数据量,大于组间路径传输的数据量。
一种可能方式中,管理节点根据网络拓扑和通信算法,确定N个计算节点之间的通信规划。通信算法用于在该分布式训练的过程中,聚合该N个计算节点在每轮迭代中分别执行模型训练而得到的数据,从而该N个计算节点根据聚合后的数据,进行下一轮模型训练,以得到最终的目标模型。通信算法比如是ring算法、HD算法、binary tree算法等。
如图8为本申请示例性提供的一种管理节点确定通信规划的流程示意图:
步骤801,管理节点获取训练任务,训练任务中包括通信算法和计算节点总数N。
一个具体实现中,用户在准备使用计算集群训练某个目标模型时,可在前端界面中输入分布式训练所需的计算节点总数N和通信算法。相应的,前端界面基于用户输入,生成训练任务,并向管理节点发送该训练任务。
可选的,训练任务中还包括计算节点的资源类型、训练任务的参数、任务优先级等,其中资源类型包括GPU、NPU、CPU中的一项或多项;训练任务的参数比如是迭代终止条件(比如迭代次数、梯度条件等)等;任务优先级指示当前训练任务的优先级,优先级越高,则表明训练任务越重要,管理节点需要优先为优先级高的训练任务选择计算节点。
步骤802,管理节点根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和该N个计算节点之间的通信规划。
计算集群包括被占用的计算节点,和处于空闲状态的多个计算节点。一个可能方式中,管理节点获取当前计算集群中处于空闲状态的多个计算节点是哪些,然后根据网络拓扑、计算节点总数N和通信算法,从这些处于空闲状态的多个计算节点中选择N个计算节点,进而确定该N个计算节点之间的通信规划。该方案中,管理节点先选择出N个计算节点,再对选择出的计算节点进行通信规划,降低通信规划过程中的计算量。
管理节点在从当前计算集群中处于空闲状态的多个计算节点中选择N个计算节点时,可以基于亲和性原则选择,即,尽量选择处于同一个分组中的计算节点,以提高组内传输(组内路径)方式在每次迭代中的占比,相应的,降低组间传输方式(组间路径)在每次迭代中的占比,避免过多的组间传输方式而导致核心交换机的端口出现流量拥塞。
进一步的,管理节点在选择出N个计算节点之后,还可执行如下步骤a至步骤c,以合理规划出N个计算节点在通信算法中的通信方式(即通信规划)。
步骤a,管理节点将N个计算节点中、属于同一个分组的两个计算节点配对;在剩余尚未配对的多个计算节点时,将尚未配对的多个计算节点配对,以得到的N/2个节点对。
也即,管理节点需要先将N个计算节点进行节点配对,并尽可能地将位于同一个分组中的两个计算节点配对,若在将位于同一个分组中的两个计算节点均配对完成之后,仍存在尚未配对的、多个位于不同分组中的两个计算节点,则将该多个位于不同分组中的两个计算节点进行节点配对,从而得到N/2个节点对。
举例来说,选择出16个计算节点,分别是,
分组1中的计算节点1.1、计算节点1.2、计算节点1.3、计算节点1.4、计算节点1.5;
分组2中的计算节点2.1、计算节点2.2、计算节点2.3;
分组3中的计算节点3.1、计算节点3.2、计算节点3.5、计算节点3.6;
分组4中的计算节点4.1、计算节点4.2、计算节点4.3、计算节点4.4。
管理节点在进行节点配对时,可先将分组1中的计算节点配对得到:(计算节点1.1、计算节点1.2)、(计算节点1.3、计算节点1.4);分组2中的计算节点配对得到:(计算节点2.1、计算节点2.2);分组3中的计算节点配对得到:(计算节点3.1、计算节点3.2)、(计算节点3.5、计算节点3.6);分组4中的计算节点配对得到:(计算节点4.1、计算节点4.2)、(计算节点4.3、计算节点4.4)。进一步的,剩余尚未配对的计算节点1.5和计算节点2.3,管理节点将该两个计算节点配对得到(计算节点1.5、计算节点2.3)。
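上述步骤a中“同分组优先配对、剩余节点再跨分组配对”的过程，可以用如下示意性的Python草图表达（其中按节点名称解析分组号仅为说明用途的假设）：
```python
# 示意性草图（假设）：将N个计算节点配对，优先将同一分组内的节点两两配对，
# 剩余未配对的节点（跨分组）再两两配对，得到N/2个节点对。
from collections import defaultdict

def pair_nodes(nodes):
    """nodes：形如"计算节点1.5"的名称列表，N为偶数。"""
    by_group = defaultdict(list)
    for node in nodes:
        group = node.split("节点")[1].split(".")[0]
        by_group[group].append(node)

    pairs, leftovers = [], []
    for members in by_group.values():
        while len(members) >= 2:                       # 组内两两配对
            pairs.append((members.pop(0), members.pop(0)))
        leftovers.extend(members)                      # 该分组剩余的单个节点
    while len(leftovers) >= 2:                         # 剩余节点跨分组两两配对
        pairs.append((leftovers.pop(0), leftovers.pop(0)))
    return pairs

if __name__ == "__main__":
    nodes = ["计算节点1.1", "计算节点1.2", "计算节点1.5",
             "计算节点2.1", "计算节点2.2", "计算节点2.3"]
    print(pair_nodes(nodes))
```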
步骤b,管理节点根据通信算法的多轮通信和N/2个节点对,确定N个计算节点分别在多轮通信中的通信规划。
其中,对于任一轮通信中的通信规划,通信规划中两个计算节点所传输的数据量越大,通信规划中包括的组间路径数越小。可以理解,在组内通信方式中,两个计算节点通过接入交换机进行数据传输,无需经过核心交换机,所以存在流量拥塞的可能性较小,所以管理节点在进行通信规划时,将需要传输较大数据量的步骤,采用相对较多的组内通信完成 (或者说相对较少的组间通信完成),以避免出现流量拥塞的问题。结合图3中例子,可以理解,reduce-scatter中的步骤1需要传输的数据量大于步骤2需要传输的数据量,所以步骤1中包括的组间路径小于步骤2中包括的组间路径。
举例来说，管理节点获取到的N/2个节点对如上述步骤a中的例子，则HD算法中reduce-scatter包括4个步骤，分别表示为S1至S4，HD算法中allgather包括4个步骤，分别表示为S5至S8，也即HD算法共计有8轮通信，其中，reduce-scatter中S1至S4、allgather中S5至S8的说明可参见上述图3相关实施例中描述。管理节点可基于HD算法，确定16个计算节点分别在该8轮通信中的通信规划。
为方便描述,参见图9示例性示出的一种基于HD算法的通信关系图解释。其中,图9中(a)示出的立方体中,顶点和棱分别代表HD算法中的步骤,顶点对应于节点对,管理节点根据该顶点对应的节点对确定该各步骤对应的通信规划。
图9中(a)尚未关联立方体中顶点和节点对,管理节点可根据如下方式确定顶点和节点对的关联关系:
管理节点将任一个节点对放置在立方体的某个顶点上,比如将(计算节点1.1,计算节点1.2)放置在立方体的第一顶点处,本申请中为方便描述,将(计算节点1.1,计算节点1.2)表示为(1.1,1.2),其他类似,不再赘述。其中,第一顶点代表reduce-scatter中的S1,与第一顶点连接的三个棱分别代表reduce-scatter中的S2、S3和S4;或者,第一顶点代表allgather中的S8,与第一顶点连接的三个棱分别代表allgather中S7、S6和S5。如下,均以reduce-scatter为例说明。
其中,S2传输的数据量比S3或S4传输的数据量大,则优先为S2对应的棱上的第二顶点确定节点对,优先选择与第一顶点上节点对中计算节点位于同一个分组中的计算节点所属的节点对,比如选择(1.3,1.4)放置到第二顶点上。
进一步的,与第一顶点连接的、剩余的两个棱分别代表S3和S4,与第二顶点连接的、剩余的两个棱分别代表S3和S4,其中,S3传输的数据量比S4传输的数据量大,则优先为S3对应棱上的顶点确定节点对,比如先为第一顶点的S3选择节点对,仍优先选择与第一顶点上节点对中计算节点位于同一个分组中的计算节点所属的节点对,比如选择(1.5,2.3)放置到第三顶点上;随后,管理节点再为第二顶点的S3选择节点对,以此类推,即可为该立方体的8个顶点分别赋予对应的节点对,以得到图9中(b)示出的对应关系。
结合图9中(b),其中S1对应的第一顶点中对应于(1.1,1.2),即,该计算节点1.1与计算节点1.2在S1中通信。
其中一个S2对应的棱连接有两个顶点,分别是(1.1,1.2)和(1.3,1.4),该两个顶点中位于相对应位置的两个计算节点分别是计算节点1.1与计算节点1.3,以及计算节点1.2与计算节点1.4。相应的,该计算节点1.1与计算节点1.3在S2中通信;计算节点1.2与计算节点1.4在S2中通信。
其中一个S4对应棱连接的两个顶点,分别是(1.1,1.2)和(3.1,3.2),该两个顶点中位于相对应位置的两个计算节点分别是计算节点1.1与计算节点3.1,以及计算节点1.2与计算节点3.2。相应的,计算节点1.1与计算节点3.1在S4中通信,计算节点1.2与计算节点3.2在S4中通信。
在另外一个S4对应棱连接的两个顶点,分别是(1.5,2.3)和(4.1,4.2),该两个顶点中位于相对应位置的两个计算节点分别是计算节点1.5与计算节点4.1,以及计算节点 2.3与计算节点4.2。相应的,计算节点1.5与计算节点4.1在S4中通信,计算节点2.3与计算节点4.2在S4中通信。
步骤c,管理节点若确定在多轮通信中的第i轮通信中,N个计算节点的通信规划中包括多条组间路径,且多条组间路径分别传输的数据量不符合预设条件,则调整第i轮通信中N个计算节点的通信规划,i为正整数。
仍结合上述例子,在S4(即i=4)中:
计算节点4.1在向计算节点1.5发送数据时,经过的路径是计算节点4.1→接入交换机4→核心交换机1→接入交换机1→计算节点1.5,具体的,经过核心交换机1的第四端口1;
计算节点3.1在向计算节点1.1发送数据时,经过的路径是计算节点3.1→接入交换机3→核心交换机1→接入交换机1→计算节点1.1,具体的,经过核心交换机1的第四端口1。
如此,存在该两个组间路径均经过核心交换机1的第四端口1,即核心交换机1的第四端口1存在流量拥塞,即存在该两条组间路径分别传输的数据量不符合预设条件。
为此,管理节点可通过调整该步骤中N个计算节点的通信规划,来使得多条组间路径分别传输的数据量符合预设条件。比如,调整步骤a中节点对,比如交换节点对(计算节点4.1、计算节点4.2)与节点对(计算节点4.3、计算节点4.4)的顺序,交换后的对应关系参见图9中(c)。进一步的,在S4中:
计算节点4.3向计算节点1.5发送数据,经过的路径是计算节点4.3→接入交换机4→核心交换机3→接入交换机1→计算节点1.5,具体的,经过核心交换机3的第四端口1;
计算节点3.5向计算节点1.3发送数据,经过的路径是计算节点3.5→接入交换机3→核心交换机5→接入交换机1→计算节点1.3,具体的,经过核心交换机5的第四端口1;
计算节点3.6向计算节点1.4发送数据,经过的路径是计算节点3.6→接入交换机3→核心交换机6→接入交换机1→计算节点1.4,具体的,经过核心交换机6的第四端口1;
计算节点3.1向计算节点1.1发送数据,经过的路径是计算节点3.1→接入交换机3→核心交换机1→接入交换机1→计算节点1.1,具体的,经过核心交换机1的第四端口1;
计算节点3.2向计算节点1.2发送数据,经过的路径是计算节点3.2→接入交换机3→核心交换机2→接入交换机1→计算节点1.2,具体的,经过核心交换机2的第四端口1。
依次分析,在交换后实现多条组间路径分别传输的数据量符合预设条件。
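对于上述“检测某轮通信中多条组间路径是否经过同一核心交换机的同一流量端口、若冲突则调整通信规划”的过程，下面给出一个简化的示意性Python草图（只检测端口是否重复，未体现完整的阈值校验；数据结构与名称均为说明用途的假设）：
```python
# 示意性草图（假设）：检测某一轮通信的多条组间路径是否存在流量端口冲突
# （即两条及以上组间路径经过同一核心交换机的同一流量端口），
# 若存在冲突，则需要如正文所述调整该轮通信中计算节点的配对/通信规划。
from collections import Counter

def find_port_conflicts(inter_group_paths):
    """inter_group_paths：每条组间路径表示为(核心交换机, 流量端口)的二元组。"""
    counts = Counter(inter_group_paths)
    return [port for port, times in counts.items() if times > 1]

if __name__ == "__main__":
    # 调整前：两条组间路径均经过核心交换机1的第四端口1，存在冲突
    before = [("核心交换机1", "第四端口1"), ("核心交换机1", "第四端口1"),
              ("核心交换机2", "第四端口1")]
    # 调整后（例如交换节点对的顺序）：各组间路径经过不同核心交换机的端口
    after = [("核心交换机3", "第四端口1"), ("核心交换机1", "第四端口1"),
             ("核心交换机2", "第四端口1")]
    print(find_port_conflicts(before))   # [('核心交换机1', '第四端口1')]
    print(find_port_conflicts(after))    # []
```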
进一步的,allgather中S8、S7、S6和S5的通信规划,分别与reduce-scatter中S1、S2、S3和S4的通信规划相同,比如,在S7中,计算节点3.1与计算节点3.5通信,计算节点3.2与计算节点3.6通信等;在S6中,计算节点1.1与计算节点1.5通信,计算节点1.2与计算节点2.3通信等;在S5中,计算节点1.3与计算节点3.5通信,计算节点1.4与计算节点3.6通信等。由于S8、S7、S6和S5的通信规划,分别与S1、S2、S3和S4的通信规划相同,所以在图9中(c)未示出S8、S7、S6和S5的通信规划。
需要指出的是，管理节点还需要确定各组间路径中两个计算节点之间需要传输的数据是什么。结合图9中(c)示出的HD算法的通信关系图举例，在S1中，计算节点1.1与计算节点1.3通信，具体包括，计算节点1.1将自己的中间数据的一半发送至计算节点1.3。相应的，管理节点确定的通信规划中不仅包括组内路径“计算节点1.1↔接入交换机1↔计算节点1.3”，还包括计算节点1.1向计算节点1.3待发送的数据的指示信息（比如中间数据的一半）。
管理节点在确定出通信规划之后,还可根据通信规划确定规划信息,该规划信息中包括多条组间路径分别对应的路径信息,其中,该组间路径对应的路径信息指示该组间路径用于该组间路径中的两个计算节点相互传输数据。随后管理节点分别向N个计算节点发送该规划信息,该N个计算节点中每个计算节点根据接收到的规划信息,确定需要向哪个计算节点发送什么数据,和/或,确定需要接收来自哪个计算节点什么数据。
具体的,本申请中还可以包括步骤703至步骤705:
以多条组间路径中的第一组间路径为例说明，其中，第一组间路径包括第一计算节点、第二计算节点和第一核心交换机。可选的，第一组间路径中还包括第一计算节点所属分组对应的第一接入交换机，和第二计算节点所属分组对应的第二接入交换机。
步骤703,管理节点分别向第一计算节点、第二计算节点发送第一信息。
管理节点确定的规划信息中包括第一组间路径对应的路径信息(记为第一信息),其中第一信息指示第一组间路径用于第一计算节点与第二计算节点之间相互传输数据。
示例性的,第一信息中包括第一组间路径,或者,包括第一计算节点和第二计算节点;第一信息中还包括第一计算节点向第二计算节点待发送的数据的指示信息,和/或,第二计算节点向第一计算节点待发送的数据的指示信息。
一个具体实现中,管理节点分别向第一计算节点、第二计算节点发送规划信息,相应的,第一计算节点和第二计算节点分别接收来自管理节点的规划信息,从规划信息中获取第一信息。再一个具体实现中,管理节点直接向第一计算节点、第二计算节点发送第一信息,相应的,第一计算节点和第二计算节点接收来自管理节点的第一信息。
步骤704,第一计算节点根据第一信息确定待发送至第二计算节点的数据(记为第一数据),并向第二计算节点发送第一数据。
相应的,第二计算节点接收来自第一计算节点的数据,并根据第一信息确定该接收到的数据是来自第一计算节点的第一数据,随后,第二计算节点将第一数据更新至本地。
步骤705,第二计算节点根据第一信息确定待发送至第一计算节点的数据(记为第1数据),并向第一计算节点发送第1数据。
相应的,第一计算节点接收来自第二计算节点的数据,并根据第一信息确定接收到的数据是来自第二计算节点的第1数据,随后,第一计算节点将第1数据更新至本地。
其中,第一计算节点在向第二计算节点发送数据时,具体是,第一计算节点向第一接入交换机发送第一数据,第一接入交换机向第一核心交换机发送第一数据,第一核心交换机向第二接入交换机发送第一数据,第二接入交换机向第二计算节点发送第一数据。
可以理解的是,第一计算节点在发送第一数据时,直接将第一数据传输至其连接的第一接入交换机中,第一接入交换机中第一端口和第二端口相互绑定,所以第一接入交换机在接收到第一数据之后,直接将第一数据通过与接收第一数据的第一端口绑定的第二端口输出,进一步的,第一接入交换机将第一数据输出至与第二端口连接的核心交换机中。同理的,核心交换机、第二接入交换机也是根据已有的连接关系,或者内部绑定关系,确定将第一数据传输至第二计算节点中。如此,组间路径所涉及的计算节点、核心交换机、接入交换机均按照已有的路径传输数据,保障数据传输的有序性,避免在组间通信的过程中核心交换机的端口出现流量拥塞。该说明同样适用于第二计算节点向第一计算节点发送数据1的情况,不再赘述。
具体的,本申请中还可以包括步骤706至步骤708:
以多条组内路径中包括的第一组内路径为例说明,其中,第一组内路径包括第一计算节点、第三计算节点和第一接入交换机。
步骤706,管理节点分别向第一计算节点、第三计算节点发送第二信息。
管理节点确定的规划信息中包括第一组内路径对应的路径信息(记为第二信息),其中第二信息指示第一组内路径用于第一计算节点与第三计算节点之间相互传输数据。
示例性的,第二信息中包括第一组内路径,或者,包括第一计算节点和第三计算节点;第二信息中还包括第一计算节点向第三计算节点待发送的数据的指示信息,和/或,第三计算节点向第一计算节点待发送的数据的指示信息。
一个具体实现中,管理节点分别向第一计算节点、第三计算节点发送规划信息,相应的,第一计算节点和第三计算节点分别接收来自管理节点的规划信息,从规划信息中获取第二信息。再一个具体实现中,管理节点直接向第一计算节点、第三计算节点分别发送第二信息,相应的,第一计算节点和第三计算节点接收来自管理节点的第二信息。
步骤707,第一计算节点根据第二信息确定待发送至第三计算节点的数据(记为第二数据),并向第三计算节点发送第二数据。
相应的,第三计算节点接收来自第一计算节点的数据,并根据第二信息确定接收到的数据是来自第一计算节点的第二数据,将第二数据更新至本地。
步骤708,第三计算节点根据第二信息确定待发送至第一计算节点的数据(记为第2数据),并向第一计算节点发送第2数据。
相应的,第一计算节点接收来自第三计算节点的数据,并根据第二信息确定接收到的数据是来自第三计算节点的第2数据,随后,第一计算节点将第2数据更新至本地。
需要指出的是,第一计算节点在向第三计算节点发送数据时,具体是,第一计算节点向第一计算节点所属的接入交换机(即第一接入交换机)发送第二数据。第一接入交换机与第三计算节点连通,第一接入交换机向第三计算节点发送第二数据。
可以理解的是，图7划分为规划阶段和训练阶段，其中，规划阶段包括：步骤701至步骤703、步骤706；训练阶段包括：步骤704、步骤705、步骤707和步骤708，其中，步骤704、步骤705是位于不同分组的两个计算节点进行数据传输的步骤；步骤707和步骤708是位于同一个分组的两个计算节点进行数据传输的步骤。
还需要指出的是,在管理节点是N个计算节点中的一个时,该管理节点分别向其他N-1计算节点发送该规划信息,如此,该N个计算节点中每个计算节点根据规划信息,确定需要向哪个计算节点发送什么数据,和/或,确定需要接收来自哪个计算节点什么数据,具体实现仍可参见上述步骤703至步骤708。
如图10为本申请示例性提供的一种管理节点的结构示意图,该管理节点中包括任务管理模块1001、资源管理模块1002和训练任务模块1003。
任务管理模块1001获取训练任务,根据训练任务中的通信算法和计算节点总数N,向资源管理模块1002申请资源,也即申请用于进行分布式训练的计算节点。具体的,任务管理模块1001向资源管理模块1002发送任务资源申请,该任务资源申请中包括通信算法和计算节点总数N。其中,任务管理模块1001的功能可参见上述步骤801中描述。
资源管理模块1002接收来自任务管理模块1001的任务资源申请,根据任务资源申请 中通信算法和计算节点总数N,从计算集群中的处于空闲状态的多个计算节点中,选择N个用于进行分布式训练的计算节点,具体实现可参见步骤802中关于管理节点从当前计算集群中处于空闲状态的多个计算节点中选择N个计算节点的实现方式。
可选的,资源管理模块1002接收到多个任务资源申请,每个任务资源申请中包括各自对应的训练任务的优先级,资源管理模块1002根据多个训练任务的优先级,确定先为哪个训练任务申请资源。
资源管理模块1002在选择出该训练任务对应的N个计算节点之后,可向任务管理模块1001返回当前申请的N个计算节点的标识。任务管理模块1001指示训练任务模块1003在N个计算节点中分别启动N个计算节点分别对应的训练任务。
训练任务模块1003还可获取该N个计算节点的网络拓扑,或者获取计算集群的网络拓扑,根据获取到的网络拓扑,确定N个计算节点的通信规划,具体实现可参见步骤802中关于管理节点确定通信规划的描述。进一步的,训练任务模块1003还根据通信规划,确定规划信息,向N个计算节点分别发送规划信息。
需要补充的是,图10示出的存储模块1004用于存储计算机程序指令,当管理节点中模块在执行该存储模块1004中的计算机程序指令时,可执行该模块对应的动作。图10示出的通信模块1005用于管理节点中任两个模块之间通信,比如,任务管理模块1001通过通信模块1005,向资源管理模块1002发送任务资源申请等。
基于上述内容和相同构思,图11和图12为本申请示例性提供的一种可能的分布式训练装置的结构示意图。这些分布式训练装置可以是上述方法实施例中管理节点,用于实现上述方法实施例中管理节点的功能,因此也能实现上述方法实施例所具备的有益效果。
如图11所示,该分布式训练装置1100包括获取模块1101和处理模块1102。
具体的,获取模块1101,用于获取网络拓扑,网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,计算集群中包括M个分组,每个分组中包括一个或多个计算节点;处理模块1102,用于根据网络拓扑,确定N个计算节点之间的通信规划;N个计算节点是计算集群中用于分布式训练目标模型的计算节点;通信规划包括多条组间路径,对于多条组间路径中的每条组间路径:组间路径包括N个计算节点中、属于不同分组的两个计算节点,以及用于连通两个计算节点的核心交换机,组间路径用于传输组间路径中两个计算节点之间的数据;多条组间路径分别传输的数据量符合预设条件;M和N均为大于2的整数。
在一种可能的实现方式中,处理模块1102在根据网络拓扑,确定N个计算节点之间的通信规划时,具体用于:处理模块1102根据网络拓扑和通信算法,确定N个计算节点之间的通信规划;其中,通信算法用于在分布式训练中聚合N个计算节点分别执行训练得到的数据,以得到目标模型。
在一种可能的实现方式中,获取模块1101还用于:获取训练任务,训练任务包括计算节点总数N和通信算法;处理模块1102在根据网络拓扑,确定N个计算节点之间的通信规划时,具体用于:根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和N个计算节点之间的通信规划。
在一种可能的实现方式中,处理模块1102在根据网络拓扑、计算节点总数N和通信算法,从计算集群中处于空闲状态的多个计算节点中,确定N个计算节点和N个计算节点之间的通信规划时,具体用于:根据网络拓扑和计算节点总数N,从计算集群中处于空闲 状态的多个计算节点中,确定N个计算节点;将N个计算节点中、属于同一个分组的两个计算节点配对,以及在剩余尚未配对的多个计算节点时,将尚未配对的多个计算节点配对,以得到的N/2个节点对;根据通信算法的多轮通信和N/2个节点对,确定N个计算节点分别在多轮通信中的通信规划;其中,对于任一轮通信中的通信规划,通信规划中两个计算节点所传输的数据量越大,通信规划中包括的组间路径数越小;若确定在多轮通信中的第i轮通信中,N个计算节点的通信规划中包括多条组间路径,且多条组间路径分别传输的数据量不符合预设条件,则调整第i轮通信中N个计算节点的通信规划,i为正整数。
在一种可能的实现方式中,多条组间路径中包括第一组间路径,第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;分布式训练装置1100还包括发送模块1103;发送模块1103用于:分别向第一计算节点和第二计算节点发送第一信息;其中,第一信息指示第一组间路径用于第一计算节点向第二计算节点发送第一数据。
在一种可能的实现方式中,多条组内路径中包括第一组内路径,第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;分布式训练装置1100还包括发送模块1103;发送模块1103用于:分别向第一计算节点和第三计算节点发送第二信息;其中,第二信息指示第一组内路径用于第一计算节点向第三计算节点发送第二数据。
在一种可能的实现方式中,获取模块1101、处理模块1102和发送模块1103中,可能存在部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。比如,获取模块1101和发送模块1103部署在计算集群的计算节点中,处理模块1102部署在独立于计算集群的外部节点中;或者,获取模块1101部署在计算集群的计算节点中,处理模块1102和发送模块1103部署在独立于计算集群的外部节点中;或者其他方式,本申请不再一一举例。
其中,获取模块1101、处理模块1102和发送模块1103均可以通过软件实现,或者可以通过硬件实现。接下来以处理模块1102为例,介绍处理模块1102的实现方式。类似的,获取模块1101和发送模块1103的实现方式可以参考处理模块1102的实现方式。
模块作为软件功能单元的一种举例,处理模块1102可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,处理模块1102可以包括运行在多个主机/虚拟机/容器上的代码。
需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。
同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。
模块作为硬件功能单元的一种举例,处理模块1102可以包括至少一个计算设备,如服务器等。或者,处理模块1102也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备 等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
处理模块1102包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。处理模块1102包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,处理模块1102包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。
需要说明的是,在其他实施例中,处理模块1102可以用于执行图7或图8方法中的任意步骤,获取模块1101可以用于执行图7或图8方法中的任意步骤,发送模块1103可以用于执行图7或图8方法中的任意步骤,处理模块1102、获取模块1101、以及发送模块1103负责实现的步骤可根据需要指定,通过处理模块1102、获取模块1101、以及发送模块1103分别实现图7或图8方法中不同的步骤来实现分布式训练装置1100的全部功能。
或者也可以理解,获取模块1101和发送模块1103的功能包含于图10示出的通信模块1005的功能中,也即,通信模块1005具备获取模块1101和发送模块1103的功能;处理模块1102具备图10示出的任务管理模块1001、资源管理模块1002和训练任务模块1003的功能,图10和图11之间可相互参照或引用。相应的,任务管理模块1001、资源管理模块1002、训练任务模块1003、存储模块1004、通信模块1005中的部分功能模块部署在计算集群的计算节点中,剩余的其他功能模块部署在独立于计算集群的外部节点中。
如图12所示为本申请实施例提供的分布式训练装置1200,图12所示的分布式训练装置可以为图11所示的装置的一种硬件电路的实现方式。该装置可适用于前面所示出的流程图中,执行上述方法实施例中管理节点的功能。
为了便于说明,图12仅示出了该分布式训练装置1200的主要部件。
本申请还提供一种分布式训练装置1200。如图12所示,分布式训练装置1200包括:总线102、处理器104、存储器106和通信接口108。处理器104、存储器106和通信接口108之间通过总线102通信。分布式训练装置1200可以是服务器或终端设备。应理解,本申请不限定分布式训练装置1200中的处理器、存储器的个数。
总线102可以是外设部件互连标准（peripheral component interconnect，PCI）总线或扩展工业标准结构（extended industry standard architecture，EISA）总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示，图12中仅用一条线表示，但并不表示仅有一根总线或一种类型的总线。总线102可包括在分布式训练装置1200各个部件（例如，存储器106、处理器104、通信接口108）之间传送信息的通路。
处理器104可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
存储器106可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。处理器104还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。
存储器106中存储有可执行的程序代码,处理器104执行该可执行的程序代码以分别实现前述获取模块1101、处理模块1102或发送模块1103的功能,从而实现分布式训练方法。也即,存储器106上存有用于执行上述分布式训练方法的指令。
通信接口108使用例如但不限于网络接口卡、收发器一类的收发模块,来实现分布式训练装置1200与其他设备或通信网络之间的通信。
或者也可以理解,存储器106具备图10示出的存储模块1004的功能,处理器104具备图10示出的任务管理模块1001、资源管理模块1002和训练任务模块1003的功能,总线102和通信接口108具备图10示出的通信模块1005的功能,图10、图11和图12之间可相互参照或引用。
基于上述内容和相同构思,本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行图7或图8相关实施例中的方法。
基于上述内容和相同构思,本申请实施例提供一种计算机程序产品,当计算设备读取并执行计算机程序产品时,使得计算设备实现上述图7或图8相关实施例中的方法。
可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定。
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的保护范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (36)

  1. 一种分布式训练方法,其特征在于,包括:
    获取网络拓扑,所述网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,所述计算集群中包括M个分组,每个分组中包括一个或多个计算节点;
    根据所述网络拓扑,确定N个计算节点之间的通信规划;
    其中,所述N个计算节点是所述计算集群中用于分布式训练目标模型的计算节点;
    所述通信规划包括多条组间路径,对于所述多条组间路径中的每条组间路径:所述组间路径包括所述N个计算节点中、属于不同分组的两个计算节点,以及用于连通所述两个计算节点的核心交换机,所述组间路径用于传输所述组间路径中两个计算节点之间的数据;
    所述多条组间路径分别传输的数据量符合预设条件;
    M和N均为大于2的整数。
  2. 如权利要求1所述的方法,其特征在于,所述根据所述网络拓扑,确定N个计算节点之间的通信规划,包括:
    根据所述网络拓扑和通信算法,确定所述N个计算节点之间的通信规划;
    其中,所述通信算法用于在所述分布式训练中聚合所述N个计算节点分别执行训练得到的数据,以得到所述目标模型。
  3. 如权利要求1或2所述的方法,其特征在于,所述多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;
    所述多条组间路径分别传输的数据量符合预设条件,包括:
    所述多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。
  4. 如权利要求3所述的方法,其特征在于,在每条组间路径包括多级核心交换机时,所述差值小于阈值的任两个流量端口所属的核心交换机属于同一级。
  5. 如权利要求1-4中任一项所述的方法,其特征在于,
    对于所述多条组间路径中的任两条组间路径:
    所述两条组间路径分别包含有不同的核心交换机,或者,所述两条组间路径包含相同的核心交换机,且所述核心交换机在所述两条组间路径中的流量端口不同。
  6. 如权利要求1-5中任一项所述的方法,其特征在于,所述网络拓扑包括所述核心交换机、所述计算集群中计算节点,以及接入交换机的连通关系;
    对于所述多条组间路径中的每条组间路径:
    所述组间路径中还包括所述两个计算节点分别对应的两个接入交换机,所述组间路径中每个计算节点通过所述计算节点对应的接入交换机与所述核心交换机连通。
  7. 如权利要求6所述的方法,其特征在于,所述通信规划中还包括多条组内路径,每条组内路径中包括所述N个计算节点中、属于同一个分组的两个计算节点,以及所述分组对应的接入交换机,所述组内路径用于传输所述组内路径中两个计算节点之间的数据。
  8. 如权利要求7所述的方法,其特征在于,所述组内路径中两个计算节点之间传输数据的数据量,大于所述组间路径中两个计算节点之间传输数据的数据量。
  9. 如权利要求1-8中任一项所述的方法,其特征在于,所述M个分组分别对应于M个 接入交换机;
    针对所述M个接入交换机中每个接入交换机:
    所述接入交换机包括K个第一端口、所述K个第一端口分别对应的K个第二端口;
    所述K个第一端口分别与K个核心交换机连接;
    所述K个第二端口分别与所述接入交换机对应的分组中计算节点的K个端口连接;
    K为大于2的整数。
  10. 如权利要求1-9中任一项所述的方法,其特征在于,所述根据所述网络拓扑,确定N个计算节点之间的通信规划,包括:
    获取训练任务,所述训练任务包括计算节点总数N和通信算法;
    根据所述网络拓扑、所述计算节点总数N和所述通信算法,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点和所述N个计算节点之间的通信规划。
  11. 如权利要求10所述的方法,其特征在于,所述根据所述网络拓扑、所述计算节点总数N和所述通信算法,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点和所述N个计算节点之间的通信规划,包括:
    根据所述网络拓扑和所述计算节点总数N,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点;
    将所述N个计算节点中、属于同一个分组的两个计算节点配对,以及在剩余尚未配对的多个计算节点时,将所述尚未配对的多个计算节点配对,以得到的N/2个节点对;
    根据所述通信算法的多轮通信和所述N/2个节点对,确定所述N个计算节点分别在所述多轮通信中的通信规划;其中,对于任一轮通信中的通信规划,所述通信规划中两个计算节点所传输的数据量越大,所述通信规划中包括的组间路径数越小;
    若确定在所述多轮通信中的第i轮通信中,所述N个计算节点的通信规划中包括多条组间路径,且所述多条组间路径分别传输的数据量不符合所述预设条件,则调整所述第i轮通信中所述N个计算节点的通信规划,i为正整数。
  12. 如权利要求1-11中任一项所述的方法,其特征在于,所述多条组间路径中包括第一组间路径,所述第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;
    所述根据所述网络拓扑,确定N个计算节点之间的通信规划之后,还包括:
    根据所述通信规划,分别向所述第一计算节点和所述第二计算节点发送第一信息;
    其中,所述第一信息指示所述第一组间路径用于所述第一计算节点向所述第二计算节点发送第一数据。
  13. 如权利要求7-12中任一项所述的方法,其特征在于,所述多条组内路径中包括第一组内路径,所述第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;
    所述根据所述网络拓扑,确定N个计算节点之间的通信规划之后,还包括:
    根据所述通信规划,分别向所述第一计算节点和所述第三计算节点发送第二信息;
    其中,所述第二信息指示所述第一组内路径用于所述第一计算节点向所述第三计算节点发送第二数据。
  14. 一种分布式训练系统,其特征在于,包括:
    管理节点、K个核心交换机和计算集群,其中,所述计算集群中包括M个分组,每个分组中包括一个或多个计算节点;
    所述K个核心交换机,用于连通所述M个分组中位于不同分组的计算节点;
    所述管理节点,用于获取网络拓扑,根据所述网络拓扑,确定N个计算节点之间的通信规划,所述网络拓扑包括所述K个核心交换机和所述计算集群中的计算节点的连通关系,所述N个计算节点是所述计算集群中用于分布式训练目标模型的计算节点;
    所述通信规划包括多条组间路径,对于所述多条组间路径中的每条组间路径:所述组间路径包括所述N个计算节点中、属于不同分组的两个计算节点,以及所述K个核心交换机中用于连通所述两个计算节点的核心交换机,所述组间路径用于传输所述组间路径中两个计算节点之间的数据;
    所述多条组间路径分别传输的数据量符合预设条件;
    K、M和N均为大于2的整数。
  15. 如权利要求14所述的系统,其特征在于,所述管理节点在根据所述网络拓扑,确定N个计算节点之间的通信规划时,具体用于:
    根据所述网络拓扑和通信算法,确定所述N个计算节点之间的通信规划;
    其中,所述通信算法用于在所述分布式训练中聚合所述N个计算节点分别执行训练得到的数据,以得到所述目标模型。
  16. 如权利要求14或15所述的系统,其特征在于,所述多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;
    所述多条组间路径分别传输的数据量符合预设条件,包括:
    所述多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。
  17. 如权利要求14-16中任一项所述的系统,其特征在于,还包括:分别与所述M个分组对应的M个接入交换机;所述M个接入交换机中任一个接入交换机用于连通所述接入交换机对应分组中的计算节点和所述K个核心交换机;
    所述网络拓扑包括所述K个核心交换机、所述M个接入交换机和所述计算集群中的计算节点的连通关系;
    对于所述多条组间路径中的每条组间路径:所述组间路径中还包括所述两个计算节点所属分组分别对应的两个接入交换机。
  18. 如权利要求17所述的系统,其特征在于,
    所述通信规划中还包括多条组内路径,每条组内路径中包括所述N个计算节点中、属于同一个分组的两个计算节点,以及所述M个接入交换机中所述分组对应的接入交换机,所述组内路径用于传输所述组内路径中两个计算节点之间的数据。
  19. 如权利要求14-18中任一项所述的系统,其特征在于,所述多条组间路径中包括第一组间路径,所述第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;
    所述管理节点还用于:根据所述通信规划,分别向所述第一计算节点和所述第二计算节点发送第一信息,所述第一信息指示所述第一组间路径用于所述第一计算节点向所述第二计算节点发送第一数据;
    所述第一计算节点,用于根据所述第一信息,向所述第一核心交换机发送所述第一数据;
    所述第一核心交换机,用于将所述第一数据转发至所述第二计算节点;
    所述第二计算节点,用于根据所述第一信息,接收来自所述第一核心交换机的所述第 一数据。
  20. 如权利要求19所述的系统,其特征在于,所述第一组间路径中还包括所述第一节点对应的第一接入交换机,和所述第二节点对应的第二接入交换机;
    所述第一计算节点,具体用于根据所述第一信息,向所述第一接入交换机发送所述第一数据,以使得所述第一接入交换机向所述第一核心交换机发送所述第一数据;
    所述第二计算节点,具体用于根据所述第一信息,接收所述第二接入交换机转发的、来自所述第一核心交换机的所述第一数据。
  21. 如权利要求18-20中任一项所述的系统,其特征在于,所述多条组内路径中包括第一组内路径,所述第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;
    所述管理节点还用于:根据所述通信规划,分别向所述第一计算节点和所述第三计算节点发送第二信息,所述第二信息指示所述第一组内路径用于所述第一计算节点向所述第三计算节点发送第二数据;
    所述第一计算节点,用于根据所述第二信息,向所述第一接入交换机发送所述第二数据;
    所述第一接入交换机,用于将所述第二数据转发至所述第三计算节点;
    所述第三计算节点,用于根据所述第一信息,接收来自所述第一接入交换机的所述第二数据。
  22. 一种分布式训练装置,其特征在于,包括:
    获取模块,用于获取网络拓扑,所述网络拓扑包括核心交换机和计算集群中的计算节点的连通关系,所述计算集群中包括M个分组,每个分组中包括一个或多个计算节点;
    处理模块,用于根据所述网络拓扑,确定N个计算节点之间的通信规划;
    其中,所述N个计算节点是所述计算集群中用于分布式训练目标模型的计算节点;
    所述通信规划包括多条组间路径,对于所述多条组间路径中的每条组间路径:所述组间路径包括所述N个计算节点中、属于不同分组的两个计算节点,以及用于连通所述两个计算节点的核心交换机,所述组间路径用于传输所述组间路径中两个计算节点之间的数据;
    所述多条组间路径分别传输的数据量符合预设条件;
    M和N均为大于2的整数。
  23. 如权利要求22中所述的装置,其特征在于,所述处理模块在根据所述网络拓扑,确定N个计算节点之间的通信规划时,具体用于:
    根据所述网络拓扑和通信算法,确定所述N个计算节点之间的通信规划;
    其中,所述通信算法用于在所述分布式训练中聚合所述N个计算节点分别执行训练得到的数据,以得到所述目标模型。
  24. 如权利要求22或23所述的装置,其特征在于,所述多条组间路径包括的多个核心交换机中,每个核心交换机包括一个或多个流量端口;
    所述多条组间路径分别传输的数据量符合预设条件,包括:
    所述多条组间路径包括的多个流量端口中,任两个流量端口的流量的差值小于阈值,其中,流量端口的流量与所属组间路径中两个计算节点之间传输数据的数据量关联。
  25. 如权利要求24所述的装置,其特征在于,在每条组间路径包括多级核心交换机时,所述差值小于阈值的任两个流量端口所属的核心交换机属于同一级。
  26. 如权利要求22-25中任一项所述的装置,其特征在于,
    对于所述多条组间路径中的任两条组间路径:
    所述两条组间路径分别包括有不同的核心交换机,或者,所述两条组间路径包含相同的核心交换机,且所述核心交换机在所述两条组间路径中的流量端口不同。
  27. 如权利要求22-26中任一项所述的装置,其特征在于,所述网络拓扑包括所述核心交换机、所述计算集群,以及接入交换机的连通关系;
    对于所述多条组间路径中的每条组间路径:
    所述组间路径中还包括所述两个计算节点分别对应的两个接入交换机,所述组间路径中每个计算节点通过所述计算节点对应的接入交换机与所述核心交换机连通。
  28. 如权利要求27所述的装置,其特征在于,所述通信规划中还包括多条组内路径,每条组内路径中包括所述N个计算节点中、属于同一个分组的两个计算节点,以及所述分组对应的接入交换机,所述组内路径用于传输所述组内路径中两个计算节点之间的数据。
  29. 如权利要求28所述的装置,其特征在于,所述组内路径中两个计算节点之间传输数据的数据量,大于所述组间路径中两个计算节点之间传输数据的数据量。
  30. 如权利要求22-29中任一项所述的装置,其特征在于,所述M个分组分别对应于M个接入交换机;
    针对所述M个接入交换机中每个接入交换机:
    所述接入交换机包括K个第一端口、所述K个第一端口分别对应的K个第二端口;
    所述K个第一端口分别与K个核心交换机连接;
    所述K个第二端口分别与所述接入交换机对应的分组中计算节点的K个端口连接;
    K为大于2的整数。
  31. 如权利要求22-30中任一项所述的装置,其特征在于,所述获取模块还用于:获取训练任务,所述训练任务包括计算节点总数N和通信算法;
    所述处理模块在根据所述网络拓扑,确定N个计算节点之间的通信规划时,具体用于:根据所述网络拓扑、所述计算节点总数N和所述通信算法,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点和所述N个计算节点之间的通信规划。
  32. 如权利要求31所述的装置,其特征在于,所述处理模块在根据所述网络拓扑、所述计算节点总数N和所述通信算法,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点和所述N个计算节点之间的通信规划时,具体用于:
    根据所述网络拓扑和所述计算节点总数N,从所述计算集群中处于空闲状态的多个计算节点中,确定所述N个计算节点;
    将所述N个计算节点中、属于同一个分组的两个计算节点配对,以及在剩余尚未配对的多个计算节点时,将所述尚未配对的多个计算节点配对,以得到的N/2个节点对;
    根据所述通信算法的多轮通信和所述N/2个节点对,确定所述N个计算节点分别在所述多轮通信中的通信规划;其中,对于任一轮通信中的通信规划,所述通信规划中两个计算节点所传输的数据量越大,所述通信规划中包括的组间路径数越小;
    若确定在所述多轮通信中的第i轮通信中,所述N个计算节点的通信规划中包括多条组间路径,且所述多条组间路径分别传输的数据量不符合所述预设条件,则调整所述第i轮通信中所述N个计算节点的通信规划,i为正整数。
  33. 如权利要求22-32中任一项所述的装置,其特征在于,所述多条组间路径中包括第 一组间路径,所述第一组间路径包括第一计算节点、第二计算节点和第一核心交换机;
    所述装置还包括发送模块;
    所述发送模块用于:分别向所述第一计算节点和所述第二计算节点发送第一信息;
    其中,所述第一信息指示所述第一组间路径用于所述第一计算节点向所述第二计算节点发送第一数据。
  34. 如权利要求28-33中任一项所述的装置,其特征在于,所述多条组内路径中包括第一组内路径,所述第一组内路径包括第一计算节点、第三计算节点和第一接入交换机;
    所述装置还包括发送模块;
    所述发送模块用于:分别向所述第一计算节点和所述第三计算节点发送第二信息;
    其中,所述第二信息指示所述第一组内路径用于所述第一计算节点向所述第三计算节点发送第二数据。
  35. 一种计算设备,其特征在于,包括处理器,所述处理器与存储器相连,所述存储器用于存储计算机程序,所述处理器用于执行所述存储器中存储的计算机程序,以使得所述计算设备执行如权利要求1至13中任一项所述的方法。
  36. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机程序或指令,当所述计算机程序或指令被计算设备执行时,实现如权利要求1至13中任一项所述的方法。
PCT/CN2023/078777 2022-06-29 2023-02-28 一种分布式训练方法、系统及装置 WO2024001259A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210756779.4 2022-06-29
CN202210756779.4A CN117395186A (zh) 2022-06-29 2022-06-29 一种分布式训练方法、系统及装置

Publications (1)

Publication Number Publication Date
WO2024001259A1 true WO2024001259A1 (zh) 2024-01-04

Family

ID=89382577

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078777 WO2024001259A1 (zh) 2022-06-29 2023-02-28 一种分布式训练方法、系统及装置

Country Status (2)

Country Link
CN (1) CN117395186A (zh)
WO (1) WO2024001259A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106357537A (zh) * 2016-11-09 2017-01-25 北京工业大学 一种基于sdn多路径传输的链路监控方法
WO2018161793A1 (zh) * 2017-03-08 2018-09-13 华为技术有限公司 一种网格系统及在该网格系统的路径确定方法和控制设备
CN112866059A (zh) * 2021-01-18 2021-05-28 中国信息通信研究院 一种基于人工智能应用的无损网络性能测试方法和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106357537A (zh) * 2016-11-09 2017-01-25 北京工业大学 一种基于sdn多路径传输的链路监控方法
WO2018161793A1 (zh) * 2017-03-08 2018-09-13 华为技术有限公司 一种网格系统及在该网格系统的路径确定方法和控制设备
CN112866059A (zh) * 2021-01-18 2021-05-28 中国信息通信研究院 一种基于人工智能应用的无损网络性能测试方法和装置

Also Published As

Publication number Publication date
CN117395186A (zh) 2024-01-12

Similar Documents

Publication Publication Date Title
CN110851272B (zh) 基于吞噬的粒子群遗传混合算法的云任务调度方法
US10325343B1 (en) Topology aware grouping and provisioning of GPU resources in GPU-as-a-Service platform
CN112118312B (zh) 一种面向边缘服务器的网络突发负载疏散方法
CN115473901B (zh) 一种分布式算力集群智慧调度方法、装置及计算机设备
CN108667657B (zh) 一种面向sdn的基于局部特征信息的虚拟网络映射方法
WO2020134133A1 (zh) 一种资源配置方法、变电站及计算机可读存储介质
CN107992353A (zh) 一种基于最小迁移量的容器动态迁移方法及系统
CN113645146B (zh) 基于新流密度的软件定义网络控制器负载均衡方法及系统
WO2023130656A1 (zh) 一种异构多节点互联拓扑生成方法和存储介质
US20180197110A1 (en) Metrics to Train Machine Learning Predictor for NoC Construction
EP3953813A1 (en) Distributed object placement, replication, and retrieval for cloud-scale storage and data delivery
Wang et al. Adaptive service function chain scheduling in mobile edge computing via deep reinforcement learning
Reza et al. Energy-efficient and high-performance NoC architecture and mapping solution for deep neural networks
Wen et al. Load balancing job assignment for cluster-based cloud computing
Ke et al. Aggregation on the fly: Reducing traffic for big data in the cloud
US20180198687A1 (en) Infrastructure to Apply Machine Learning for NoC Construction
WO2024001259A1 (zh) 一种分布式训练方法、系统及装置
US10419300B2 (en) Cost management against requirements for the generation of a NoC
CN110958192B (zh) 一种基于虚拟交换机的虚拟数据中心资源分配系统及方法
CN116566891A (zh) 时延敏感的服务功能链并行路由优化方法、装置及介质
CN113556242B (zh) 一种基于多处理节点来进行节点间通信的方法和设备
CN113630330B (zh) 软件定义网络多控制器负载均衡方法及系统
CN117951216A (zh) 数据处理的系统及数据处理的方法
CN117763376A (zh) 一种数据聚合方法及装置
WO2023134590A1 (zh) 一种聚合通信方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829451

Country of ref document: EP

Kind code of ref document: A1