WO2023138009A1 - Data transmission system and method, and related device - Google Patents

Data transmission system and method, and related device

Info

Publication number
WO2023138009A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
accelerator
data
accelerators
communication link
Prior art date
Application number
PCT/CN2022/106309
Other languages
English (en)
French (fr)
Inventor
Duan Qihang (端启航)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP22921439.0A (published as EP4293984A1)
Priority to US18/356,475 (published as US20230403232A1)
Publication of WO2023138009A1

Classifications

    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/16 Flow control; Congestion control in connection oriented networks, e.g. frame relay
    • H04L47/10 Flow control; Congestion control
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks, using machine learning or artificial intelligence
    • H04L47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H04L47/17 Interaction among intermediate nodes, e.g. hop by hop

Definitions

  • the present application relates to the field of computer technology, in particular to a data transmission system, method and related equipment.
  • the application discloses a data transmission system, method and related equipment, which can reduce congestion and transmission delay during data transmission and improve data transmission efficiency.
  • the present application provides a data transmission system.
  • the data transmission system includes a plurality of nodes, and each node in the plurality of nodes includes a plurality of accelerators, and the accelerators in each node are connected through a first communication link;
  • the accelerators of the plurality of nodes constitute a plurality of communication planes, each communication plane includes an accelerator in each node, and the accelerators included in any two communication planes are different from each other, and the accelerators in the same communication plane are connected through a second communication link;
  • the first accelerator in the first node is configured to obtain, through the first communication link, first data sent by other accelerators in the first node, where the first data includes data that the other accelerators in the first node need to send to the second accelerator in the second node; the above-mentioned first node and second node are any two nodes in the above-mentioned plurality of nodes, and the first accelerator and the second accelerator are accelerators in the first communication plane;
  • the first accelerator is also used to send the above-mentioned first data to the second accelerator through the above-mentioned second communication link.
  • the data is first sent to the first accelerators belonging to the first node and belonging to the first communication plane through the communication links in the first node, and then the first accelerators respectively send the data to the accelerators in the first communication plane through the second communication links.
  • the above method can reduce the number of times the accelerators between nodes send data to each other, reduce data congestion and transmission delay on the network, and improve data transmission efficiency.
  • the other accelerators in the first node can send the data that needs to be sent to each accelerator in the first communication plane to the first accelerator first, and then the first accelerator sends the received data to each accelerator in the first communication plane through the second communication link.
  • the first node includes four accelerators
  • the first communication plane includes six accelerators.
  • the other three accelerators in the first node send all the data that needs to be sent to the six accelerators in the first communication plane to the first accelerator, and then the first accelerator sends the data required by each of the other five accelerators in the first communication plane to those five accelerators respectively through the second communication link. An illustrative sketch of this step is given below.
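  • As an illustration of this aggregate-then-forward step (a minimal sketch, not taken from the embodiments; all identifiers below are invented for illustration), the following Python snippet models the other accelerators of the first node handing their plane-bound data to the first accelerator, which then performs a single inter-node send per plane member:

```python
# Minimal sketch of the aggregate-then-forward step described above:
# 4 accelerators in the first node, 6 accelerators in the first communication plane.
from collections import defaultdict

NODE_ACCELERATORS = ["A0", "A1", "A2", "A3"]               # accelerators in the first node
PLANE_ACCELERATORS = ["A0", "P1", "P2", "P3", "P4", "P5"]  # first communication plane
FIRST_ACCELERATOR = "A0"                                   # node member that sits on the plane

def intra_node_aggregate(payloads):
    """payloads: {source accelerator -> {plane accelerator -> data}}.
    Every node member hands its plane-bound data to the first accelerator
    over the first communication link (e.g. PCIe/UB)."""
    mailbox = defaultdict(list)
    for src, per_dest in payloads.items():
        for dest, data in per_dest.items():
            mailbox[dest].append((src, data))              # collected on FIRST_ACCELERATOR
    return mailbox

def inter_node_forward(mailbox):
    """The first accelerator forwards each destination's bundle once over the
    second communication link (e.g. RoCE/IB), instead of every node member
    sending its own inter-node message."""
    return {dest: bundle for dest, bundle in mailbox.items() if dest != FIRST_ACCELERATOR}

# Example: each of the other three accelerators has one chunk for every plane member.
payloads = {src: {dest: f"{src}->{dest}" for dest in PLANE_ACCELERATORS}
            for src in NODE_ACCELERATORS if src != FIRST_ACCELERATOR}
sent = inter_node_forward(intra_node_aggregate(payloads))
print(len(sent))   # 5 inter-node sends instead of 3 * 5 = 15
```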
  • the data is first sent to the first accelerators belonging to the first node and belonging to the first communication plane through the communication link in the first node, and then the first accelerator sends the data required by each accelerator in the first communication plane to each accelerator through the second communication link.
  • the above method can reduce the number of times the accelerators between nodes send data to each other, reduce data congestion and transmission delay on the network, and improve data transmission efficiency.
  • the above data transmission system further includes a processor, and the processor is configured to send group information to each accelerator in the plurality of nodes, where the group information includes information about the accelerator included in each communication plane.
  • the above-mentioned first accelerator is further configured to establish a second communication link connection with the above-mentioned second accelerator according to the received group information.
  • After the processor determines the nodes included in the data transmission system used for calculation, it can group the accelerators of each node to determine the accelerators included in each communication plane, and notify the accelerators in each node, so that the accelerators in each node can establish connections according to the group information.
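  • The concrete format of the group information is not specified here; the sketch below assumes one plausible representation (a mapping from plane index to member accelerator identifiers, all names invented for illustration) and shows how an accelerator could derive the plane peers it must connect to over the second communication link:

```python
# Illustrative sketch of the "group information" the processor might distribute:
# for each communication plane, the list of member accelerators (one per node).

def build_group_info(num_nodes: int, accels_per_node: int):
    """Plane p contains the p-th accelerator of every node."""
    return {
        p: [node * accels_per_node + p for node in range(num_nodes)]
        for p in range(accels_per_node)
    }

def plane_peers(group_info, accel_id: int):
    """Peers an accelerator would open second-communication-link connections to."""
    for plane, members in group_info.items():
        if accel_id in members:
            return plane, [a for a in members if a != accel_id]
    raise ValueError("accelerator not in any plane")

group_info = build_group_info(num_nodes=8, accels_per_node=4)   # a FIG. 5-like topology
print(group_info[0])               # [0, 4, 8, 12, 16, 20, 24, 28]
print(plane_peers(group_info, 5))  # (1, [1, 9, 13, 17, 21, 25, 29])
```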
  • the above-mentioned first accelerator is further configured to send the second data to a third accelerator in the first node when the second data needs to be sent to any accelerator in the second communication plane, and the third accelerator is an accelerator located in the second communication plane; the third accelerator is used to send the second data to any of the above-mentioned accelerators in the second communication plane through the second communication link.
  • when the first accelerator and other accelerators in the node have data that needs to be sent to one or more accelerators in the second communication plane, the data is first sent to the third accelerator, and the third accelerator then sends the data required by each accelerator of the second communication plane to each accelerator through the second communication link, thereby reducing the communication scale between nodes, reducing data congestion and transmission delay on the network, and improving data transmission efficiency.
  • the above-mentioned first accelerator is further configured to receive the third data sent by each accelerator in the first communication plane through the second communication link.
  • the data sent to the first accelerator by an accelerator in the first communication plane includes data that needs to be sent to the first accelerator by multiple accelerators in the node where the accelerator is located.
  • the above-mentioned data transmission system is used for training an artificial intelligence (AI) model
  • the above-mentioned first data, second data, and third data are intermediate data generated during the training process of the AI model.
  • the intermediate data generated during the training of the artificial intelligence model is transmitted through the above method, which can improve the efficiency of model training.
  • the above-mentioned first communication link includes a peripheral component interconnect express (PCIe) bus or a unified bus (UB);
  • the above-mentioned second communication link is a link supporting the transmission control protocol (TCP), the remote direct memory access over converged ethernet (RoCE) protocol based on Ethernet, or the InfiniBand (IB) protocol.
  • the above-mentioned multiple nodes are deployed in one or more physical machines, and the accelerators in the above-mentioned multiple nodes are graphics processing units (GPU), neural-network processing units (NPU), tensor processing units (TPU) or deep learning processing units (DPU).
  • the present application provides a data transmission method, which is applied to a data transmission system comprising a plurality of nodes.
  • Each node in the plurality of nodes includes a plurality of accelerators, and the accelerators in each node are connected through a first communication link;
  • the accelerators among the plurality of nodes constitute a plurality of communication planes, and each communication plane includes an accelerator in each node, and the accelerators included in any two communication planes are different from each other, and the accelerators included in the same communication plane are connected through a second communication link;
  • the data transmission method includes:
  • the first accelerator in the first node obtains the first data sent by other accelerators in the first node through the first communication link, and the first data includes the data that other accelerators in the first node need to send to the second accelerator in the second node; then the first accelerator sends the first data to the second accelerator through the second communication link.
  • the first node and the second node are any two nodes in the plurality of nodes, and the first accelerator and the second accelerator are accelerators in the first communication plane.
  • the other accelerators in the first node can first send all the data that needs to be sent to each accelerator in the first communication plane to the first accelerator, and then the first accelerator sends the data required by each accelerator in the received data to each accelerator in the first communication plane through the second communication link.
  • the first node includes four accelerators
  • the first communication plane includes six accelerators.
  • the other three accelerators in the first node send all the data that needs to be sent to the six accelerators in the first communication plane to the first accelerator, and then the first accelerator sends the data required by each of the other five accelerators in the first communication plane to those five accelerators respectively through the second communication link.
  • the first accelerator receives group information sent by the processor, and establishes a connection based on the second communication link with the second accelerator according to the group information, where the group information includes information about accelerators included in each communication plane.
  • the method further includes: when the first accelerator has second data to be sent to any accelerator in the second communication plane, sending the second data to a third accelerator in the first node, where the third accelerator is an accelerator located in the second communication plane; so that the third accelerator sends the second data to any accelerator in the second communication plane through the second communication link.
  • the first accelerator when the first accelerator has data that needs to be sent to multiple accelerators in the second communication plane, it first sends it to the third accelerator, and the third accelerator then sends the data required by each accelerator of the second communication plane to each accelerator through the second communication link.
  • the above-mentioned first accelerator is further configured to receive the third data sent by each accelerator in the first communication plane through the second communication link.
  • the data sent to the first accelerator by an accelerator in the first communication plane includes data that needs to be sent to the first accelerator by multiple accelerators in the node where the accelerator is located.
  • the above-mentioned data transmission system is used for training an AI model, and the above-mentioned first data, second data, and third data are intermediate data generated during the training process of the AI model.
  • in the training process of the artificial intelligence model, it is necessary to use multiple accelerators in multiple nodes to process data, and a large amount of data transmission is involved between different accelerators.
  • the intermediate data generated during the training of the artificial intelligence model is transmitted through the above method, which can improve the efficiency of model training.
  • the above-mentioned first communication link includes a PCIe bus or a UB; the above-mentioned second communication link is a link supporting the TCP, RoCE protocol, or IB protocol.
  • the above-mentioned multiple nodes are deployed in one or multiple physical machines, and the accelerators in the above-mentioned multiple nodes are GPUs, NPUs, TPUs or DPUs.
  • the present application provides a board, which includes a plurality of accelerators for executing the method described in the second aspect or any possible implementation manner of the second aspect.
  • the present application provides a computing device, including a processor, a memory, and multiple accelerators.
  • the memory stores computer instructions.
  • when the processor executes the computer instructions, the computing device invokes one or more accelerators to execute the method described in the second aspect or any possible implementation of the second aspect.
  • the present application provides a computer-readable storage medium, which stores a computer program.
  • when the computer program is run by an accelerator, the accelerator executes the method described in the second aspect or any possible implementation manner of the second aspect.
  • FIG. 1 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a node cluster provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a data transmission system provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a data transmission process provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of another data transmission process provided by the embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a data transmission method provided in an embodiment of the present application.
  • Fig. 7 is a schematic flow chart of another data transmission method provided by the embodiment of the present application.
  • Fig. 8 is a schematic diagram of a matrix calculation provided by an embodiment of the present application.
  • Fig. 9 is a schematic diagram of another matrix calculation provided by the embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a data transmission device provided by an embodiment of the present application.
  • Fig. 11 is a schematic structural diagram of a board provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of another computing device provided by an embodiment of the present application.
  • "at least one” means one or more, and “multiple” means two or more.
  • “And/or” describes the association relationship of associated objects, which means that there may be three kinds of relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and B exists alone, where A and B can be singular or plural.
  • the character “/” generally indicates that the contextual objects are an "or” relationship. Any embodiment or design described herein as “exemplary” or “for example” is not to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present related concepts in a concrete manner.
  • AI chip: a module used to process a large number of computing tasks in artificial intelligence applications; one computing device can have one or more AI chips.
  • Network interface controller (NIC): the NIC of a computing device is used to connect one computing device to another computing device, or to establish a connection between a computing device and a network device such as a switch.
  • PCIe switch (peripheral component interconnect express switch, PCIe Switch): the PCIe Switch chip is a module used to expand a PCIe link.
  • a PCIe link uses an end-to-end connection mode, so only one device can be connected at each end of a PCIe link. A PCIe Switch chip can therefore be used to expand the PCIe link, so that multiple devices can be connected at one end of the PCIe link. The PCIe Switch chip and these devices are connected through the PCIe bus.
  • the internal structure of a computing device is first introduced below.
  • FIG. 1 is a schematic structural diagram of a computing device provided in an embodiment of the present application.
  • the computing device includes at least one central processing unit (central processing unit, CPU) and at least one node, and each node includes multiple accelerators.
  • the central processing unit may also be referred to as a host CPU (Host CPU).
  • the central processing unit is connected to the accelerators in each node through the bus, or connected to multiple accelerators through the bus and the switch chip.
  • FIG. 1 takes as an example a computing device that includes two CPUs and two nodes, where each node includes 4 accelerators.
  • a Host CPU is connected to 4 accelerators in a node through a PCIe bus and a PCIe switch chip, and 4 accelerators in a node are connected through a PCIe bus.
  • the computing device also includes devices such as memory and network card corresponding to each accelerator.
  • the aforementioned accelerator can be any one of AI chips such as a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), or a deep learning processing unit (DPU).
  • FIG. 2 is a schematic diagram of a node cluster, each node in the node cluster includes multiple accelerators, and different nodes are connected through a communication network.
  • the multiple nodes in the node cluster may be nodes in one computing device as shown in FIG. 1 , or nodes in different computing devices.
  • the number of nodes in different computing devices may be the same or different.
  • each accelerator will generate the data required by other accelerators, so the accelerator needs to send the data to other accelerators that need it.
  • when the destination accelerator is located in the same node, the accelerator can send the data to that accelerator through the internal high-speed link in the node.
  • when multiple accelerators need to send data to a target accelerator in another node, each of them has to send the data to that accelerator through the communication network.
  • each of the four accelerators in node N 0 generates the data required by accelerator 0 in node N 2 , and the four accelerators in node N 0 need to send data to accelerator 0 in node N 2 respectively through the communication network.
  • the communication scale in the communication network will be relatively large.
  • the four accelerators in node N0 each produce the data required by the four accelerators in node N1 and the four accelerators in node N2 ; the four accelerators in node N1 also generate the data required by the four accelerators in node N0 and the four accelerators in node N2 .
  • when the communication scale in the communication network is large, it is easy to cause network congestion and reduce the efficiency of data transmission; moreover, the communication scale of the node cluster increases with the number of nodes, which is not conducive to the expansion of the cluster.
  • An embodiment of the present application provides a data transmission method, which is applied to a data transmission system comprising multiple nodes as shown in FIG. 3 .
  • Each of the multiple nodes includes at least two accelerators, and the multiple accelerators in each node are connected through a first communication link; the accelerators of the multiple nodes constitute multiple communication planes, each communication plane includes one accelerator in each node, and the accelerators included in any two communication planes are different from each other. Accelerators in the same communication plane are connected through a second communication link, as shown in FIG. 3.
  • the aforementioned multiple nodes may be nodes in the same computing device, or may be nodes in multiple computing devices. Wherein, the structure of the computing device is as the computing device described above in FIG. 1 .
  • multiple accelerators in the same node are connected through the first communication link, and can perform data interaction through the first communication link; since accelerators in different nodes of the same computing device are not connected through the first communication link, accelerators between different nodes of the same computing device need to perform data interaction through the second communication link.
  • accelerators between nodes of different computing devices are connected through a second communication link. It should be noted that when the above multiple nodes are located in multiple computing devices, the number of nodes included in any two computing devices may be the same or different. In FIG. 3 , four accelerators in each node are connected through the first communication link as an example. It should be understood that the number of accelerators in each node connected through the first communication link may also be other numbers.
  • the first communication link comprises a bus, such as a PCIe bus or a unified bus (Ubus or UB).
  • the first communication link can also be a communication network comprising a bus and a switch chip, such as a PCIe bus and a PCIe Switch chip.
  • the second communication link may be a link supporting TCP, RoCE protocol or IB protocol, such as Ethernet or IB network.
  • each accelerator corresponds to a network card, and accelerators of different nodes are connected through network devices such as network cards and switches.
  • assuming the data transmission system includes n nodes N 0 to N n-1 and each node includes m accelerators, the data transmission system includes m*n accelerators in total, where m and n are both integers greater than 1.
  • an accelerator in each node is connected with an accelerator in each other node through a second communication link to form a communication plane connected through the second communication link, and each communication plane includes an accelerator in a node, and any two communication planes include different accelerators.
  • the above-mentioned data transmission system including n nodes, each node including m accelerators includes m communication planes, and each communication plane includes n accelerators.
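  • Under the accelerator numbering used later in this description (A 0 to A n*m-1 ), the node and communication plane of an accelerator follow from simple index arithmetic; the short sketch below (helper names are illustrative, not part of the embodiments) checks that the m planes partition the m*n accelerators:

```python
# Index arithmetic sketch: accelerator A_x sits in node N_{x // m} and in
# communication plane x % m.

def node_of(x: int, m: int) -> int:
    return x // m          # node index N_k

def plane_of(x: int, m: int) -> int:
    return x % m           # communication plane index

n, m = 8, 4                # e.g. 8 nodes of 4 accelerators = 32 accelerators
planes = {}
for x in range(n * m):
    planes.setdefault(plane_of(x, m), []).append(x)
assert all(len(members) == n for members in planes.values())     # m planes, n members each
assert len({a for ms in planes.values() for a in ms}) == n * m   # planes are disjoint
```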
  • each accelerator will generate data that needs to be sent to other accelerators.
  • the one or more destination accelerators may be located in the same node as the source accelerator or in a different node; when there are multiple destination accelerators, some of them may be located in the same node as the source accelerator while others are located in different nodes.
  • the data sent by the source accelerator to each destination accelerator may be the same, the data sent to some destination accelerators may be the same, or the data sent to each accelerator may be different, which is not specifically limited in this embodiment of the present application.
  • the accelerators in each node exchange data through the first communication link.
  • Take the first accelerator in the first node in the data transmission system as an example, where the first node is any node in the data transmission system, the first accelerator is any accelerator in the first node, and the first accelerator is located on the first communication plane of the data transmission system.
  • when an accelerator in the first node needs to send data to an accelerator in the first communication plane, the data is first sent through the first communication link to the first accelerator, that is, the accelerator of the first node located in the first communication plane.
  • similarly, when the first accelerator and other accelerators in the first node need to send data to an accelerator in the second communication plane, the first accelerator and the other accelerators all send the data to the accelerator of the first node located in the second communication plane.
  • the second communication plane is any communication plane in the data transmission system.
  • the accelerators in each node perform the above-mentioned intra-node data interaction operations. After the data interaction in each node is completed, each accelerator stores the data required by the accelerators in the communication plane where it is located. Then, the accelerators located in the same communication plane exchange data through the second communication link, and finally each accelerator obtains the data that every accelerator in the data transmission system needs to send to it.
  • the data sent by each accelerator includes indication information indicating the destination accelerator corresponding to the data.
  • the indication information may be the identification or address of the destination accelerator.
  • for example, if accelerator 1 in node N 0 has data to be sent to accelerator 0 in node N 1 , accelerator 1 in node N 0 sends the data to accelerator 0 in node N 0 , and the data includes the address of accelerator 0 in node N 1 .
  • the numbers of the m accelerators in node 0 are A 0 ⁇ A m-1
  • the numbers of the m accelerators in node 1 are A m ⁇ A 2m-1
  • the numbers of the m accelerators in node N k are A km to A (k+1)m-1 , where k is an integer less than n.
  • accelerators A 0 , A m , A 2m , ..., A km , ..., A (n-1)m are located in the same communication plane; accelerators A 1 , A m+1 , A 2m+1 , ..., A km+1 , ..., A (n-1)m+1 are located in the same communication plane; and so on, accelerators A m-1 , A 2m-1 , A 3m-1 , ..., A (k+1)m-1 , ..., A nm-1 are located in the same communication plane.
  • during the intra-node data interaction, accelerator A x in node N k sends the data destined for accelerators A 0 , A m , A 2m , ..., A km , ..., A (n-1)m to accelerator A km , sends the data destined for accelerators A 1 , A m+1 , A 2m+1 , ..., A km+1 , ..., A (n-1)m+1 to accelerator A km+1 , and so on, and sends the data destined for accelerators A m-1 , A 2m-1 , A 3m-1 , ..., A (k+1)m-1 , ..., A nm-1 to accelerator A (k+1)m-1 .
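  • The intra-node routing rule above can be summarized as: data whose destination lies in communication plane p is first handed to the node's own plane-p accelerator. Below is a small sketch of that rule (helper names are invented for illustration), checked against the earlier example of accelerator 1 in node N 0 relaying data destined for accelerator 0 in node N 1 :

```python
# Within node N_k, data whose destination lies in communication plane p is
# first handed, over the first communication link, to accelerator A_{k*m + p}.

def intra_node_forward_target(src: int, dest: int, m: int) -> int:
    """Return the accelerator in src's node that will relay data to dest."""
    k = src // m           # source node index
    p = dest % m           # destination's communication plane
    return k * m + p       # the plane-p accelerator inside node N_k

m = 4
# A_1 (node N_0) has data for A_4 (node N_1, plane 0): it hands it to A_0 first.
assert intra_node_forward_target(1, 4, m) == 0
# A_6 (node N_1) has data for A_8 (node N_2, plane 0): it hands it to A_4 first.
assert intra_node_forward_target(6, 8, m) == 4
```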
  • the accelerator A x will receive the data sent by other accelerators in the node N k , and the data received by the accelerator A x is the data that needs to be sent to the accelerator on the same communication plane as A x .
  • after the data interaction in node N k is completed, accelerator A km holds the data that needs to be sent to A 0 , A m , A 2m , ..., A km , ..., A (n-1)m , and accelerator A km+1 holds the data that needs to be sent to A 1 , A m+1 , A 2m+1 , ..., A km+1 , ..., A (n-1)m+1 , and so on.
  • similarly, after the data interaction in node N 0 is completed, accelerator A 0 holds the data that needs to be sent to A 0 , A m , A 2m , ..., A km , ..., A (n-1)m , and accelerator A 1 holds the data that needs to be sent to A 1 , A m+1 , A 2m+1 , ..., A km+1 , ..., A (n-1)m+1 , and so on.
  • then, each accelerator performs data exchange between nodes. Each accelerator sends the data required by each accelerator on the same communication plane to that accelerator through the second communication link, completing the data interaction between accelerators on the same communication plane. Finally, each accelerator obtains the data that every accelerator in the data transmission system needs to send to it. For example, accelerator A 0 sends the data that needs to be sent to A m to A m through the second communication link, and sends the data that needs to be sent to A km to A km through the second communication link, and so on. Finally, accelerator A 0 holds the data that each accelerator needs to send to A 0 , and A km holds the data that each accelerator needs to send to A km .
  • FIG. 4 is a schematic diagram of data transmission provided by an embodiment of the present application.
  • FIG. 4 only shows the connection relationship of GPUs in a node and the connection relationship of GPUs in one communication plane.
  • N 0 includes four GPUs of G0, G1, G2, and G3, and N 1 includes four GPUs of G4, G5, G6, and G7.
  • the data transmission system shown in FIG. 4 includes communication planes L0, L1, L2 and L3, where communication plane L0 includes G0 and G4, communication plane L1 includes G1 and G5, communication plane L2 includes G2 and G6, and communication plane L3 includes G3 and G7. Each GPU generates data required by every GPU: the data generated by G0 and required by G0 to G7 are denoted (0, 0), (0, 1), ..., (0, 7); the data generated by G1 and required by G0 to G7 are denoted (1, 0), (1, 1), ..., (1, 7); and so on, the data generated by G7 and required by G0 to G7 are denoted (7, 0), (7, 1), ..., (7, 7). The GPUs in N 0 and N 1 first perform data interaction within the node. For example, in N 0 , the GPUs other than G0 send the data required by the GPUs in communication plane L0 to G0, and G0 sends the data required by the GPUs in the other communication planes to the corresponding GPUs in N 0 .
  • Each GPU in N 0 and N 1 executes the above-mentioned data interaction in the node.
  • the data in each GPU is the data required by each GPU in the same communication plane as the GPU.
  • the data in G0 is the data required by G0 and G4, including data (0,0), (1,0), (2,0), (3,0), (0,4), (1,4), (2,4) and (3,4)
  • the data in G1 is the data required by G1 and G5, including data (0,1), (1,1), (2,1), (3,1), (0 , 5), (1, 5), (2, 5) and (3, 5);
  • the data in G4 is the data required by G0 and G4, including data (4, 0), (5, 0), (6, 0), (7, 0), (4, 4), (5, 4), (6, 4) and (7, 4);
  • the data in G5 is the data required by G1 and G5, including data (4, 1), (5, 1), (6, 1), (7, 1), (4, 5), (5, 5), (6, 5) and (7, 5); and so on.
  • each GPU located in the same communication plane performs data interaction between nodes through the second communication link, and each GPU sends the data required by other GPUs in the same communication plane to the corresponding GPU through the second communication link respectively.
  • G0 sends data (0, 4), (1, 4), (2, 4) and (3, 4) to G4
  • G4 sends data (4, 0), (5, 0), (6, 0) and (7, 0) to G0
  • G1 sends data (0, 5), (1, 5), (2, 5) and (3, 5) to G5
  • G5 sends data (4, 1), (5, 1), (6, 1) and (7, 1) to G1, and so on.
  • after the data interaction between nodes is completed, the data in each GPU is the data that each GPU in the data transmission system needs to send to that GPU.
  • the data in G0 are (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0) and (7,0)
  • the data in G5 are (0,5), (1,5), (2,5), (3,5), (4, 5), (5,5), (6,5) and (7,5).
  • FIG. 5 is a schematic diagram of another data transmission provided by the embodiment of the present application.
  • the data transmission system includes 4 computing devices, each computing device includes 2 nodes, and each node includes 4 GPUs, that is, the data transmission system includes a total of 32 GPUs, respectively G0-G31.
  • the data transmission system includes four communication planes. As shown in FIG. 5, the four communication planes are L0, L1, L2 and L3: communication plane L0 includes G0, G4, G8, G12, G16, G20, G24 and G28; communication plane L1 includes G1, G5, G9, G13, G17, G21, G25 and G29; communication plane L2 includes G2, G6, G10, G14, G18, G22, G26 and G30; and communication plane L3 includes G3, G7, G11, G15, G19, G23, G27 and G31. Each communication plane includes 8 GPUs, and FIG. 5 only shows the connection of communication plane L0 through the second communication link.
  • the data exchange between the GPUs in the node is first performed.
  • G0 will respectively receive the data sent by G1-G3 in the node N0 to each GPU in the communication plane L0.
  • at the same time, G0 will send the data that needs to be sent to each GPU in communication plane L1 to G1, send the data that needs to be sent to each GPU in communication plane L2 to G2, and send the data that needs to be sent to each GPU in communication plane L3 to G3.
  • Data interaction is also performed between GPUs in other nodes according to the above method.
  • G21 in the node N5 will receive the data sent by G20, G22 and G23 to each GPU in the communication plane L1, and at the same time, G21 will send the data that needs to be sent to each GPU in the communication plane L0 to G20.
  • the data in each GPU is the data required by each GPU located in the same communication plane as the GPU.
  • the data in G0 is the data required by the 8 GPUs in the communication plane L0
  • the data in G1 is the data required by the 8 GPUs in the communication plane L1
  • the data in G4 is the data required by the 8 GPUs in the communication plane L0
  • the data in G6 is the data required by the 8 GPUs in the communication plane L2.
  • each GPU located on the same communication plane performs data interaction between nodes through the second communication link.
  • Each GPU sends data required by other GPUs in the same communication plane to other GPUs through the second communication link.
  • G0 sends data (0, 4), (1, 4), (2, 4) and (3, 4) to G4, sends data (0, 8), (1, 8), (2, 8) and (3, 8) to G8, and sends data (0, 12), (1, 12), (2, 12) and (3, 12) to G12, etc.
  • G4 sends data (4, 0), (5, 0), (6, 0) and (7, 0) to G0, sends data (4, 8), (5, 8), (6, 8) and (7, 8) to G8, and sends data (4, 12), (5, 12), (6, 12) and (7, 12) to G12, etc.
  • G1 sends the data (0, 5), (1, 5), (2, 5) and (3, 5) to G5, and G5 sends the data (4, 1), (5, 1), (6, 1) and (7, 1) to G1, and so on.
  • after the data interaction between nodes is completed, the data in each GPU is the data that each GPU in the data transmission system needs to send to that GPU.
  • the data in G0 are (0,0), (1,0), (2,0),..., (31,0);
  • the data in G1 are (0,1), (1,1), (2,1),..., (31,1).
  • FIG. 6 is a schematic flowchart of a data transmission method provided by an embodiment of the present application.
  • the data transmission method is applied to the data transmission system shown in FIG. 3 , and the data transmission method includes S601 to S602.
  • S601: the first accelerator acquires first data sent by other accelerators in the first node through the first communication link.
  • the first data includes data that other accelerators in the first node each need to send to the second accelerator in the second node; the first node and the second node are any two nodes among the plurality of nodes in the data transmission system, and the first accelerator and the second accelerator are accelerators in the first communication plane.
  • each accelerator will generate data that needs to be sent to other accelerators.
  • One or more accelerators in the first node generate data that needs to be sent to the second accelerator in the second node; the one or more accelerators in the first node send the data that needs to be sent to the second accelerator respectively to the first accelerator on the same communication plane as the second accelerator through the first communication link.
  • the data sent by each accelerator includes instruction information for sending the data to the second accelerator, such as the identity of the second accelerator or the address of the second accelerator; the first node and the second node can be two nodes in the same computing device, or two nodes in different computing devices.
  • S602: the first accelerator sends the first data to the second accelerator through the second communication link.
  • after the first accelerator receives the data that needs to be sent to the second accelerator from each accelerator in the first node, it obtains the first data, and then sends the first data to the second accelerator through the second communication link.
  • the first node can be N 0 in FIG. 4
  • the second node can be N 1 in FIG. 4
  • the first accelerator can be G0 in FIG. 4
  • the second accelerator can be G4 in FIG. 4 .
  • Specific data interaction operations will not be repeated here.
  • the above description takes, as an example, the case where an accelerator in one of two nodes receives data sent by the other accelerators in the same node and then sends it to an accelerator in the other node through the second communication link.
  • the data transmission method provided by the present application can be used for each accelerator in each node in the data transmission system shown in FIG. 3 .
  • Each accelerator in each node can first perform data interaction in the node through the first communication link, so that the data obtained by any accelerator is the data required by each accelerator located in the same communication plane as the accelerator; then the accelerators in each node interact with each accelerator in the same communication plane through the second communication link, and finally each accelerator obtains the data it needs.
  • FIG. 7 is a schematic flowchart of another data transmission method provided by an embodiment of the present application. The data transmission method is applied to the data transmission system shown in FIG. 3 , and the data transmission method includes S701 to S703.
  • the processor sends group information to each accelerator in each managed node, where the group information includes information about accelerators included in each communication plane.
  • the above data transmission system includes n nodes and at least one Host CPU, wherein each node includes m accelerators, and one Host CPU manages at least one node.
  • the foregoing group information includes information about accelerators included in each communication plane in the data transmission system.
  • the accelerator information may be the identifier or address of the accelerator.
  • the data transmission system shown in FIG. 5 above includes 8 nodes, each node includes 4 GPUs, the data transmission system includes 4 communication planes, each communication plane includes 8 GPUs, and the group information includes the information of the 8 GPUs included in each communication plane.
  • the accelerators in each node establish connections based on the second communication link with other accelerators on the same communication plane according to the above group information.
  • the accelerators in each node perform data interaction in the node, so that the data obtained by one accelerator is the data required by each accelerator on the same communication plane as the accelerator.
  • each accelerator will generate data that needs to be sent to other accelerators.
  • one or more accelerators in the first node generate first data that needs to be sent to the second accelerator in the second node, and the one or more accelerators in the first node determine that both the first accelerator in the first node and the second accelerator in the second node are located on the first communication plane according to the above group information, and each of the one or more accelerators in the first node first sends the data that needs to be sent to the second accelerator to the first accelerator through the first communication link.
  • the data sent by each accelerator in the first node includes instruction information for sending the data to the second accelerator, such as the identifier of the second accelerator or the address of the second accelerator;
  • the first node and the second node are any two different nodes in the data transmission system, and the first node and the second node can be two nodes in the same computing device, or two nodes in different computing devices.
  • when accelerators in the first node have data that needs to be sent to multiple accelerators in the first communication plane, the other accelerators in the first node can first send all the data that needs to be sent to each accelerator in the first communication plane to the first accelerator.
  • the first node includes four accelerators
  • the first communication plane includes six accelerators
  • the other three accelerators in the first node send the data that needs to be sent to the six accelerators in the first communication plane to the first accelerator.
  • the first accelerator in the first node generates second data that needs to be sent to the fourth accelerator in the second node, and the first accelerator determines that the third accelerator in the first node and the fourth accelerator are located on the same communication plane according to the above group information, and then the first accelerator sends the second data to the third accelerator, so that the third accelerator sends the second data to the fourth accelerator through the second communication link.
  • the second data includes indication information indicating to send the second data to the fourth accelerator.
  • after the accelerators in each node generate data that needs to be sent to other accelerators, the accelerators in each node perform data interaction through the first communication link within the node, and finally the data held by an accelerator is the data required by each accelerator located in the same communication plane as that accelerator.
  • for the data interaction method of the accelerators in a node, reference may be made to the intra-node data interaction operations in the above-mentioned embodiments corresponding to FIG. 3 to FIG. 5, which will not be repeated here.
  • Accelerators in the same communication plane respectively perform data interaction between nodes through the second communication link, and obtain the data required by each accelerator.
  • after the first accelerator receives, from the other accelerators in the first node, the data that needs to be sent to the second accelerator, it sends, according to the indication information in the received data, both the data that the other accelerators need to send to the second accelerator and the data that the first accelerator itself needs to send to the second accelerator to the second accelerator through the second communication link. Similarly, the third accelerator sends the second data to the fourth accelerator through the second communication link. The first accelerator also receives third data sent by other accelerators in the first communication plane through the second communication link, where the third data includes data sent to the first accelerator by the accelerators in the node to which each accelerator in the first communication plane belongs.
  • after the intra-node data interaction, the memory of any accelerator stores the data required by each accelerator in the communication plane where that accelerator is located; then the accelerators in the same communication plane perform data interaction between nodes through the second communication link, and finally each accelerator obtains the data it needs, that is, the data that each accelerator in the data transmission system needs to send to it. For the related operations of data interaction between nodes, reference may be made to the embodiments corresponding to FIG. 3 to FIG. 5, which will not be repeated here.
  • the accelerators in each node first perform data interaction through the communication link in the node, and after the accelerators in each node perform data interaction through the first communication link in the node, the data in any accelerator is the data required by each accelerator located in the same communication plane as the accelerator, and the accelerators located in the same communication plane then exchange data through the second communication link, and finally realize that each accelerator obtains the data it needs.
  • the internal high-speed link in the node can be fully utilized to realize data aggregation on the same communication plane, and then data interaction between the accelerators of each node can be performed through the second communication link, which can reduce the number of times the accelerators between nodes send data to each other, that is, reduce the communication scale between nodes, thereby reducing data congestion and transmission delay on the network, improving data transmission efficiency, and facilitating system expansion to enhance computing power.
  • for example, when G0-G3 in node N 0 have data that needs to be sent to G4 in node N 1 , G1-G3 first send the data to be sent to G4 to G0 through the first communication link in the node, and then G0 sends the data that G0-G3 need to send to G4 to G4 through the second communication link. There is no need for G0-G3 to each send their data to G4 through the second communication link, which reduces the number of inter-node communications to one quarter of the original.
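  • More generally, the reduction can be estimated with a back-of-the-envelope count (an illustration derived from the description above, not a figure stated in the text): for n nodes of m accelerators each, a full inter-node exchange needs m*m messages per ordered node pair without aggregation, but only m per ordered node pair after intra-node aggregation:

```python
# Message-count sketch; the helper name is illustrative.
def inter_node_messages(n: int, m: int, aggregated: bool) -> int:
    per_pair = m if aggregated else m * m
    return n * (n - 1) * per_pair

print(inter_node_messages(n=2, m=4, aggregated=False))  # 32
print(inter_node_messages(n=2, m=4, aggregated=True))   # 8  -> 1/4 of the messages
print(inter_node_messages(n=8, m=4, aggregated=True))   # 224 for the FIG. 5 topology
```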
  • FIG. 8 is a schematic diagram of a matrix operation provided by an embodiment of the present application.
  • B is an a*b matrix, C is a b*c matrix, and D is a c*d matrix.
  • E = B x C, so E is an a*c matrix.
  • matrix C is an embedding table.
  • Matrix C needs to be deployed to multiple GPUs, for example, deployed to 8 GPUs in G0 to G7.
  • the input data of each of the 8 GPUs is matrix B and a sub-matrix of matrix C, that is, each GPU completes part of the matrix multiplication B*C.
  • after each GPU completes its matrix multiplication, eight a*c1 matrices E0 to E7 are obtained, one in each GPU. The GPUs then need to cooperate to complete the matrix multiplication with matrix D. Since matrix D is a c*d matrix, matrices E0 to E7 cannot directly be multiplied by matrix D; the matrix in each GPU first needs to be converted into a matrix with c columns through the above-mentioned method of data interaction between GPUs in each node and data interaction between GPUs on the same communication plane between nodes.
  • the matrix E obtained by combining matrices E0 to E7 is an a*c matrix, so matrix E can be divided by rows into 8 sub-matrices F0 to F7.
  • matrix E0 in G0 is equivalent to columns 1 to 100 of matrix E
  • matrix E1 in G1 is equivalent to columns 101 to 200 of matrix E
  • matrix E2 in G2 is equivalent to columns 201 to 300 of matrix E
  • matrix E7 in G7 is equivalent to columns 701 to 800 of matrix E.
  • if the matrix F0 finally obtained in G0 is the data in rows 1 to 25 of matrix E, then G0 needs to receive the data in rows 1 to 25 of each matrix in G1 to G7; if the matrix F1 finally obtained in G1 is the data in rows 26 to 50 of matrix E, then G1 needs to receive the data in rows 26 to 50 of each matrix in G0 and G2 to G7; if the matrix F2 finally obtained in G2 is the data in rows 51 to 75 of matrix E, then G2 needs to receive the data in rows 51 to 75 of each matrix in G0, G1 and G3 to G7; and so on, if the matrix F7 finally obtained in G7 is the data in rows 176 to 200 of matrix E, then G7 needs to receive the data in rows 176 to 200 of each matrix in G0 to G6.
  • the data in the 1st to 25th rows in G0 is the data needed by G0 itself, which is the data sent by G0 to G0, which is equivalent to the data (0, 0) in the embodiment corresponding to the above-mentioned figure 4;
  • the data in the 26th to 50th rows in G0 is the data required by G1, which is the data sent by G0 to G1, which is equivalent to the data (0, 1) in the embodiment corresponding to the above-mentioned figure 4;
  • the data in rows 176 to 200 in G0 is the data required by G7, that is, the data sent by G0 to G7, which is equivalent to the data (0, 7) in the embodiment corresponding to FIG. 4. Therefore, G0 includes data (0, 0), (0, 1), ..., (0, 7) sent to G0 to G7.
  • the data in the 1st to 25th rows in G1 is the data required by G0, which is the data sent by G1 to G0, which is equivalent to the data (1, 0) in the embodiment corresponding to Figure 4 above;
  • the data in the 26th to 50th rows in G1 is the data required by G1 itself, and is the data sent by G1 to G1, which is equivalent to the data (1, 1) in the embodiment corresponding to Figure 4 above;
  • the data in rows 176 to 200 in G1 is the data required by G7, that is, the data sent by G1 to G7, which is equivalent to the data (1, 7) in the embodiment corresponding to FIG. 4. Therefore, G1 includes data (1, 0), (1, 1), ..., (1, 7) sent to G0 to G7.
  • G2 includes data (2, 0), (2, 1), ..., (2, 7) sent to G0 ⁇ G7;
  • G3 includes data (3, 0), (3, 1), ..., (3, 7) sent to G0 ⁇ G7;
  • G4 includes data (4, 0), (4, 1), ..., (4, 7) sent to G0 ⁇ G7;
  • G5 includes data (5, 0), (5, 1), ..., (5, 7) sent to G0 to G7;
  • G6 includes data (6, 0), (6, 1), ..., (6, 7) sent to G0 ⁇ G7;
  • G7 includes data (7, 0), (7, 1), ..., (7, 7) sent to G0 ⁇ G7.
  • the data that one GPU needs to send to another GPU is the data of 25 rows and 100 columns.
  • any GPU in G0 ⁇ G7 must send 25 rows and 100 columns of data to each GPU, and each GPU must also receive 25 rows and 100 columns of data sent by other GPUs, so as to be able to convert E0 in G0 to F0, convert E1 in G1 to F1, and so on.
  • finally, through the data interaction of the GPUs within each node and the data interaction of the GPUs on the same communication plane between nodes, the data in the eight GPUs is converted into a1*c matrices F0 to F7 (where a1 = a/8), and the matrix in each GPU is multiplied by matrix D to obtain the a1*d matrices Z0 to Z7 respectively.
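  • The whole FIG. 8/FIG. 9 pipeline can be checked numerically. The sketch below uses plain numpy arrays in place of GPUs, with the concrete sizes a = 200 and c = 800 implied above and arbitrary illustrative values for b and d, and verifies that the column-sharded multiplication, the all-to-all row redistribution, and the final multiplication by D reproduce (B x C) x D:

```python
import numpy as np

a, b, c, d, g = 200, 64, 800, 32, 8      # b and d are arbitrary illustrative sizes
rng = np.random.default_rng(0)
B = rng.standard_normal((a, b))
C = rng.standard_normal((b, c))          # the sharded "embedding table"
D = rng.standard_normal((c, d))

# Each "GPU" i holds B and a column block of C and computes its column block
# E_i = B @ C_i of shape (a, c/g) = (200, 100).
E_blocks = [B @ C[:, i * (c // g):(i + 1) * (c // g)] for i in range(g)]

# All-to-all: GPU j collects rows j*(a/g) .. (j+1)*(a/g)-1 of every E_i and
# concatenates them along the columns, giving F_j of shape (a/g, c) = (25, 800).
rows = a // g
F_blocks = [np.hstack([E[j * rows:(j + 1) * rows, :] for E in E_blocks])
            for j in range(g)]

# Each GPU can now multiply its row block by D: Z_j = F_j @ D, shape (25, d).
Z_blocks = [F @ D for F in F_blocks]

# Sanity check against the unsharded computation E = B @ C, Z = E @ D.
assert np.allclose(np.vstack(Z_blocks), (B @ C) @ D)
```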
  • the data transmission system and method provided according to the embodiment of the present application are described in detail above with reference to FIG. 1 to FIG. 9 .
  • the device, board and computing device for data transmission provided by the embodiment of the present application are introduced below in conjunction with FIG. 10 to FIG. 12 .
  • FIG. 10 is a schematic structural diagram of a data transmission device provided by an embodiment of the present application.
  • the data transmission device 100 is used for any accelerator in the above-mentioned data transmission system.
  • the data transmission device 100 includes a communication unit 101 and a processing unit 102, wherein,
  • the communication unit 101 is configured to acquire first data sent by other accelerators in the first node through the first communication link, where the first data includes data that other accelerators in the first node need to send to the second accelerator in the second node.
  • the first node and the second node are any two nodes in the plurality of nodes, and the first accelerator and the second accelerator are accelerators in the first communication plane.
  • one or more accelerators in the first node generate first data that needs to be sent to the second accelerator in the second node, and the one or more accelerators in the first node determine that the first accelerator in the first node and the second accelerator in the second node are located on the same communication plane, and each of the one or more accelerators in the first node first sends the data that needs to be sent to the second accelerator to the first accelerator through the first communication link.
  • the data sent by the one or more accelerators in the first node all include instruction information for sending the data to the second accelerator, such as the identification of the second accelerator or the address of the second accelerator.
  • the processing unit 102 is configured to, after the communication unit 101 of the first accelerator receives the data sent by the other accelerators in the first node, determine the destination accelerator of the data sent by each accelerator, that is, the second accelerator, according to the indication information in the data sent by each accelerator, and then send, through the communication unit 101, the data that each accelerator needs to send to the second accelerator to the second accelerator.
  • the processing unit 102 is further configured to determine the second data that needs to be sent to the fourth accelerator, and to determine that the third accelerator in the first node and the fourth accelerator are located on the same communication plane; the communication unit 101 is further configured to send the second data to the third accelerator in the first node through the first communication link, so that the third accelerator sends the second data to the fourth accelerator through the second communication link; wherein the fourth accelerator is an accelerator located at a different node from the first accelerator, and the second data includes instruction information for sending the second data to the fourth accelerator.
  • the communication unit 101 is further configured to receive data respectively sent by other accelerators located in the same communication plane as the first accelerator.
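  • A minimal structural sketch of this split of responsibilities is given below (class and method names are invented here and the transports are left abstract): the processing unit groups received chunks by destination accelerator using the indication information, and the communication unit carries them over the links:

```python
# Illustrative split between processing unit (routing decisions) and
# communication unit (moving data over the first/second communication links).

class CommunicationUnit:
    def recv_intra_node(self):
        """Receive (dest_id, payload) pairs over the first communication link."""
        raise NotImplementedError   # transport-specific (e.g. PCIe/UB)

    def send_inter_node(self, dest_id, payloads):
        """Send aggregated payloads over the second communication link."""
        raise NotImplementedError   # transport-specific (e.g. RoCE/IB)

class ProcessingUnit:
    def __init__(self, comm: CommunicationUnit):
        self.comm = comm

    def relay(self, received):
        """Group received chunks by their destination accelerator and forward
        each group once over the second communication link."""
        by_dest = {}
        for dest_id, payload in received:
            by_dest.setdefault(dest_id, []).append(payload)
        for dest_id, payloads in by_dest.items():
            self.comm.send_inter_node(dest_id, payloads)
```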
  • FIG. 11 is a schematic structural diagram of a board provided by an embodiment of the present application.
  • the board 110 includes a plurality of accelerators 111 and a plurality of network interface controllers (network interface controllers, NICs) 112, wherein part or all of the plurality of accelerators 111 are connected through a first communication link, that is, the board 110 includes one or more nodes described in the embodiments corresponding to FIGS. 3 to 5 above. Multiple accelerators in each node are connected through the first communication link, each accelerator 111 is connected to a NIC 112 through a bus 113, and one NIC 112 can be used by one or more accelerators 111 to send or receive data.
  • the NIC 112 corresponding to each accelerator 111 is used to send data to the accelerator 111 in other nodes, or receive data sent by the accelerator 111 in other nodes.
  • the accelerator 111 may be any one of AI chips such as GPU, NPU, TPU, or DPU.
  • a board 110 When a board 110 includes one of the above-mentioned nodes, the board 110 can be set in a computing device, and the accelerator 111 connected to the board 110 through the first communication link can complete various operations of data interaction in the node described in the embodiments corresponding to the above-mentioned FIGS. The various operations of data interaction between nodes described in the embodiments.
  • when a board 110 includes a plurality of the above-mentioned nodes, the multiple nodes on the board 110 can establish the second communication link described in the foregoing method embodiments. The accelerators 111 connected through the first communication link in any node on the board 110 can complete the operations of intra-node data interaction described in the embodiments corresponding to FIG. 3 to FIG. 9 above; they can also cooperate with accelerators in other nodes on the same board 110, or with accelerators in nodes on other boards 110 in the computing device, to complete the operations of inter-node data interaction described in the embodiments corresponding to FIG. 3 to FIG. 9 above.
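The grouping that the board and node descriptions rely on (n nodes with m accelerators each, m communication planes of n accelerators each, and one NIC per accelerator) can be sketched as follows; the dictionary layout and the `build_topology` helper are illustrative assumptions, not structures defined by this application.

```python
def build_topology(n_nodes, m_accels):
    """Enumerate nodes, communication planes and per-accelerator NICs.

    Accelerator A_{k*m + p} sits in node k and in communication plane p,
    which matches the numbering used in the description.
    """
    accels = [{"id": k * m_accels + p, "node": k, "plane": p, "nic": k * m_accels + p}
              for k in range(n_nodes) for p in range(m_accels)]
    nodes = {k: [a["id"] for a in accels if a["node"] == k] for k in range(n_nodes)}
    planes = {p: [a["id"] for a in accels if a["plane"] == p] for p in range(m_accels)}
    return nodes, planes

# The scenario of FIG. 4: two nodes, four accelerators per node.
nodes, planes = build_topology(2, 4)
assert nodes[0] == [0, 1, 2, 3] and nodes[1] == [4, 5, 6, 7]
assert planes[0] == [0, 4]   # G0 and G4 form communication plane L0
```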
  • FIG. 12 is a schematic diagram of a computing device provided by an embodiment of the present application.
  • the computing device 120 includes: one or more processors 121, a communication interface 122, a memory 123, and a plurality of accelerators 124.
  • the processors 121, communication interfaces 122, memory 123, and accelerators 124 are connected to each other through a bus 125.
  • for the connection relationship between the processor 121 and the accelerators 124, refer to the description of FIG. 1 above; the plurality of accelerators 124 can form one or more of the nodes described in FIG. 3 above, and the accelerators 124 may be deployed on one or more boards 110 as shown in FIG. 11.
  • the processor 121 may have various specific implementation forms, for example, the processor 121 may be a CPU, and the processor 121 may be a single-core processor or a multi-core processor.
  • the processor 121 may be a combination of a CPU and a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (application-specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the aforementioned PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the processor 121 may also be implemented solely by a logic device with built-in processing logic, such as an FPGA or a digital signal processor (digital signal processor, DSP).
  • the accelerator 124 may be any one of AI chips such as GPU, NPU, TPU, or DPU.
  • the communication interface 122 can be a wired interface or a wireless interface for communicating with other modules or devices.
  • the wired interface can be an Ethernet interface, a local interconnect network (LIN), etc.
  • the wireless interface can be a cellular network interface or a wireless local area network interface.
  • the memory 123 may be a non-volatile memory, for example, a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or a flash memory.
  • the memory 123 can also be a volatile memory, and the volatile memory can be a random access memory (random access memory, RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the memory 123 can also be used to store program code and data, so that the processor 121 or the accelerator 124 invokes the program code stored in the memory 123 to execute the operation steps for implementing data transmission in the foregoing method embodiments. In addition, the computing device 120 may contain more or fewer components than shown in FIG. 12, or have the components arranged in a different manner.
  • the bus 125 can be a PCIe bus, or an extended industry standard architecture (EISA) bus, unified bus (Ubus or UB), computer express link (compute express link, CXL), cache coherent interconnect for accelerators (CCIX), etc.
  • the bus 125 can be divided into an address bus, a data bus, a control bus, and the like.
  • the bus 125 may also include a power bus, a control bus, a status signal bus, and the like. However, for the sake of clarity, only one thick line is used in FIG. 12 , but it does not mean that there is only one bus or one type of bus.
  • the computing device 120 may further include an input/output interface 126 connected with an input/output device for receiving input information and outputting an operation result.
  • the specific implementation of various operations performed by the computing device 120 may refer to the specific operations in the method embodiments described above in FIG. 2 to FIG. 9 , which will not be repeated here.
  • an embodiment of the present application also provides a data transmission system. The system includes one or more computing devices 120 described above, and for the data interaction process between the accelerators in each computing device 120 in the system, reference may be made to the specific operations in the method embodiments described above in FIG. 3 to FIG. 9, which will not be repeated here.
  • an embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium stores instructions, and when the instructions are run on a processor, the method steps in the above-mentioned method embodiments can be implemented.
  • for a specific implementation in which the processor executes the above-mentioned method steps of the computer-readable storage medium, reference may be made to the specific operations shown in the method embodiments described in FIG. 3 to FIG. 9 above, and details are not repeated here.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on the computer, the processes or functions according to the embodiments of the present invention will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center through wired (such as coaxial cable, optical fiber, digital subscriber line) or wireless (such as infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disks, hard disks, magnetic tape), optical media, or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).
  • the steps in the method of the embodiment of the present application can be adjusted in order, merged or deleted according to actual needs; the modules in the system of the embodiment of the present application can be divided, combined or deleted according to actual needs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multi Processors (AREA)
  • Small-Scale Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present application provides a data transmission system, method, and related device. The system includes a plurality of nodes, and the accelerators in each node are interconnected through a first communication link; the accelerators of the plurality of nodes form a plurality of communication planes, each communication plane includes one accelerator in each node, any two communication planes include mutually different accelerators, and the accelerators in a same communication plane are connected through a second communication link. A first accelerator in a first node obtains first data sent by the other accelerators in the first node, where the first data includes data that the other accelerators in the first node each need to send to a second accelerator in a second node; the first accelerator then sends the first data to the second accelerator through the second communication link. Performing intra-node and inter-node data interaction in this way reduces the number of inter-node data exchanges, reduces data congestion and transmission latency, and improves data transmission efficiency.

Description

一种数据传输系统、方法及相关设备 技术领域
本申请涉及计算机技术领域,尤其涉及一种数据传输系统、方法及相关设备。
背景技术
随着计算机技术的发展,数据规模也在不断发展,为了解决大规模数据的计算问题,分布式计算应运而生。分布式计算为了解决算力不足的问题,把需要进行大量计算的任务分配给多个计算设备或芯片进行计算。在进行分布式计算的过程中,各个计算设备或芯片会产生其他计算设备或芯片需要的数据,这会涉及到不同计算设备之间或者不同芯片之间的数据交互,因此提高不同计算设备或不同芯片之间的数据传输效率是提高分布式计算效率的一个有效途径。
发明内容
本申请公开了一种数据传输系统、方法及相关设备,能够减少数据传输过程中的拥塞和传输时延,提高数据传输效率。
第一方面,本申请提供一种数据传输系统,该数据传输系统包括多个节点,多个节点中的每个节点包括多个加速器,每个节点内的多个加速器之间通过第一通信链路连接;这多个节点的加速器之间构成多个通信平面,每个通信平面包括每个节点内的一个加速器,且任意两个通信平面包括的加速器互不相同,同一个通信平面内的加速器之间通过第二通信链路连接;其中,第一节点内的第一加速器用于获取第一节点内的其他加速器发送的第一数据,该第一数据包括第一节点内的其他加速器各自需要发送至第二节点内的第二加速器的数据;上述第一节点和第二节点是上述多个节点中的任意两个节点,第一加速器和第二加速器是第一通信平面内的加速器;第一加速器还用于将上述第一数据通过上述第二通信链路发送给第二加速器。
通过在多个节点的加速器之间构建多个通信平面,在第一节点内的一个或多个加速器有数据需要发送给第一通信平面内的加速器时,先通过第一节点内的通信链路将数据发送给属于第一节点且属于第一通信平面的第一加速器,然后由第一加速器通过第二通信链路将数据分别发送给第一通信平面的加速器。通过上述方法能够降低节点间的加速器相互发送数据的次数,降低网络上的数据拥塞和传输时延,提高数据传输效率。
需要说明的是,当第一节点内的其他加速器有数据需要发送给第一通信平面内的多个加速器时,第一节点内的其他加速器能够将需要发送给第一通信平面内各个加速器的数据均先发送给第一加速器,然后第一加速器再通过第二通信链路将接收到的数据中分别发送给第一通信平面的各个加速器。例如第一节点内包括四个加速器,第一通信平面内包括六个加速器,第一节点内的其他三个加速器将需要发送给第一通信平面内的六个加速器的数据均发送给第一加速器,然后由第一加速器将接收到的数据中第一通信平面内其他五个加速器各自需要的数据,通过第二通信链路分别发送给其他五个加速器。
通过在多个节点的加速器之间构建多个通信平面,在第一节点内的一个或多个加速器有数据需要发送给第一通信平面内的一个或多个加速器时,先通过第一节点内的通信链路将数据发送给属于第一节点且属于第一通信平面的第一加速器,然后由第一加速器通过第二通信链路将第一通信平面内各个加速器需要的数据分别发送给各个加速器。通过上述方法能够降低节点间的加速器相互发送数据的次数,降低网络上的数据拥塞和传输时延,提高数据传输效率。
在一种可能的实现方式中,上述数据传输系统还包括处理器,处理器用于向上述多个节点内的各个加速器发送分组信息,该分组信息包括每个通信平面包括的加速器的信息。
在一种可能的实现方式中,上述第一加速器还用于根据接收到的分组信息与上述第二加速器建立第二通信链路的连接。
处理器确定用于计算的数据传输系统包括的节点之后,能够根据每个节点的加速器对加速器进行分组以确定每个通信平面包括的加速器的信息,并通知给各个节点内的加速器,以使各个节点内的加速器根据上述分组信息建立连接。
在一种可能的实现方式中,上述第一加速器还用于在有第二数据需要发送给第二通信平面内的任一加速器时,将第二数据发送给第一节点内的第三加速器,该第三加速器是位于第二通信平面的加速器;第三加速器用于将第二数据通过第二通信链路发送给第二通信平面内的上述任一加速器。
需要说明的是,当第一加速器和本节点内的其他加速器有需要发送给第二通信平面内一个或多个加速器的数据时,均先发送给第三加速器,第三加速器再通过第二通信链路将第二通信平面各个加速器需要的数据分别发送给各个加速器,从而减少节点间的通信规模,降低网络上的数据拥塞和传输时延,提高数据传输效率。
在一种可能的实现方式中,上述第一加速器还用于通过第二通信链路接收第一通信平面内各个加速器发送的第三数据。其中,第一通信平面内一个加速器发送给第一加速器的数据包括该加速器所在的节点内多个加速器需要发送给第一加速器的数据。
在一种可能的实现方式中,上述数据传输系统用于人工智能(artificial intelligence,AI)模型的训练,上述第一数据、第二数据以及第三数据是AI模型训练过程中产生的中间数据。
在人工智能模型的训练过程中,需要采用多个节点内的多个加速器对数据进行处理,不同加速器之间会涉及大量的数据传输,在人工智能模型训练的过场中产生的中间数据通过上述方法进行传输,能够提高模型训练的效率。
在一种可能的实现方式中,上述第一通信链路包括外设总线接口标准(peripheral component interface express,PCIe)总线或统一总线(unified bus,UB);上述第二通信链路为支持传输控制协议(transmission control protocol,TCP)、基于以太网实现远程直接内存访问(remote direct memory access over converged ethernet,RoCE)协议或无限带宽(InfiniBand,IB)协议的链路。
在一种可能的实现方式中,上述多个节点部署于一个或多个物理机中,上述多个节点内的加速器为图像处理器(graphics processing unit,GPU)、嵌入式神经网络处理器(neural-network processing units,NPU)、张量处理器(tensor processing unit,TPU)或深度学习处理器(deep learning processing units,DPU)。
第二方面,本申请提供一种数据传输方法,应用于包括多个节点的数据传输系统,上述 多个节点中的每个节点包括多个加速器,每个节点内的多个加速器之间通过第一通信链路连接;上述多个节点之间的加速器构成多个通信平面,每个通信平面包括每个节点内的一个加速器,且任意两个通信平面包括的加速器互不相同,同一个通信平面包括的加速器之间通过第二通信链路连接;上述数据传输方法包括:
第一节点内的第一加速器通过第一通信链路获取第一节点内的其他加速器发送的第一数据,该第一数据包括第一节点内的其他加速器各自需要发送至第二节点内的第二加速器的数据;然后第一加速器将上述第一数据通过上述第二通信链路发送给第二加速器。上述第一节点和第二节点是上述多个节点中的任意两个节点,第一加速器和第二加速器是第一通信平面内的加速器。
需要说明的是,当第一节点内的其他加速器有数据需要发送给第一通信平面内的多个加速器时,第一节点内的其他加速器能够将需要发送给第一通信平面内各个加速器的数据均先发送给第一加速器,然后第一加速器再通过第二通信链路将接收到的数据中各个加速器需要的数据分别发送给第一通信平面的各个加速器。例如第一节点内包括四个加速器,第一通信平面内包括六个加速器,第一节点内的其他三个加速器将需要发送给第一通信平面内的六个加速器的数据均发送给第一加速器,然后由第一加速器将接收到的数据中第一通信平面内其他五个加速器各自需要的数据,通过第二通信链路分别发送给其他五个加速器。
在一种可能的实现方式中,第一加速器接收处理器发送的分组信息,根据分组信息与上述第二加速器建立基于上述第二通信链路的连接,其中,所述分组信息包括每个通信平面包括的加速器的信息。
在一种可能的实现方式中,所述方法还包括:上述第一加速器在有第二数据需要发送给第二通信平面内的任一加速器时,将该第二数据发送给第一节点内的第三加速器,该第三加速器是位于第二通信平面的加速器;以使第三加速器将第二数据通过第二通信链路发送给第二通信平面内的上述任一加速器。
需要说明的是,当第一加速器有需要发送给第二通信平面内多个加速器的数据时,均先发送给第三加速器,第三加速器再通过第二通信链路将第二通信平面各个加速器需要的数据分别发送给各个加速器。
在一种可能的实现方式中,上述第一加速器还用于通过第二通信链路接收第一通信平面内各个加速器发送的第三数据。其中,第一通信平面内一个加速器发送给第一加速器的数据包括该加速器所在的节点内多个加速器需要发送给第一加速器的数据。
在一种可能的实现方式中,上述数据传输系统用于AI模型的训练,上述第一数据、第二数据以及第三数据是AI模型训练过程中产生的中间数据。在人工智能模型的训练过程中,需要采用多个节点内的多个加速器对数据进行处理,不同加速器之间会涉及大量的数据传输,在人工智能模型训练的过场中产生的中间数据通过上述方法进行传输,能够提高模型训练的效率。
在一种可能的实现方式中,上述第一通信链路包括PCIe总线或UB;上述第二通信链路为支持TCP、RoCE协议或IB协议的链路。
在一种可能的实现方式中,上述多个节点部署于一个或多个物理机中,上述多个节点内的加速器为GPU、NPU、TPU或DPU。
第三方面,本申请提供一种板卡,该板卡包括多个用于执行如上述第二方面或第二方面 任意可能实现方式中所述方法的加速器。
第四方面,本申请提供一种计算设备,包括处理器、存储器和多个加速器,上述存储器存储有计算机指令,当处理器执行所述计算机指令时,计算设备调用一个或多个加速器执行如上述第二方面或第二方面任意可能实现方式中所述的方法。
第五方面,本申请一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,计算机程序被加速器执行时,加速器执行如上述第二方面或第二方面任意可能实现方式中所述的方法。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种计算设备的结构示意图;
图2是本申请实施例提供的一种节点集群的示意图;
图3是本申请实施例提供的一种数据传输系统的示意图;
图4是本申请实施例提供的一种数据传输过程的示意图;
图5是本申请实施例提供的另一种数据传输过程的示意图;
图6是本申请实施例提供的一种数据传输方法的流程示意图;
图7是本申请实施例提供的另一种数据传输方法的流程示意图;
图8是本申请实施例提供的一种矩阵计算的示意图;
图9是本申请实施例提供的另一种矩阵计算的示意图;
图10是本申请实施例提供的一种数据传输装置的结构示意图;
图11是本申请实施例提供的一种板卡的结构示意图;
图12是本申请实施例提供的另一种计算设备的结构示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。本申请中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
下面结合附图对本申请进行具体介绍,首先介绍本申请中涉及的专用名词:
人工智能(artificial intelligence,AI)芯片:是一种用于处理人工智能应用中大量计算任务的模块,一个计算设备中可以有一个或多个AI芯片。
网络接口控制器(network interface controller,NIC):又称为网卡,NIC是用于支撑各个计算设备在网络上进行通信的计算机硬件,计算设备的NIC用于将一台计算设备与另一台计算设备连接起来,或者用于使计算设备与交换机等网络设备之间建立连接。
外设总线接口标准交换(peripheral component interface express switch,PCIe Switch)芯片:PCIe Switch芯片是一种用于扩展PCIe链路的模块,PCIe链路使用端到端的连接方式,在一条PCIe链路的两端只能各连接一个设备或器件,因此PCIe链路可以使用PCIe Switch芯片扩展PCIe链路,使得在PCIe链路的一端连接多个设备或者器件,其中,PCIe Switch芯片与其他设备或者器件之间通过PCIe总线连接。
下面首先介绍一种计算设备的内部结构。
如图1所示,图1是本申请实施例提供的一种计算设备的结构示意图,计算设备包括至少一个中央处理器(central processing unit,CPU)和至少一个节点,每个节点包括多个加速器。中央处理器作为主机CPU(Host CPU),通过总线与各个节点内的加速器连接,或者通过总线与交换芯片与多个加速器连接。图1中以一个计算设备包括两个CPU和两个节点,每个节点包括4个加速器为例,一个Host CPU通过PCIe总线以及PCIe交换芯片与一个节点内的4个加速器分别连接,一个节点内的4个加速器通过PCIe总线连接。需要说明的是,计算设备还包括每个加速器对应的内存、网卡等器件。上述加速器可以是图像处理器(graphics processing unit,GPU)、嵌入式神经网络处理器(neural-network processing units,NPU)、张量处理器(tensor processing unit,TPU)或深度学习处理器(deep learning processing units,DPU)等AI芯片中的任意一种。
如图2所示,图2中是一种节点集群的示意图,该节点集群中的每个节点包括多个加速器,不同节点之间通过通信网络连接。其中,节点集群中的多个节点可以是如图1所示的一个计算设备内的节点,也可以是不同计算设备内的节点,当多个节点集群位于不同计算设备时,不同计算设备内节点的数量可以相同,也可以不同。在节点集群进行计算的过程中,每个加速器会产生其他加速器需要的数据,因此该加速器需要将数据发送给其他需要的加速器。当一个加速器产生的数据是同一个节点内的加速器需要的数据时,该加速器能够通过内部高速链路将数据发送给节点内的加速器。但是当一个节点内的多个加速器各自产生了需要发送给另外一个节点内的目标加速器的数据时,这多个加速器需要各自将数据通过通信网络发送给另外一个节点内的目标加速器。例如节点N 0内的4个加速器各自产生了节点N 2内加速器0需要的数据,节点N 0内的4个加速器需要分别将数据通过通信网络发送给节点N 2内加速器0。当节点集群内的多个节点的多个加速器之间需要互相发送数据时,会导致通信网络中的通信规模较大。例如,节点N 0内的4个加速器各自产生了节点N 1内4个加速器需要的数据以及节点N 2内4个加速器;节点N 1内的4个加速器也各自产生了节点N 0内4个加速器需要的数据以及节点N 2内4个加速器。当通信网络中的通信规模较大时,容易导致网络拥塞,降低数据传输效率;并且节点集群的通信规模会随着节点数量的增加而增加,不利于集群的扩容。
本申请实施例提供一种数据传输方法,应用于如图3所示的包括多个节点的数据传输系统,上述多个节点中的每个节点包括至少两个加速器,每个节点内的多个加速器之间通过第一通信链路连接;多个节点之间的加速器构成多个通信平面,每个通信平面包括每个节点内的一个加速器,且任意两个通信平面包括的加速器互不相同,同一个通信平面内的加速器之间通过第二通信链路连接,图3中仅示出了一个通信平面的加速器(各个节点内的加速器0)通过第二通信链路连接。上述多个节点可以是同一个计算设备中的节点,也可以是多个计算设备中的节点。其中,计算设备的结构如上述图1所描述的计算设备。当多个节点位于同一 个计算设备时,同一个节点内的多个加速器通过第一通信链路连接,能够通过第一通信链路进行数据交互;同一个计算设备的不同节点内的加速器之间由于没有通过第一通信链路连接,同一个计算设备的不同节点之间的加速器需要通过第二通信链路进行数据交互。当上述多个节点是多个计算设备中的节点时,不同计算设备的节点之间的加速器通过第二通信链路连接。需要说明的是,当上述多个节点位于多个计算设备中时,任意两个计算设备中包括的节点的数量可以相同,也可以不同。图3中以每个节点内的4个加速器通过第一通信链路连接为例,应理解,每个节点内通过第一通信链路连接的加速器的数量还可以是其他数量。
上述第一通信链路包括总线,例如PCIe总线或统一总线(unified bus,Ubus或UB)等,第一通信链路也可以是包括总线与交换芯片的通信网络,例如PCIe总线和PCIe Switch芯片等。第二通信链路可以是支持TCP、RoCE协议或IB协议的链路,例如以太网或IB网络。其中,每个加速器对应有一个网卡,不同节点的加速器通过网卡和交换机等网络设备连接。
如果数据传输系统包括n个节点N 0~N n-1,每个节点包括m个加速器,则该数据传输系统共包括m*n个加速器,其中,m和n均是大于1的整数。数据传输系统的n个节点中,每个节点内的一个加速器与其他各个节点内的一个加速器通过第二通信链路连接,构成通过第二通信链路连接的通信平面,且每个通信平面包括一个节点内的一个加速器,任意两个通信平面所包括的加速器不同。上述包括n个节点,每个节点包括m个加速器的数据传输系统共包括m个通信平面,每个通信平面包括n个加速器。
上述n个节点在共同完成一项任务的过程中,例如通过模型并行(model parallelism)的方式训练神经网络模型,每个加速器会生成需要发送给其他加速器的数据。当一个节点内的一个源加速器有数据需要发送给一个或多个目的加速器时,这一个或多个目的加速器可能与源加速器位于同一个节点,也可能源加速器位于不同节点;并且当有多个目的加速器时,这多个目的加速器可能部分与源加速器位于同一个节点,部分目的加速器与源加速器位于不同的节点。需要说明的是,源加速器发送给各个目的加速器的数据可能相同,也可能发送给部分目的加速器的数据是相同的,还可能是发送给各个加速器的数据各不相同,本申请实施例不做具体限定。
本申请实施例中,为了将各个加速器中生成的数据发送给需要这些数据的节点,首先各个节点内的加速器通过第一通信链路交互数据。以数据传输系统中第一节点内的第一加速器为例,其中,第一节点是数据传输系统中的任意一个节点,第一加速器是第一节点内的任意一个加速器,该第一加速器位于数据传输系统的第一通信平面。当第一节点内的加速器有数据需要发送给位于第一通信平面内的加速器时,先通过第一通信链路将数据发送给第一节点内位于第一通信平面的第一加速器。当第一加速器以及第一节点内的其他加速器有数据需要发送给位于第二通信平面内的加速器时,第一加速器以及其他加速器均将数据发送给第一节点内位于第二通信平面的加速器。其中,第二通信平面是数据传输系统中的任意一个通信平面。各个节点内的加速器均执行上述节点内的数据交互操作,各个节点内的加速器完成节点内数据交互之后,每个加速器中保存的是该加速器所在的通信平面内各个加速器需要的数据。在各个节点内的加速器数据交互完成之后,位于同一个通信平面内的各个加速器再通过第二通信链路交互数据,最终每个加速器得到数据传输系统内各个加速器需要发送给该加速器的数据。需要说明的是,各个加速器发送的数据中均包括指示数据对应的目的加速器的指示信息,该指示信息可以是目的加速器的标识或者地址,例如,节点N 0中的加速器1有数据需要 发送给节点N 1中的加速器0,节点N 0中的加速器1将该数据发送给节点N 0中的加速器0,该数据中包括节点N 1中的加速器0的地址。
示例性的,节点0中的m个加速器的编号分别为A 0~A m-1,节点1中的m个加速器的编号分别为A m~A 2m-1,节点N k中的m个加速器的编号分别为A km~A (k+1)*m-1,其中,k为小于或等于n的整数。其中,加速器A 0、A m、A 2m、A km、…、A (n-1)m是位于同一个通信平面的加速器,加速器A 1、A m+1、A 2m+1、…A km+1、…、A (n-1)m+1是位于同一个通信平面的加速器,依此类推,加速器A m-1、A 2m-1、A 3m-1、…A (k+1)m-1、…、A n*m-1是位于同一个通信平面的加速器。
用(x,y)表示加速器A x需要发送给加速器A y的数据,其中,x和y均是大于或等于0,且小于或等于m*n的整数。当一个节点N k中的m个加速器均有需要发送给其他加速器的数据时,对于节点N k中的任意一个加速器A x,该加速器需要发送给其他加速器的数据分别为(x,0)、(x,1)、(x,2)、…、(x,n*m-1),该加速器先将各个数据分别发送给各个数据对应的转发加速器,该转发加速器位于节点N k中,且与目的加速器位于同一个通信平面的加速器。例如A x将发送给目的加速器A 0、A m、A 2m、…A km、…、A (n-1)m的数据均发送给加速器A km,将需要发送给目的加速器A 1、A m+1、A 2m+1、…A km+1、…、A (n-1)m+1发送给加速器A km+1,依此类推,将需要发送给目的加速器A m-1、A 2m-1、A 3m-1、…A (k+1)m-1、…、A n*m-1的数据发送给加速器A (k+1)*m-1。同时加速器A x会接收到节点N k中其他加速器发送的数据,加速器A x接收的数据是需要发送给与A x位于同一个通信平面的加速器的数据。
节点N k中的任意一个加速器均会执行上述操作,最终节点N k中任意一个加速器中得到的数据是与该加速器位于同一个通信平面的n个加速器需要的数据。例如,加速器A km中有需要发送给A 0、A m、A 2m、…A km、…、A (n-1)m的数据,加速器A km+1中有需要发送给A 1、A m+1、A 2m+1、…A km+1、…、A (n-1)m+1的数据。
同时数据传输系统内任意一个节点内的加速器均会执行上述操作,各个节点内的加速器完成节点内的数据交互后,任意一个加速器中得到的是与该加速器位于同一个通信平面的n个加速器需要的数据。例如加速器A 0中包括有需要发送给A 0、A m、A 2m、…A km、…、A (n-1) m的数据,加速器A 1中有需要发送给A 1、A m+1、A 2m+1、…A km+1、…、A (n-1)m+1的数据。最后各个加速器执行节点间的数据交换,每个加速器将与该加速器位于同一通信平面的各个加速器需要的数据通过第二通信链路分别发送给各个加速器,完成同一个通信平面的加速器的数据交互,最终每个加速器得到的是数据传输系统内各个加速器需要发送给该加速器的数据。例如,加速器A 0将需要发送给A m的数据通过第二通信链路发送给A m,加速器A 0将需要发送给A km的数据通过第二通信链路发送给A km等等,最终加速器A 0得到各个加速器需要发送给A 0的数据,A km得到各个加速器需要发送给A km的数据。
下面以加速器是GPU,两个节点N 0和N 1之间的数据传输为例,对本申请提供的数据传输方法进行详细介绍。如图4所示,图4是本申请实施例提供的一种数据传输的示意图。图4中仅示出节点内GPU的连接关系以及一个通信平面的GPU的连接关系。N 0包括G0、G1、G2和G3共四个GPU,N 1包括G4、G5、G6和G7共四个GPU。该数据传输系统包括L0(包、L1、L2和L3共四个通信平面,其中,通信平面L0包括G0和G4,通信平面L1包括G1和G5,通信平面L2包括G2和G6,通信平面L3包括G3和G7。各个GPU分别存在需要发送给两个节点内的所有GPU的数据,例如G0包括需要分别发送给G0~G7的数据(0,0)、(0, 1)、…、(0,7),G1包括需要分别发送给G0~G7的数据(1,0)、(1,1)、…、(1,7),依此类推,G7包括需要分别发送给G0~G7的数据(7,0)、(7,1)、…、(7,7)。N 0和N 1首先进行各自节点内的GPU的数据交互,具体包括:对于N 0或N 1内的任意一个GPU,例如N 0内的G0,当N 0内的一个或多个GPU有数据需要发送给位于通信平面L0内的GPU时,均先通过第一通信链路将数据发送给G0;当G0有数据需要发送给其他通信平面内的目的GPU时,先通过第一通信链路将数据发送给N 0内与目的GPU位于同一通信平面的GPU;例如,G0会接收到G1发送的数据(1,0)和(1,4),接收到G2发送的数据(2,0)和(2,4),接收到G3发送的数据(3,0)和(3,4);G0将需要发送给G1的数据(0,1)和需要发送给G5的数据(0,5)都发送给G1,G0将需要发送给G2的数据(0,2)和需要发送给G6的数据(0,6)都发送给G2,G0将需要发送给G3的数据(0,3)和需要发送给G7的数据(0,7)都发送给G3。
N 0和N 1内的各个GPU均执行上述节点内的数据交互,各节点内的GPU在完成节点内的数据交互之后,每个GPU中的数据是与该GPU位于同一个通信平面的各个GPU需要的数据。如图4中所示,在节点内的数据交互完成之后,G0中的数据是G0和G4需要的数据,包括数据(0,0)、(1,0)、(2,0)、(3,0)、(0,4)、(1,4)、(2,4)和(3,4);G1中的数据是G1和G5需要的数据,包括数据(0,1)、(1,1)、(2,1)、(3,1)、(0,5)、(1,5)、(2,5)和(3,5);G4中的数据是G0和G4需要的数据,包括数据(4,0)、(5,0)、(6,0)、(7,0)、(4,4)、(5,4)、(6,4)和(7,4);G5中的数据是G1和G5需要的数据,包括数据(4,1)、(5,1)、(6,1)、(7,1)、(4,5)、(5,5)、(6,5)和(7,5)。
在各节点内的GPU完成节点内的数据交互之后,位于同一个通信平面的各个GPU通过第二通信链路进行节点间的数据交互,各个GPU将同一通信平面内其他GPU各自需要的数据通过第二通信链路分别发送给对应的GPU。具体的,G0将数据(0,4)、(1,4)、(2,4)和(3,4)发送给G4;G4将数据(4,0)、(5,0)、(6,0)和(7,0)发送给G0;G1将数据(0,5)、(1,5)、(2,5)和(3,5)发送给G5,G5将数据(4,1)、(5,1)、(6,1)和(7,1)发送给G1;其他通信平面的数据交互过程与上述相同,在此不再一一赘述。位于同一通信平面的GPU完成数据交互之后,每个GPU中的数据均是数据传输系统中各个GPU需要发送给该GPU的数据,例如,G0中的数据是(0,0)、(1,0)、(2,0)、(3,0)、(4,0)、(5,0)、(6,0)和(7,0),G5中的数据是(0,5)、(1,5)、(2,5)、(3,5)、(4,5)、(5,5)、(6,5)和(7,5)。
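The two-phase exchange walked through above, intra-node aggregation over the first communication link followed by a per-plane exchange over the second communication link, can be simulated in plain Python. This is only a sketch of the communication schedule under the numbering used in FIG. 4, not an implementation of the system itself; the helper names `all_to_all`, `node_of`, and `plane_of` are assumptions made for the illustration.

```python
def all_to_all(n_nodes, m):
    """Simulate the intra-node then inter-node exchange for data items (x, y),
    where (x, y) means 'data accelerator x must deliver to accelerator y'."""
    total = n_nodes * m
    node_of = lambda a: a // m          # node index of accelerator a
    plane_of = lambda a: a % m          # communication plane of accelerator a
    buffers = {a: [(a, y) for y in range(total)] for a in range(total)}

    # Phase 1: first communication link, inside each node.  Every item is
    # handed to the local accelerator lying in the destination's plane.
    staged = {a: [] for a in range(total)}
    for a, items in buffers.items():
        for (x, y) in items:
            relay = node_of(a) * m + plane_of(y)
            staged[relay].append((x, y))

    # Phase 2: second communication link, inside each communication plane.
    final = {a: [] for a in range(total)}
    for relay, items in staged.items():
        for (x, y) in items:
            final[y].append((x, y))     # relay and y share a plane by construction
    return final

result = all_to_all(2, 4)                                 # the FIG. 4 scenario: G0..G7
assert sorted(result[0]) == [(x, 0) for x in range(8)]    # G0 ends with (0,0)..(7,0)
```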
以上是以两个节点中的数据传输为例对本申请实施例提供的数据传输方法进行介绍,应理解,本申请实施例中,对于包括两个以上节点的系统,同样能够用上述方法进行数据传输。下面以加速器是GPU、8个节点N 0~N 8之间的数据传输为例,对本申请提供的数据传输方法进行详细介绍。如图5所示,图5是本申请实施例提供的另一种数据传输的示意图。数据传输系统包括4台计算设备,每个计算设备包括2个节点,每个节点包括4个GPU,即数据传输系统共包括32个GPU,分别为G0~G31。该数据传输系统共包括4个通信平面,如图5中所示,4个通信平面分别为L0、L1、L2和L3,通信平面L0包括G0、G4、G8、G12、G16、G20、G24和G28,通信平面L1包括G1、G5、G9、G13、G17、G21、G25和G29,通信平面L2包括G2、G6、G10、G14、G18、G22、G26和G30,通信平面L3包括G3、G7、G11、G15、G19、G23、G27和G31。每个通信平面包括8个GPU,图5中仅示出通信平面L0通 过第二通信链路连接。
当数据传输系统中的各个GPU有数据需要发送给其他GPU时,首先进行节点内的GPU之间的数据交互,对于通信平面L0中的8个GPU,G0会分别接收节点N 0内的G1~G3发送给通信平面L0内各个GPU的数据,同时G0将需要发送给通信平面L1内各个GPU的数据发送给G1,G0将需要发送给通信平面L2内各个GPU的数据发送给G2,G0将需要发送给通信平面L3内各个GPU的数据发送给G3。其他节点内的GPU之间同样根据上述方法进行数据交互。例如,节点N 5内的G21会接收G20、G22和G23分别发送给通信平面L1内各个GPU的数据,同时G21将需要发送给通信平面L0内各个GPU的数据发送给G20,G21将需要发送给通信平面L2内各个GPU的数据发送给G22,G21将需要发送给通信平面L3内各个GPU(的数据发送给G23。
各节点内的GPU在完成节点内的数据交互之后,每个GPU中的数据是与该GPU位于同一个通信平面内的各个GPU需要的数据。例如,G0中的数据是通信平面L0内8个GPU各自需要的数据,G1中的数据是通信平面L1内8个GPU各自需要的数据,G4中的数据是通信平面L0内8个GPU各自需要的数据,G6中的数据是通信平面L2内8个GPU各自需要的数据。
在各节点内的GPU完成节点内数据交互之后,位于同一个通信平面的各个GPU通过第二通信链路进行节点间的数据交互。各个GPU将同一通信平面内其他GPU需要的数据通过第二通信链路分别发送给其他GPU。具体的,G0将数据(0,4)、(1,4)、(2,4)和(3,4)发送给G4,将数据(0,8)、(1,8)、(2,8)和(3,8)发送给G8,将数据(0,12)、(1,12)、(2,12)和(3,12)发送给G12等;G4将数据(4,0)、(5,0)、(6,0)和(7,0)发送给G0,将数据(4,8)、(5,8)、(6,8)和(7,8)发送给G8,将数据(4,12)、(5,12)、(6,12)和(7,12)发送给G12等;G1将数据(0,5)、(1,5)、(2,5)和(3,5)发送给G5,G5将数据(4,1)、(5,1)、(6,1)和(7,1)发送给G1等等;其他通信平面内的GPU的数据交互过程与上述相同,在此不再一一赘述。位于同一通信平面的GPU完成节点间的数据交互之后,每个GPU中的数据均是数据传输系统中各个GPU需要发送给该GPU的数据,例如,G0中的数据是(0,0)、(1,0)、(2,0)、…、(31,0);G1中的数据是(0,1)、(1,1)、(2,1)、…、(31,1)。
图6是本申请实施例提供的一种数据传输方法的流程示意图,该数据传输方法应用于如图3所示的数据传输系统,该数据传输方法包括S601至S602。
S601.第一加速器获取第一节点内其他加速器通过第一通信链路发送的第一数据。
其中,第一数据包括第一节点内的其他加速器各自需要发送给第二节点内的第二加速器的数据;第一节点和第二节点是数据传输系统的多个节点中的任意两个节点,第一加速器和第二加速器是第一通信平面内的加速器。
上述n个节点在共同完成一项任务的过程中,例如通过模型并行(model parallelism)的方式训练神经网络模型时,每个加速器会生成需要发送给其他加速器的数据。第一节点内的一个或多个加速器生成了需要发送给第二节点内的第二加速器的数据;第一节点内的这一个或多个加速器通过第一通信链路将各自需要发送给第二加速器的数据发送给与第二加速器位于同一通信平面的第一加速器。其中,各个加速器发送的数据中均包括将数据发送给第二加速器的指示信息,例如第二加速器的标识或者第二加速器的地址;第一节点和第二节点可以 是同一个计算设备中的两个节点,也可以是不同计算设备中的两个节点。
S602.第一加速器通过第二通信链路将第一数据发送给第二加速器。
第一加速器在接收到上述第一节点内各个加速器发送的需要发送给第二加速器的数据之后,得到第一数据,然后将第一数据通过第二通信链路发送给第二加速器。
第一节点内的其他加速器将需要发送给第二加速器的数据发送给第一节点内的第一加速器的具体操作,可以参照上述图4或图5对应的实施例中关于节点内数据交互的操作;第一加速器将数据通过第二通信链路发送给第二加速器的具体操作,可以参照上述图4或图5对应的实施例中关于节点间数据交互的操作。例如,第一节点可以是图4中的N 0,第二节点是图4中的N 1,第一加速器是图4中的G0,第二加速器是图4中的G4,具体的数据交互操作在此不再赘述。
上述图6对应的方法实施例是以两个节点中一个节点内的一个加速器接收同一个节点内的其他加速器发送的数据,然后通过第二通信链路发送给另一个节点内的一个加速器为例。应理解,本申请提供的数据传输方法能够用于图3所示的数据传输系统中的各个节点内的每个加速器。各个节点内的各个加速器能够先通过第一通信链路进行节点内的数据交互,以使任意一个加速器中得到的数据是与该加速器位于同一个通信平面的各个加速器需要的数据;然后各个节点内的加速器分别通过第二通信链路与同一个通信平面内的各个加速器交互数据,最终每个加速器得到各自需要的数据。如图7所示,图7是本申请实施例提供的另一种数据传输方法的流程示意图,该数据传输方法应用于如图3所示的数据传输系统,该数据传输方法包括S701至S703。
S701.处理器向各自管理的节点内的各个加速器发送分组信息,该分组信息包括每个通信平面包括的加速器的信息。
上述数据传输系统包括n个节点和至少一个Host CPU,其中,每个节点包括m个加速器,一个Host CPU管理至少一个节点。上述分组信息包括数据传输系统中每个通信平面包括的加速器的信息。其中,加速器的信息可以是加速器的标识或地址。例如上述图5所示的数据传输系统包括8个节点,每个节点包括4个GPU,该数据传输系统包括4个通信平面,每个通信平面包括8个GPU,则分组信息包括每个通信平面包括的8个GPU的信息。各个节点内的加速器根据上述分组信息与位于同一个通信平面的其他各个加速器建立基于第二通信链路的连接。
S702.各个节点内的加速器进行节点内的数据交互,以使一个加速器得到的数据是与该加速器位于同一个通信平面的各个加速器需要的数据。
上述n个节点在共同完成一项任务的过程中,例如通过模型并行的方式训练神经网络模型时,每个加速器会生成需要发送给其他加速器的数据。例如第一节点内的一个或多个加速器生成了需要发送给第二节点内的第二加速器的第一数据,第一节点内的这一个或多个加速器根据上述分组信息,确定第一节点内的第一加速器与第二节点内的第二加速器都位于第一通信平面,第一节点内的这一个或多个加速器各自将需要发送给第二加速器的数据先通过第一通信链路发送给第一加速器。其中,第一节点内的各个节点发送的数据中均包括将数据发送给第二加速器的指示信息,例如第二加速器的标识或者第二加速器的地址等;第一节点和第二节点是数据传输系统中任意两个不同的节点,第一节点和第二节点可以是同一个计算设备中的两个节点,也可以是不同计算设备中的两个节点。
需要说明的是,当第一节点内的其他加速器有数据需要发送给第一通信平面内的多个加速器时,第一节点内的其他加速器能够将需要发送给第一通信平面内各个加速器的数据均先发送给第一加速器。例如第一节点内包括四个加速器,第一通信平面内包括六个加速器,第一节点内的其他三个加速器将各自需要发送给第一通信平面内的六个加速器的数据均发送给第一加速器。
在一种可能的实现方式中,第一节点内的第一加速器会生成需要发送给第二节点内第四加速器的第二数据,第一加速器根据上述分组信息确定第一节点内的第三加速器与第四加速器位于同一个通信平面,则第一加速器将第二数据发送给第三加速器,以使第三加速器通过第二通信链路将第二数据发送给第四加速器。其中,第二数据中包括指示将第二数据发送给第四加速器的指示信息。
本申请实施例中,各个节点内的加速器生成需要发送给其他加速器的数据后,各个节点内的加速器通过第一通信链路进行节点内的加速器之间的数据交互,最终使得一个加速器得到的数据是与该加速器位于同一通信平面的各个加速器需要的数据。节点内加速器的数据交互的方法可以参照上述图3至图5对应的实施例中关于节点内数据交互的操作,在此不再赘述。
S703.同一个通信平面内的各个加速器分别通过第二通信链路进行节点间的数据交互,得到每个加速器各自需要的数据。
第一加速器在接收到第一节点内其他各个加速器发送的需要发送给第二加速器的数据之后,根据接收到的数据中的指示信息,将其他各个加速器需要发送给第二加速器的数据以及第一加速器需要发送给第二加速器的数据,通过第二通信链路发送给第一加速器。同样的,第三加速器通过第二通信链路将第二数据发送给第四加速器。第一加速器也会通过第二通信链路接收第一通信平面内其他加速器各自发送的第三数据,第三数据包括第一通信平面内的各个加速器所属的节点内的加速器发送给第一加速器的数据。
本申请实施例中,各个节点内的加速器通过第一通信链路进行节点内的加速器的数据交互之后,任意一个加速器的内存中保存的是该加速器所在的通信平面内的各个加速器需要的数据;然后位于同一通信平面的各个加速器通过第二通信链路进行节点间的数据交互,最终每个加速器得到各自需要的数据,即数据传输系统中各个加速器需要发送给该加速器的数据,位于同一通信平面的加速器进行节点间的数据交互的方法可以参照上述图3至图5对应的实施例中关于节点间数据交互的操作,在此不再赘述。
通过本申请实施例提供的数据传输方法,在多个节点内的各个加速器需要相互交互数据时,各个节点内的加速器先通过节点内的通信链路进行数据交互,各个节点内的加速器通过节点内的第一通信链路进行数据交互之后,任意一个加速器中的数据是与该加速器位于同一通信平面的各个加速器需要的数据,位于同一个通信平面内的各个加速器再通过第二通信链路交互数据,最终实现各个加速器得到各自需要的数据。通过实施本申请提供的数据传输方法,能够充分利用节点内的内部高速链路实现同一通信平面的数据聚合,然后再通过第二通信链路在各个节点的加速器之间进行数据交互,能够降低节点间的加速器相互发送数据的次数,即降低节点间的通信规模,从而能够降低网络上的数据拥塞和传输时延,提高数据传输效率,有利于进行系统扩容以增强算力。例如,上述图4所对应的实施例中,节点N 0内的G0~G3有需要发送给节点N 1内G4的数据时,G1~G3先将需要发送给G4的数据通过节点内 的第一通信链路发送给G0,然后由G0将G0~G3需要发送给G4的数据通过第二通信链路发送给G4,而不需要G0~G3分别将数据通过第二通信链路发送给G4,能够将节点间的通信次数减少到原来的四分之一。
本申请实施例提供的数据传输方法能够应用于矩阵运算,例如神经网络模型的模型训练过程中的矩阵运算。如图8所示,图8是本申请实施例提供的一种矩阵运算的示意图。图8是通过如图4所示的包括两个节点、8个GPU的数据传输系统进行模型训练的示意图,8个GPU完成模型训练过程中Z=(B x C)x D的矩阵乘法计算。其中,B为a*b的矩阵,C为b*c的矩阵,D为c*d的矩阵,E=B x C,则E为a*c的矩阵。在进行上述矩阵乘法运算时,矩阵C的数据量较大,例如矩阵C为嵌入表(embedding table),需要将矩阵C部署到多个GPU中,例如部署到G0~G7共8个GPU中,将矩阵C按列分成8个子矩阵,如果将矩阵按列平均分成8个子矩阵,则子矩阵C0~C7均为b*c1的矩阵,其中c1=c/8。如图8所示,8个GPU输入数据为矩阵C的一个子矩阵以及矩阵B,即每个GPU完成矩阵乘法B*C的部分计算。例如G0完成E0=B x C0的矩阵计算,G4完成E4=B x C4的矩阵计算。
如图8所示,每个GPU完成矩阵乘法计算后,每个GPU中得到a*c1的8个矩阵E0~E7。然后各个GPU需要继续协作完成与矩阵D的矩阵乘法运算,由于矩阵D是c*d的矩阵,矩阵E0~E7不能和矩阵D直接进行矩阵乘法计算,需要先通过上述各节点内GPU间的数据交互以及节点间同一通信平面的GPU间的数据交互的方法,将各个GPU中的矩阵转换为列数为c的矩阵。矩阵E0~E7组合后的矩阵E为a*c的矩阵,因此可以将矩阵E按行分成8个子矩阵F0~F7,如果将矩阵E按行平均分成8个子矩阵,每个子矩阵为a1*c,其中a1=a/8。然后进行节点内各GPU的数据交互和节点间各GPU的数据交互,将每个GPU中的矩阵转换为a1*c的矩阵,以完成与矩阵D的乘法运算。
根据上述分析,GPU间需要进行数据交互,以将a*c1的矩阵E0~E7转换为a1*c的矩阵F0~F7。为便于描述,下面以a等于200,c等于800为例,则a1=25,c1=100,E为200*800的矩阵,E0~E7为200*100的矩阵,F0~F7为25*800的矩阵。如图9所示,G0中矩阵E0相当于矩阵E的第1~100列,G1中矩阵E1相当于矩阵E的第101~200列,G2中矩阵E2相当于矩阵E的第201~300列,依此类推,G7中矩阵E7相当于矩阵E的第701~800列。为了将各个GPU中200*100的矩阵转换为25*800的矩阵,首先将各个GPU中的矩阵均按行分成8个25*100的矩阵,如果G0中最终得到的矩阵F0是矩阵E中的第1~25行的数据,则G0需要接收G1~G7中各个矩阵的第1~25行的数据;如果G1中最终得到的矩阵F1是矩阵E中的第26~50行的数据,则G1需要接收G0以及G2~G7中各个矩阵的第26~50行的数据;如果G2中最终得到的矩阵F2是矩阵E中的第51~75行的数据,则G2需要接收G0、G1以及G3~G7中各个矩阵的第51~75行的数据;依此类推,如果G7中最终得到的矩阵F7是矩阵E中的第176~200行的数据,则G7需要接收G0~G6中各个矩阵的第176~200行的数据。
对于G0,G0中的第1~25行的数据是G0自身需要的数据,为G0发送给G0的数据,相当于上述图4对应的实施例中的数据(0,0);G0中的第26~50行的数据是G1需要的数据,为G0发送给G1的数据,相当于上述图4对应的实施例中的数据(0,1);依此类推,G0中的第176~200行的数据是G7需要的数据,为G0发送给G7的数据,相当于上述图4对应的实施例中的数据(0,7)。因此,G0中包括发送给G0~G7的数据(0,0)、(0,1)、…、(0,7)。
对于G1,G1中的第1~25行的数据是G0需要的数据,为G1发送给G0的数据,相当于上述图4对应的实施例中的数据(1,0);G1中的第26~50行的数据是G1自身需要的数据,为G1发送给G1的数据,相当于上述图4对应的实施例中的数据(1,1);依此类推,G1中的第176~200行的数据是G7需要的数据,为G1发送给G7的数据,相当于上述图4对应的实施例中的数据(1,7)。因此,G1中包括发送给G0~G7的数据(1,0)、(1,1)、…、(1,7)。
同样的,G2中包括发送给G0~G7的数据(2,0)、(2,1)、…、(2,7);G3中包括发送给G0~G7的数据(3,0)、(3,1)、…、(3,7);G4中包括发送给G0~G7的数据(4,0)、(4,1)、…、(4,7);G5中包括发送给G0~G7的数据(5,0)、(5,1)、…、(5,7);G6中包括发送给G0~G7的数据(6,0)、(6,1)、…、(6,7);G7中包括发送给G0~G7的数据(7,0)、(7,1)、…、(7,7)。其中,一个GPU需要发送给另一个GPU的数据均为25行100列的数据。
通过上述分析,G0~G7中的任何一个GPU都要分别向各个GPU发送25行100列的数据,每个GPU也要接收其他各个GPU发送的25行100列的数据,才能够将G0中E0转换为F0,将G1中的E1转换为F1,等等。通过上述图4对应的实施例中的方法,通过节点内GPU的数据交互和节点间同一通信平面的GPU的数据交互,将8个GPU中的数据分别转换为a1*c的矩阵F0~F7,每个GPU中的矩阵与矩阵D完成矩阵乘法运算,分别得到a1*d的矩阵Z0~Z7。
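Assuming the concrete sizes used in this example (a = 200, c = 800, eight GPUs, 25 by 100 tiles), the row-block redistribution that turns E0 to E7 into F0 to F7 can be checked with a short NumPy sketch; the single-process simulation below only illustrates which tile goes to which GPU and is not how the GPUs would actually exchange data.

```python
import numpy as np

a, c, g = 200, 800, 8                 # rows of E, columns of E, number of GPUs
a1, c1 = a // g, c // g               # 25 rows and 100 columns per tile

E = np.arange(a * c, dtype=np.float64).reshape(a, c)
E_cols = [E[:, i * c1:(i + 1) * c1] for i in range(g)]   # E0..E7, one 200x100 block per GPU

# Each GPU i cuts its 200x100 block into eight 25x100 tiles; tile j is the data
# (i, j) that GPU i must send to GPU j (intra-node first, then inter-node).
tiles = [[E_cols[i][j * a1:(j + 1) * a1, :] for j in range(g)] for i in range(g)]

# After the exchange, GPU j concatenates the tiles it received column-wise and
# obtains F_j, i.e. rows j*25..(j+1)*25 of the full matrix E.
F = [np.hstack([tiles[i][j] for i in range(g)]) for j in range(g)]
assert all(np.array_equal(F[j], E[j * a1:(j + 1) * a1, :]) for j in range(g))
```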
对于上述方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本发明并不受所描述的动作顺序的限制,其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本发明所必须的。
本领域的技术人员根据以上描述的内容,能够想到的其他合理的步骤组合,也属于本发明的保护范围内。其次,本领域技术人员也应该熟悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本发明所必须的。
上文中结合图1至图9详细描述了根据本申请实施例所提供的数据传输系统和方法,下面结合图10至图12,介绍本申请实施例所提供的关于数据传输的装置、板卡和计算设备。
如图10所示,图10是本申请实施例提供的一种数据传输装置的结构示意图,该数据传输装置100用于上述数据传输系统中的任一加速器,该数据传输装置100包括通信单元101和处理单元102,其中,
通信单元101,用于通过第一通信链路获取第一节点内的其他加速器发送的第一数据,该第一数据包括第一节点内的其他加速器各自需要发送至第二节点内的第二加速器的数据。上述第一节点和第二节点是上述多个节点中的任意两个节点,第一加速器和第二加速器是第一通信平面内的加速器。例如第一节点内的一个或多个加速器生成了需要发送给第二节点中的第二加速器的第一数据,第一节点中的这一个或多个加速器确定第一节点中的第一加速器与第二节点中的第二加速器位于同一通信平面,第一节点中的这一个或多个加速器各自将需要发送给第二加速器的数据先通过第一通信链路发送给第一加速器。其中,第一节点中这一个或多个加速器发送的数据中均包括将数据发送给第二加速器的指示信息,例如第二加速器的标识或者第二加速器的地址等。
处理单元102,用于在第一加速器的通信单元101接收到第一节点内其他加速器发送的数据后,根据各个节点发送的数据中的指示信息,确定各个加速器发送的数据的目的加速器,即第二加速器,然后通过通信单元101将各个加速器发送给第二加速器的发送给第二加速器。
在一种可能的实现方式中,处理单元102还用于确定需要发送给第四加速器的第二数据,并确定第一节点内的第三加速器与第四加速器位于同一通信平面;通信单元101还用于通过第一通信链路发送第二数据给第一节点内的第三加速器,以使第三加速器通过第二通信链路将第二数据发送给第四加速器;其中,第四加速器是与第一加速器位于不同节点的加速器,第二数据包括将第二数据发送给第四加速器的指示信息。
在一种可能的实现方式中,通信单元101还用于接收与第一加速器位于相同通信平面内的其他加速器各自发送的数据。
上述数据传输装置100实现数据传输的具体操作可以参照上述图3至图9所描述的实施例中任一加速器执行的操作,在此不再赘述。
如图11所示,图11是本申请实施例提供的一种板卡的结构示意图,该板卡110包括多个加速器111和多个网络接口控制器(network interface controller,NIC)112,其中,多个加速器111中的部分或者全部通过第一通信链路连接,即板卡110包括上述图3至图5对应的实施例中描述的一个或多个节点。每个节点内的多个加速器之间通过上述第一通信链路连接,每个加速器111和一个NIC 112通过总线113连接,一个NIC 112能够供一个或多个加速器111用来发送或接收数据。每个加速器111对应的NIC 112用于向其他节点内的加速器111发送数据,或者接收其他节点内的加速器111发送的数据。加速器111可以是GPU、NPU、TPU或DPU等AI芯片中的任意一种。
当一个板卡110包括一个上述节点时,该板卡110能够设置于计算设备中,板卡110上通过第一通信链路的连接的加速器111能够完成上述图3至图9所对应的实施例中所描述的节点内数据交互的各项操作;也能够与计算设备内其他板卡110上的节点构建上述方法实施例中的第二通信链路,多个板卡110上的多个节点内的加速器111能够完成上述图3至图9所对应的实施例中所描述的节点间的数据交互的各项操作。当一个板卡110包括多个上述节点时,板卡110上的多个节点能够建立如上述方法实施例中所述的第二通信链路,板卡110上的任意一个节点内通过第一通信链路的连接的加速器111能够完成上述图3至图9所对应的实施例中所描述的节点内数据交互的各项操作;也能够与该板卡110上其他节点内的加速器配合完成上述图3至图9所对应的实施例中所描述的节点间的数据交互的各项操作。或者与计算设备中其他板卡110上的其他节点内的加速器配合完成上述图3至图9所对应的实施例中所描述的节点间的数据交互的各项操作。
上述板卡110实现数据传输的具体操作可以参照上述图3至图9所描述的实施例中任一节点内加速器执行的操作,在此不再赘述。
参见图12,图12是本申请实施例提供的一种计算设备的示意图,该计算设备120包括:一个或者多个处理器121、通信接口122、存储器123以及多个加速器124,所述处理器121、通信接口122、存储器123和加速器124通过总线125相互连接,其中,处理器121与加速器124之间的连接关系可以参照上述图1中的描述,多个加速器124能够构建如上述图3所描述的一个或多个节点,多个加速器124可以部署在如图11所示的一个或多个板卡110上。
处理器121执行的各种操作可参照上述图7中S701中的具体操作。任一加速器124执行 的具体操作可参照上述图3至图9所描述的实施例中加速器执行的操作,其中,处理器121和加速器124之间的关系可以参见上述图3的相关描述,在此不再赘述。
处理器121可以有多种具体实现形式,例如处理器121可以为CPU,处理器121可以是单核处理器或多核处理器。处理器121可以由CPU和硬件芯片的组合。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。处理器121也可以单独采用内置处理逻辑的逻辑器件来实现,例如FPGA或数字信号处理器(digital signal processor,DSP)等。
加速器124可以是GPU、NPU、TPU或DPU等AI芯片中的任意一种。
通信接口122可以为有线接口或无线接口,用于与其他模块或设备进行通信,有线接口可以是以太接口、局域互联网络(local interconnect network,LIN)等,无线接口可以是蜂窝网络接口或使用无线局域网接口等。
存储器123可以是非易失性存储器,例如,只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。存储器123也可以是易失性存储器,易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
存储器123也可用于存储程序代码和数据,以便于处理器121或加速器124调用存储器123中存储的程序代码执行上述方法实施例中实现数据传输的操作步骤。此外,计算设备120可能包含相比于图12展示的更多或者更少的组件,或者有不同的组件配置方式。
总线125可以是PCIe总线,或扩展工业标准结构(extended industry standard architecture,EISA)总线、统一总线(unified bus,Ubus或UB)、计算机快速链接(compute express link,CXL)、缓存一致互联协议(cache coherent interconnect for accelerators,CCIX)等。总线125可以分为地址总线、数据总线、控制总线等。总线125除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,图12中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
可选地,该计算设备120还可以包括输入/输出接口126,输入/输出接口126连接有输入/输出设备,用于接收输入的信息,输出操作结果。
具体地,上述计算设备120执行各种操作的具体实现可参照上述图2至图9所描述的方法实施例中的具体操作,在此不再赘述。
本申请实施例还提供一种数据传输系统,该系统包括一个或多个上述计算设备120,该系统中的各个计算设备120中各个加速器之间的数据交互过程可以参照上述图3至图9所描述的方法实施例中的具体操作,在此不再赘述。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在处理器上运行时,可以实现上述方法实施例中的方法步骤,所述计算机可读存储介质的处理器在执行上述方法步骤的具体实现可参照上述方法实施例图3至图9所描述的方法实施例中所示的具体操作,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载或执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质、或者半导体介质。半导体介质可以是固态硬盘(solid state drive,SSD)。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并或删减;本申请实施例系统中的模块可以根据实际需要进行划分、合并或删减。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (29)

  1. 一种数据传输系统,其特征在于,至少包括第一节点和第二节点,所述第一节点包括多个加速器,所述第一节点内的多个加速器之间通过第一通信链路连接;其中,
    所述第一节点内的第二加速器用于通过所述第一通信链路将第一数据发送给所述第一节点内的第一加速器,其中,所述第一数据是所述第一节点内的第二加速器将要发送给所述第二节点内的第一加速器的数据;
    所述第一节点内的第一加速器,用于将所述第一数据通过第二通信链路发送给所述第二节点内的第一加速器;
    其中,所述第一通信链路传输数据的速度优于所述第二通信链路传输数据的速度。
  2. 根据权利要求1所述的数据传输系统,其特征在于,所述第一节点还包括其他加速器,
    所述第一节点内的其他加速器用于通过所述第一通信链路将自己将要发送给所述第二节点的第一加速器的第一数据发送给所述第一节点内的第一加速器;
    所述第一节点内的第一加速器,具体用于将从所述第一节点内的第二加速器获取的第一数据以及从所述第一节点内的其他加速器获取的第一数据,作为第一数据集合通过第二通信链路发送给所述第二节点内的第一加速器。
  3. 根据权利要求1或2所述的数据传输系统,其特征在于,所述第一节点内的第一加速器与所述第二节点内的第一加速器位于同一个通信平面。
  4. 根据权利要求1所述的数据传输系统,其特征在于,
    所述第一节点内的第一加速器还用于通过所述第一通信链路将第二数据发送给所述第一节点的第二加速器,其中,所述第二数据是所述第一节点内的第一加速器将要发送给所述第二节点内的第二加速器的数据;
    所述第一节点内的第二加速器,还用于将所述第二数据通过所述第二通信链路发送给所述第二节点内的第二加速器;
    其中所述第一节点内的第二加速器是除所述第一节点内的第一加速器之外的任意一个加速器。
  5. 根据权利要求4所述的数据传输系统,其特征在于,
    所述第一节点内的其他加速器还用于通过所述第一通信链路将自己将要发送给所述第二节点的第二加速器的第二数据发送给所述第一节点内的第二加速器;
    所述第一节点内的第二加速器,具体用于将从所述第一节点内的第一加速器获取的第二数据以及从所述第一节点内的其他加速器获取的第二数据,作为第二数据集合通过所述第二通信链路发送给所述第二节点内的第二加速器。
  6. 根据权利要求4或5所述的数据传输系统,其特征在于,所述第一节点内的第二加速器与所述第二节点内的第二加速器位于同一个通信平面,所述第一节点内的第二加速器所位 于的通信平面不同于所述第一节点内的第一加速器所位于的通信平面。
  7. 根据权利要求1-6任一所述的数据传输系统,其特征在于,所述第一节点和所述第二节点通过模型并行(model parallelism)的方式训练神经网络模型。
  8. 根据权利要求1-7任一所述的数据传输系统,其特征在于,所述第一节点和所述第二节点位于不同的计算设备。
  9. 根据权利要求1-8任一所述的数据传输系统,其特征在于,所述加速器是GPU、NPU、TPU中的任意一种。
  10. 一种计算系统,其特征在于,包括第一计算设备和第二计算设备;所述第一计算设备包括第一节点,所述第二计算设备包括第二节点;
    所述第一节点包括多个加速器,所述第一节点内的多个加速器之间通过第一通信链路连接;
    所述第一节点内的第二加速器用于通过所述第一通信链路将第一数据发送给所述第一节点内的第一加速器,其中,所述第一数据是所述第一节点内的第二加速器将要发送给所述第二节点内的第一加速器的数据;
    所述第一节点内的第一加速器,用于将所述第一数据通过第二通信链路发送给所述第二节点内的第一加速器;
    其中,所述第一通信链路传输数据的速度优于所述第二通信链路传输数据的速度。
  11. 根据权利要求10所述的计算系统,其特征在于,所述第一节点还包括其他加速器,
    所述第一节点内的其他加速器用于通过所述第一通信链路将自己将要发送给所述第二节点的第一加速器的第一数据发送给所述第一节点内的第一加速器;
    所述第一节点内的第一加速器,具体用于将从所述第一节点内的第二加速器获取的第一数据以及从所述第一节点内的其他加速器获取的第一数据,作为第一数据集合通过第二通信链路发送给所述第二节点内的第一加速器。
  12. 根据权利要求10所述的计算系统,其特征在于,所述第一节点内的第一加速器与所述第二节点内的第一加速器位于同一个通信平面。
  13. 根据权利要求10所述的计算系统,其特征在于,
    所述第一节点内的第一加速器还用于通过所述第一通信链路将第二数据发送给所述第一节点的第二加速器,其中,所述第二数据是所述第一节点内的第一加速器将要发送给所述第二节点内的第二加速器的数据;
    所述第一节点内的第二加速器,还用于将所述第二数据通过所述第二通信链路发送给所述第二节点内的第二加速器;
    其中所述第一节点内的第二加速器是除所述第一节点内的第一加速器之外的任意一个加 速器。
  14. 根据权利要求10所述的计算系统,其特征在于,
    所述第一节点内的其他加速器还用于通过所述第一通信链路将自己将要发送给所述第二节点的第二加速器的第二数据发送给所述第一节点内的第二加速器;
    所述第一节点内的第二加速器,具体用于将从所述第一节点内的第一加速器获取的第二数据以及从所述第一节点内的其他加速器获取的第二数据,作为第二数据集合通过所述第二通信链路发送给所述第二节点内的第二加速器。
  15. 根据权利要求13或14所述的计算系统,其特征在于,所述第一节点内的第二加速器与所述第二节点内的第二加速器位于同一个通信平面,所述第一节点内的第二加速器所位于的通信平面不同于所述第一节点内的第一加速器所位于的通信平面。
  16. 根据权利要求10-15任一所述的计算系统,其特征在于,所述第一节点和所述第二节点通过模型并行(model parallelism)的方式训练神经网络模型。
  17. 根据权利要求10-16任一所述的计算系统,其特征在于,所述第一节点内的每个加速器对应有一个网卡;
    所述第一节点内的第一加速器,具体用于将所述第一数据发送给所述第一节点内的第一加速器对应的网卡,由所述对应的网卡将所述第一数据发送给所述第二节点内的第一加速器。
  18. 根据权利要求17所述的计算系统,其特征在于,所述网卡具体用于通过所述第一计算设备和所述第二计算设备之间的交换机将所述第一数据发送给所述第二节点内的第一加速器。
  19. 根据权利要求10-18任一所述的计算系统,其特征在于,所述第一计算设备还包括CPU,所述CPU用于管理所述第一节点。
  20. 一种数据传输方法,其特征在于,包括:
    第一节点内的第二加速器通过所述第一通信链路将第一数据发送给所述第一节点内的第一加速器,其中,所述第一数据是所述第一节点内的第二加速器将要发送给第二节点内的第一加速器的数据;所述第一节点包括多个加速器,所述第一节点内的多个加速器之间通过第一通信链路连接;
    所述第一节点内的第一加速器将所述第一数据通过第二通信链路发送给所述第二节点内的第一加速器;
    其中,所述第一通信链路传输数据的速度优于所述第二通信链路传输数据的速度。
  21. 根据权利要求20所述的方法,其特征在于,所述第一节点还包括其他加速器,所述方法还包括:
    所述第一节点内的其他加速器通过所述第一通信链路将自己将要发送给所述第二节点的第一加速器的第一数据发送给所述第一节点内的第一加速器;
    所述第一节点内的第一加速器将所述第一数据通过第二通信链路发送给所述第二节点内的第一加速器,具体包括将从所述第一节点内的第二加速器获取的第一数据以及从所述第一节点内的其他加速器获取的第一数据,作为第一数据集合通过第二通信链路发送给所述第二节点内的第一加速器。
  22. 根据权利要求20或21所述的方法,其特征在于,所述第一节点内的第一加速器与所述第二节点内的第一加速器位于同一个通信平面。
  23. 根据权利要求20所述的方法,其特征在于,所述方法还包括:
    所述第一节点内的第一加速器通过所述第一通信链路将第二数据发送给所述第一节点的第二加速器,其中,所述第二数据是所述第一节点内的第一加速器将要发送给所述第二节点内的第二加速器的数据;
    所述第一节点内的第二加速器将所述第二数据通过所述第二通信链路发送给所述第二节点内的第二加速器;
    其中所述第一节点内的第二加速器是除所述第一节点内的第一加速器之外的任意一个加速器。
  24. 根据权利要求23所述的方法,其特征在于,所述方法还包括:所述第一节点内的其他加速器通过所述第一通信链路将自己将要发送给所述第二节点的第二加速器的第二数据发送给所述第一节点内的第二加速器;
    所述第一节点内的第二加速器将所述第二数据通过所述第二通信链路发送给所述第二节点内的第二加速器,具体包括:
    所述第一节点内的第二加速器将从所述第一节点内的第一加速器获取的第二数据以及从所述第一节点内的其他加速器获取的第二数据,作为第二数据集合通过所述第二通信链路发送给所述第二节点内的第二加速器。
  25. 根据权利要求23或24所述的方法,其特征在于,所述第一节点内的第二加速器与所述第二节点内的第二加速器位于同一个通信平面,所述第一节点内的第二加速器所位于的通信平面不同于所述第一节点内的第一加速器所位于的通信平面。
  26. 根据权利要求20-25任一所述的方法,其特征在于,所述第一节点和所述第二节点通过模型并行(model parallelism)的方式训练神经网络模型。
  27. 根据权利要求20-26任一所述的方法,其特征在于,所述第一节点内的每个加速器对应有一个网卡;
    所述第一节点内的第一加速器将所述第一数据通过第二通信链路发送给所述第二节点内的第一加速器,具体包括:
    所述第一节点内的第一加速器将所述第一数据发送给所述第一节点内的第一加速器对应的网卡,由所述对应的网卡将所述第一数据发送给所述第二节点内的第一加速器。
  28. 根据权利要求20-27任一所述的方法,其特征在于,所述第一计算设备还包括CPU,所述方法还包括:所述CPU用于管理所述第一节点。
  29. 一种计算机可读存储介质,其特征在于,该存储介质中存储有至少一条指令,基于该指令第一节点执行权利要求20-28任一所述的方法。
PCT/CN2022/106309 2022-01-21 2022-07-18 一种数据传输系统、方法及相关设备 WO2023138009A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22921439.0A EP4293984A1 (en) 2022-01-21 2022-07-18 Data transmission system and method, and related device
US18/356,475 US20230403232A1 (en) 2022-01-21 2023-07-21 Data Transmission System and Method, and Related Device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210073931.9 2022-01-21
CN202210073931.9A CN116506359A (zh) 2022-01-21 2022-01-21 一种数据传输系统、方法及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/356,475 Continuation US20230403232A1 (en) 2022-01-21 2023-07-21 Data Transmission System and Method, and Related Device

Publications (1)

Publication Number Publication Date
WO2023138009A1 true WO2023138009A1 (zh) 2023-07-27

Family

ID=83004391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/106309 WO2023138009A1 (zh) 2022-01-21 2022-07-18 一种数据传输系统、方法及相关设备

Country Status (4)

Country Link
US (1) US20230403232A1 (zh)
EP (1) EP4293984A1 (zh)
CN (2) CN114979000B (zh)
WO (1) WO2023138009A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117955901A (zh) * 2022-10-20 2024-04-30 华为技术有限公司 通信方法、系统及服务器

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049362A (zh) * 2015-06-18 2015-11-11 西安电子科技大学 一种二维环绕网格片上网络的拓扑结构以及路由方法
US9886275B1 (en) * 2013-10-09 2018-02-06 Mellanox Technologies Ltd. Multi-core processor using three dimensional integration
CN110825689A (zh) * 2019-10-31 2020-02-21 新华三半导体技术有限公司 电子芯片的实现方法及电子芯片
CN111427835A (zh) * 2020-03-13 2020-07-17 苏州浪潮智能科技有限公司 一种基于混合路由算法的片上网络设计方法和装置
CN112148663A (zh) * 2019-06-28 2020-12-29 华为技术有限公司 一种数据交换芯片及服务器

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10423429B2 (en) * 2018-01-02 2019-09-24 International Business Machines Corporation Reconfiguring processing groups for cascading data workloads
CN109033001B (zh) * 2018-07-17 2021-08-27 北京百度网讯科技有限公司 用于分配gpu的方法和装置
US10747280B2 (en) * 2018-11-27 2020-08-18 International Business Machines Corporation Reconfigurble CPU/GPU interconnect to mitigate power/thermal throttling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886275B1 (en) * 2013-10-09 2018-02-06 Mellanox Technologies Ltd. Multi-core processor using three dimensional integration
CN105049362A (zh) * 2015-06-18 2015-11-11 西安电子科技大学 一种二维环绕网格片上网络的拓扑结构以及路由方法
CN112148663A (zh) * 2019-06-28 2020-12-29 华为技术有限公司 一种数据交换芯片及服务器
CN110825689A (zh) * 2019-10-31 2020-02-21 新华三半导体技术有限公司 电子芯片的实现方法及电子芯片
CN111427835A (zh) * 2020-03-13 2020-07-17 苏州浪潮智能科技有限公司 一种基于混合路由算法的片上网络设计方法和装置

Also Published As

Publication number Publication date
CN114979000A (zh) 2022-08-30
US20230403232A1 (en) 2023-12-14
CN114979000B (zh) 2023-06-06
EP4293984A1 (en) 2023-12-20
CN116506359A (zh) 2023-07-28

Similar Documents

Publication Publication Date Title
US11973697B2 (en) Composing diverse remote cores and FPGAs
CN110033078B (zh) 一种基于树状拓扑的计算系统及方法
CN107689948B (zh) 应用于神经网络硬件加速系统的高效数据访存管理装置
TWI803663B (zh) 一種運算裝置和運算方法
WO2017187516A1 (ja) 情報処理システムおよびその運用方法
US7505457B2 (en) Method and apparatus for providing an interconnection network function
CN103092807A (zh) 节点控制器、并行计算服务器系统以及路由方法
WO2023138009A1 (zh) 一种数据传输系统、方法及相关设备
WO2023040197A1 (zh) 一种跨节点通信方法、装置、设备及可读存储介质
CN111262917A (zh) 一种基于fpga云平台的远端数据搬移装置和方法
WO2023207035A1 (zh) 一种数据同步方法、装置、设备及存储介质
CN111193971B (zh) 一种面向机器学习的分布式计算互连网络系统及通信方法
CN112929183B (zh) 智能网卡、报文传输方法、装置、设备及存储介质
CN106776014A (zh) 异构计算中的并行加速方法及系统
CN116541338B (zh) 一种计算系统、模型训练方法、装置及产品
EP3822776A1 (en) System and method for transaction broadcast in a network-on-chip
CN117061365A (zh) 一种节点选择方法、装置、设备及可读存储介质
WO2021213076A1 (zh) 基于多处理节点来构建通信拓扑结构的方法和设备
WO2021213075A1 (zh) 一种基于多处理节点来进行节点间通信的方法和设备
WO2021036404A1 (zh) 数据传输方法及相关设备
US11614946B2 (en) Networked computer
CN114844757B (zh) 一种面向分布式并行运算类算法的片上网络设计方法
US11922237B1 (en) Single-step collective operations
WO2023134590A1 (zh) 一种聚合通信方法及装置
CN114095289B (zh) 数据多播电路、方法、电子设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921439

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022921439

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022921439

Country of ref document: EP

Effective date: 20230915