WO2024082670A1 - Communication method, system and server
Communication method, system and server
- Publication number
- WO2024082670A1
- Application number: PCT/CN2023/101734
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- accelerator
- accelerators
- node
- identifier
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/24—Multipath
- H04L45/245—Link aggregation, e.g. trunking
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Definitions
- the present application relates to the field of communication technology, and in particular to a communication method, system and server.
- each accelerator stores part of the model, and aggregate communication is required to complete data exchange during model training.
- the accelerator communicates with the accelerators of other nodes through the network card.
- the data that needs to be sent to the accelerators of other nodes communicating with the accelerator is aggregated to the accelerator through the interconnected links between the multiple accelerators in the node.
- the aggregated data is sent to the accelerators of other nodes communicating with the accelerator through its own network card.
- the communication time of the above-mentioned aggregate communication method depends on the last accelerator to complete its transmission. Since the amounts of data transmitted through the network by different accelerators may differ, the communication time may increase.
- the embodiments of the present application provide a communication method, system and server, which can evenly distribute the data within a node that is to be sent to other nodes through the network across the different accelerators within the node, thereby preventing any one accelerator in the node from carrying an excessive communication volume that causes other nodes to wait, and thus ensuring network communication efficiency.
- an embodiment of the present application provides a communication method, which is applied to a first node, wherein m first accelerators interconnected in the first node correspond one-to-one to m second accelerators interconnected in the second node, and each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed by itself, and multiple copies of data stored in the m first accelerators are each marked with a final accelerator identifier, where m is a positive integer greater than or equal to 2, and the method includes:
- the m first accelerators adjust the distribution of the first data in the m first accelerators through their interconnected links, so that the difference in the amount of data in the first data possessed by each accelerator after the adjustment is less than or equal to a preset threshold; wherein the first data is data in the multiple copies of data marked with final accelerator identifiers indicating the m second accelerators respectively;
- the m first accelerators each send the adjusted data in the first data to the corresponding second accelerator through the network card configured by itself, so that the m second accelerators adjust the distribution of the received first data in the m second accelerators through their own interconnected links, so that the m second accelerators have the data marked with the final accelerator identifier in the first data indicating themselves.
- in this way, the data within a node that is to be sent to other nodes through the network can be evenly distributed among the different accelerators within the node, avoiding the situation where the communication volume of a certain accelerator in the node is too large and causes other nodes to wait, thereby ensuring the efficiency of network communication. Subsequently, the node changes the data distribution through the interconnection links between its internal m accelerators, so that each accelerator in the node obtains the data that other nodes need to send to it.
- the method also includes: each of the m first accelerators receives second data sent by the corresponding second accelerator through its own network card; wherein the second data includes data that is not in the first accelerator indicated by the marked final accelerator identifier; and the m first accelerators transmit the data that is not in the first accelerator indicated by the marked final accelerator identifier to the indicated first accelerator through their own interconnected links.
- the distribution of data among the m accelerators in the node can be changed based on the interconnection channels between the internal m accelerators, so that each accelerator in the node obtains the data that other nodes need to send to itself.
- the multiple final accelerator identifiers marked on the multiple copies of data include a final accelerator identifier indicating a first accelerator.
- the method further includes: the m first accelerators each transmit data that is not in the first accelerator indicated by the marked final accelerator identifier to the indicated first accelerator through their own interconnected links.
- data whose marked final accelerator identifier indicates an accelerator within the node is transmitted to that accelerator, so that each accelerator in the node obtains the data that the other accelerators in the node need to send to it.
- the method further includes: determining a first distribution strategy of the first data in the m first accelerators based on the data volume of each of the multiple copies of data stored in the m first accelerators and the marked final accelerator identifiers; the m first accelerators then adjust the distribution of the first data in the m first accelerators according to the first distribution strategy through their own interconnected links.
- determining a first distribution strategy for the first data in the m first accelerators includes: determining a first outbound data volume corresponding to each of the m first accelerators; wherein the first outbound data volume indicates the sum of the data volumes of data in the first data stored in the corresponding first accelerator; based on the first outbound data volume corresponding to each of the m first accelerators, the data volume of each of the multiple copies of data stored in the m first accelerators, and the marked final accelerator identifier, determining the first distribution strategy for the first data in the m first accelerators.
- determining a first distribution strategy for the first data in the m first accelerators includes: determining the second outbound data volume corresponding to each of the m first accelerators based on the data volume of each of the multiple copies of data stored in the m first accelerators and the marked final accelerator identifier; wherein the second outbound data volume indicates the data volume of data whose final accelerator identifier is marked as the communication accelerator identifier in the first data, and the communication accelerator identifier indicates the second accelerator with which the corresponding first accelerator communicates through a network card; based on the second outbound data volume corresponding to each of the m first accelerators, the data volume of each of the multiple copies of data stored in the m first accelerators and the marked final accelerator identifier, determining the first distribution strategy for the first data in the m first accelerators.
- the first distribution strategy indicates a communication status between data in the first data stored in each of the m first accelerators and other first accelerators.
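The strategy determination described above can be pictured with a small sketch. The function below is a hypothetical illustration only: the greedy pairing of surplus and deficit accelerators is an assumption, not the algorithm prescribed by the present application. It shows how per-accelerator outbound volumes might be evened out into a transfer plan over intra-node links:

```python
def first_distribution_strategy(outbound):
    """outbound: {accelerator_id: volume of first data it currently holds}.
    Returns a list of (src, dst, amount) transfers over intra-node links
    that brings every accelerator to roughly the same outbound volume."""
    m = len(outbound)
    total = sum(outbound.values())
    target = total // m  # integer target; any remainder stays where it is
    surplus = {a: v - target for a, v in outbound.items()}
    givers = [(a, s) for a, s in surplus.items() if s > 0]
    takers = [(a, -s) for a, s in surplus.items() if s < 0]
    plan = []
    gi = ti = 0
    while gi < len(givers) and ti < len(takers):
        g, gs = givers[gi]
        t, ts = takers[ti]
        amount = min(gs, ts)          # move as much as both sides allow
        plan.append((g, t, amount))
        gs -= amount
        ts -= amount
        if gs == 0:
            gi += 1
        else:
            givers[gi] = (g, gs)
        if ts == 0:
            ti += 1
        else:
            takers[ti] = (t, ts)
    return plan
```

For example, with volumes {D00: 10, D01: 2, D02: 6, D03: 6} the target is 6 per accelerator, so the plan moves 4 units from D00 to D01 over the intra-node link.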
- an embodiment of the present application provides a communication system, comprising a first node and a second node, wherein m first accelerators interconnected in the first node correspond one-to-one to m second accelerators interconnected in the second node, each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed by itself, and multiple copies of data stored in the m first accelerators are each marked with a final accelerator identifier, m is a positive integer greater than or equal to 2, and the first node is used to execute the method of the first aspect.
- an embodiment of the present application provides a server, comprising: at least one memory for storing a program, and m interconnected first accelerators.
- the m interconnected first accelerators correspond one-to-one to the m interconnected second accelerators in the second node, each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed by itself, the multiple copies of data stored in the m first accelerators are each marked with a final accelerator identifier, the m first accelerators execute at least one program stored in a memory, and implement the method of the first aspect, where m is a positive integer greater than or equal to 2.
- the server includes a processor, and the processor is used to determine a first distribution strategy of the first data in the m first accelerators based on the data volume of each of the multiple copies of data stored in the m first accelerators and the marked final accelerator identifier; correspondingly, the m first accelerators adjust the distribution of the first data in the m first accelerators according to the first distribution strategy through their own interconnected links.
- an embodiment of the present application provides a computer storage medium, in which instructions are stored. When the instructions are executed on a computer, the computer executes the method provided in the first aspect.
- an embodiment of the present application provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to execute the method provided in the first aspect.
- FIG. 1 is a system architecture diagram of a communication system provided in an embodiment of the present application.
- FIG. 2a is a first structural diagram of an electronic device provided in an embodiment of the present application.
- FIG. 2b is a second structural diagram of an electronic device provided in an embodiment of the present application.
- FIG. 2c is a third structural diagram of an electronic device provided in an embodiment of the present application.
- FIG. 3 is a first schematic diagram of an application scenario provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of an existing communication solution in the application scenario shown in FIG. 3.
- FIG. 5a is a first schematic diagram of a data transmission path provided by an embodiment of the present application in the application scenario shown in FIG. 3.
- FIG. 5b is a second schematic diagram of a data transmission path provided by an embodiment of the present application in the application scenario shown in FIG. 3.
- FIG. 6 is a flow chart of a communication method according to an embodiment of the present application.
- FIG. 7a is a first flow chart of step 620 in FIG. 6.
- FIG. 7b is a second flow chart of step 620 in FIG. 6.
- FIG. 8a is a second schematic diagram of an application scenario provided in an embodiment of the present application.
- FIG. 8b is a flow chart of a communication method provided by an embodiment of the present application in the application scenario shown in FIG. 8a.
- FIG. 9a is a third schematic diagram of an application scenario provided in an embodiment of the present application.
- FIG. 9b is a schematic diagram of the in-node accelerator identifiers with which data is tagged in the application scenario shown in FIG. 9a.
- FIG. 9c is a schematic diagram of data transmission within a node in the application scenario shown in FIG. 9a.
- FIG. 9d is a schematic diagram of data transmission between nodes in the application scenario shown in FIG. 9c.
- FIG. 9e is a schematic diagram of data after transmission within a node in the application scenario shown in FIG. 9d.
- words such as “exemplary”, “for example” or “for example” are used to indicate examples, illustrations or descriptions. Any embodiment or design described as “exemplary”, “for example” or “for example” in the embodiments of the present application should not be interpreted as being more preferred or more advantageous than other embodiments or designs. Specifically, the use of words such as “exemplary”, “for example” or “for example” is intended to present related concepts in a concrete way.
- the term "and/or" merely describes an association relationship between associated objects, indicating that three relationships may exist.
- for example, A and/or B may represent: A exists alone, B exists alone, or A and B exist at the same time.
- the term “multiple” means two or more.
- multiple systems refers to two or more systems.
- multiple terminals refers to two or more terminals.
- the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Therefore, the features defined as "first" and "second" may explicitly or implicitly include one or more of those features.
- the terms “include”, “comprises”, “has” and their variations all mean “including but not limited to”, unless otherwise specifically emphasized.
- the network may be a wired network or a wireless network.
- the wired network may be a cable network, an optical fiber network, a digital data network (DDN), etc.
- the wireless network may be a telecommunication network, an internal network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), an InfiniBand (IB) network, an RDMA over converged Ethernet (RoCE) network, etc., or any combination thereof.
- the above-mentioned network communication protocol can be various wired or wireless communication protocols, such as Ethernet, universal serial bus (USB), Firewire, global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), new radio (NR), Bluetooth, wireless fidelity (Wi-Fi) and other communication protocols.
- FIG. 2a to FIG. 2c are schematic diagrams of the structure of an electronic device provided in an embodiment of the present application.
- an electronic device may include a single node or multiple nodes.
- for example, an electronic device includes one node, node0.
- a server includes two nodes, node0 and node1.
- the electronic device involved in this solution can be a physical device such as a server or a computer.
- Exemplary embodiments of the electronic device involved in this solution include but are not limited to electronic devices equipped with iOS, Android, Windows, Harmony OS or other operating systems. The embodiment of this application does not specifically limit the type of electronic device.
- node0 includes a processor, such as a CPU (central processing unit), and m accelerators (devices), which can be referred to as accelerators D.
- Multiple accelerators D are interconnected and communicate through interconnected links.
- Each accelerator D among the m accelerators D is provided with a network card, and communicates with other nodes through the network card (see the above description, which will not be repeated here).
- accelerator D is a device for model training or model calculation, and the embodiment of the present invention is not intended to limit the type and structure of the model.
- Accelerator D may include one or more chips, such as a GPU (graphics processing unit), an MIC (many integrated core) processor, or an NPU (neural-network processing unit).
- the number of m accelerators D can be 2 or more.
- FIG. 2a shows two interconnected accelerators D
- FIG. 2b shows four interconnected accelerators D.
- the network card of each accelerator D is connected to a switch, and the multiple accelerators D (devices) achieve network communication through the switches to which their network cards are connected.
- the number of accelerators D in each of the N nodes is the same, namely m. From the current point of view, this is the mainstream architecture, and the embodiments of the present application mainly take the case where the N nodes have the same number of accelerators D to illustrate the method provided herein; however, since architectures in which the N nodes have different numbers of accelerators D may appear in the future, the present application does not limit the specific relationship between the numbers of accelerators D in the N nodes.
- any accelerator D in any of the N nodes communicates through its network card with N-1 accelerators D of the other N-1 nodes, each located in a different node.
- each node stores data that needs to be sent to the N*m accelerators D of the N nodes.
- that is, each node stores N*m copies of data, each of which is marked with the identifier of the accelerator D to which it is to be sent.
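As a hedged illustration of this marking, the copies held by one node could be modeled as follows. The `DataCopy` class and the field name `tDevID` (borrowed from the final accelerator identifier introduced later in the description) are assumptions of this sketch, not structures defined by the present application:

```python
from dataclasses import dataclass

@dataclass
class DataCopy:
    size: int      # data volume, e.g. in bytes
    tDevID: str    # identifier of the accelerator D this copy must finally reach

# A node with m = 2 accelerators in a system of N = 2 nodes: each
# accelerator holds one copy per destination accelerator (N * m = 4 copies).
store = {
    "D00": [DataCopy(8, "D00"), DataCopy(4, "D01"),
            DataCopy(6, "D10"), DataCopy(2, "D11")],
    "D01": [DataCopy(3, "D00"), DataCopy(5, "D01"),
            DataCopy(7, "D10"), DataCopy(9, "D11")],
}
```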
- Figure 3 is a schematic diagram of an application scenario provided by an embodiment of the present application. As shown in FIG. 3, there are two nodes, node0 and node1. node0 includes four accelerators D, respectively denoted D00, D01, D02, and D03.
- node1 includes four accelerators D, respectively denoted D10, D11, D12, and D13.
- the accelerator D00 in node0 stores data a0, b0, c0, and d0 that need to be sent to accelerators D10, D11, D12, and D13 in node1, as well as data that needs to be sent to accelerators D00, D01, D02, and D03 in node0 (not shown in the figure).
- for accelerator D00 in node0, the accelerator with which it communicates through its network card is D10, expressed as NC: D10; accelerators D01, D02, and D03 are similar and will not be described in detail. For accelerator D10 in node1, which stores data that needs to be sent to accelerators D00, D01, D02, and D03 in node0 (not shown in the figure) as well as data that needs to be sent to accelerators D10, D11, D12, and D13 in node1 (not shown in the figure), the accelerator with which it communicates through its network card is D00, expressed as NC: D00; accelerators D11, D12, and D13 are similar and will not be described in detail.
- Step 1: for any accelerator D of the N nodes, the data that needs to be sent to the N-1 accelerators D communicating with it through the network card is aggregated onto this accelerator D through the interconnected links between the m accelerators D of the node where it is located. Step 2: for any accelerator D of the N nodes, the data aggregated in step 1 is sent through its own network card to the accelerators D of the other nodes to which the data needs to be sent.
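The two steps above can be simulated with a minimal sketch (illustrative names; the volume dictionaries stand in for the actual data). It makes the drawback visible: each network card carries the full aggregate addressed to its paired remote accelerator, however unevenly that aggregate is distributed among the sources:

```python
def conventional_all_to_all(node_data, nic_pairing):
    """node_data: {local_accel: {remote_accel: volume}} for one node;
    nic_pairing: {local_accel: remote_accel it reaches via its own NIC}.
    Returns the volume each NIC transmits; the maximum of these values
    bounds the completion time of the whole exchange."""
    nic_volume = {}
    for local, remote in nic_pairing.items():
        # step 1: aggregate everything destined for `remote` onto `local`
        aggregated = sum(d.get(remote, 0) for d in node_data.values())
        # step 2: send the whole aggregate through local's network card
        nic_volume[local] = aggregated
    return nic_volume
```

For instance, if most data in a node is addressed to D10, the network card of the accelerator paired with D10 carries nearly all of the traffic while the other network cards sit largely idle.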
- FIG. 4 is an example diagram of the transmission path of the data stored in node0 and sent to accelerator D10 in the application scenario shown in Figure 3. It should be pointed out that the transmission paths of the data sent to accelerators D01, D02, D03, D11, D12, and D13 in Figure 3 are similar; the only differences are the communication objects and the data sent.
- Step 1: since accelerator D00 in node0 communicates with accelerator D10 through its network card, the data to be sent to D10 is aggregated onto accelerator D00.
- At this time, accelerator D00 holds the data a0, a1, a2, a3 destined for D10 that were stored across the four accelerators D00, D01, D02, and D03 in node0.
- Accelerators D01, D02, D03, D10, D11, D12, and D13 are similar (not shown in the figure), and the only difference is that the communication objects are different and the stored data are different, which will not be repeated.
- Step 2: accelerator D00 in node0 sends the stored data a0, a1, a2, a3 that need to be sent to D10 to accelerator D10 through its network card, so that accelerator D10 obtains a0, a1, a2, a3.
- the accelerators D01, D02, D03, D10, D11, D12, and D13 are similar (not shown in the figure). The only difference is that the communication objects are different and the data sent are different, which will not be repeated.
- the completion time of the communication depends on the last accelerator D to complete the transmission. Since the amount of data transmitted by different accelerators D through the network card is different, there may be a large number of nodes waiting for the last node to complete the network transmission, resulting in a large amount of bandwidth and computing power being wasted.
- an embodiment of the present invention proposes a load-balanced communication method.
- the data that needs to be sent to another node through the network cards is evenly distributed among the m interconnected accelerators D in the node, so that the amount of data each accelerator D in the node sends to the other node through its network card is similar or the same, thereby preventing an accelerator D in the node from carrying an excessive communication volume that causes other nodes to wait.
- Step 1: the m accelerators D of each of the N nodes transmit data over their interconnected links, so that the amount of data each accelerator D in a node sends through its network card to any node outside its own node is similar or the same;
- Step 2: the N*m accelerators D of the N nodes communicate with each other through their network cards;
- Step 3: the m accelerators D in some or all of the N nodes transmit data over their interconnected links, so that each of the N*m accelerators D obtains the data that the N-1 nodes outside its own node need to send to it.
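Under the three steps above, the per-network-card load no longer depends on which accelerator originally held the data. A hedged sketch of the resulting step-1 load follows (an illustration under the assumption that volumes can be split at unit granularity, not the exact procedure of the present application):

```python
def balanced_nic_volumes(node_data, m):
    """node_data: {local_accel: {remote_accel: volume}} for one node.
    Returns the per-NIC volume after the intra-node rebalancing of step 1:
    the node's total outbound volume spread as evenly as possible over m NICs."""
    total = sum(v for dests in node_data.values() for v in dests.values())
    base, extra = divmod(total, m)
    # the first `extra` accelerators carry one extra unit each
    return [base + (1 if i < extra else 0) for i in range(m)]
```

Compared with the conventional scheme, the maximum NIC load drops from "largest per-destination aggregate" to roughly total/m, which is what removes the waiting described earlier.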
- FIG. 5a and FIG. 5b are schematic diagrams of examples of the transmission paths of the data stored in node0 and sent to accelerator D10 in the application scenario shown in FIG. 3.
- FIG. 5a is a first example diagram of a transmission path of the data stored in node0 and sent to accelerator D10 in the scenario shown in FIG. 3.
- the stored data a1 sent to accelerator D10 is divided into two parts a11 and a12. a11 needs to be sent to accelerator D00, and a12 needs to be sent to accelerator D01.
- Accelerator D00 communicates with D10 through the network card
- D01 communicates with D11 through the network card
- D02 communicates with D12 through the network card
- D03 communicates with D13 through the network card.
- the specific process of data transmission is as follows:
- Step 1: for accelerator D00 in node0, the data a11, a2, and a3 that need to be sent to D10 through the network card are gathered onto accelerator D00. At this time, accelerator D00 holds the data a0, a11, a2, and a3 destined for D10 that were stored across the four accelerators D00, D01, D02, and D03 in node0.
- Step 2: accelerator D00 in node0 sends the stored data a0, a11, a2, and a3 that need to be sent to D10 to accelerator D10 through its network card.
- Meanwhile, accelerator D01 in node0 sends the stored data a12 that needs to be sent to D10 to accelerator D11 through its network card.
- Step 3: accelerator D11 in node1 sends the stored data a12 that needs to be sent to D10 to accelerator D10 through the interconnected links, so that accelerator D10 in node1 holds all the data a0, a1, a2, and a3 in node0 that need to be sent to D10.
- FIG. 5b is a second example diagram of the transmission path of the data stored in node0 and sent to accelerator D10 in the scenario shown in FIG. 3.
- the accelerator D00 communicates with D10 through the network card
- D01 communicates with D11 through the network card
- D02 communicates with D12 through the network card
- D03 communicates with D13 through the network card.
- Step 1: for accelerator D00 in node0, accelerators D01, D02, and D03 communicate with accelerator D00 over the interconnected links and gather onto accelerator D00 the data a1, a2, a3 that need to be sent to D10 and the data c1 that needs to be sent to D12.
- Step 2: accelerator D00 in node0 sends the stored data a0, a1, a2, a3 that need to be sent to D10, together with the data c1 that needs to be sent to D12, to accelerator D10 through its network card.
- Step 3: accelerator D10 in node1 now stores the data a0, a1, a2, a3 from node0 that need to be sent to D10 and the data c1 that needs to be sent to D12; it then sends the stored data c1 to accelerator D12 through the interconnected link.
- the communication method of each node is the same.
- the following takes a node as an example for description.
- this node can be recorded as node1, and node1 can also be called the first node.
- the communication method between node1 and any node other than node1 is the same.
- the following takes the communication between node1 and one node other than itself as an example for description.
- for the convenience of description and distinction, the node other than node1 is called node2, and node2 can also be called the second node.
- the communication method can be executed by the CPU in node1, or by each accelerator D in node1; at present, the mainstream execution subject is the CPU.
- accordingly, the embodiments of the present application mainly take the CPU as the execution subject to illustrate the method provided by the embodiments of the present application.
- the accelerator D in node1 is called accelerator D1
- the accelerator D of node2 is called accelerator D2.
- the m accelerators D1 in node1 store multiple copies of data, each of which is marked with information, and the marked information at least includes the identifier of the accelerator D that the data ultimately needs to reach (for the convenience of description and distinction, it is called the final accelerator identifier, which can be expressed as tDevID).
- the final accelerator identifier tDevID indicates the accelerator D that the data ultimately needs to reach.
- the multiple final accelerator identifiers tDevID of the multiple copies of data include m final accelerator identifiers tDevID that respectively indicate the m accelerators D2. In actual applications, multiple copies of data can correspond to the same final accelerator identifier tDevID.
- accelerator D1 can be recorded as D1, and the m accelerators D1 are recorded as D11, D12, ..., D1m; accelerator D2 can be recorded as D2, and the m accelerators D2 are recorded as D21, D22, ..., D2m.
- FIG. 6 is a flow chart of the communication method between nodes provided by an embodiment of the present application. As shown in FIG. 6, the method specifically includes the following steps.
- Step 610 The m accelerators D1 send the respective data amounts of the multiple copies of data stored in themselves and the marked final accelerator identifier tDevID to the processor.
- Step 620 The processor determines a first distribution strategy for first data in the m accelerators D1 based on the data volume of each of the multiple copies of data stored in the m accelerators D1 and the marked final accelerator identifiers tDevID; wherein the first data is data in the multiple copies of data marked with final accelerator identifiers tDevID indicating the m accelerators D2 respectively.
- the first distribution strategy indicates the distribution of the first data in the m accelerators D1.
- the first data is all the data in the multiple copies of data stored in the m accelerators D1 marked with the m final accelerator identifiers tDevID of the m accelerators D2, specifically indicating all the data sent to the m accelerators D2 in node2.
- node1 distributes all the data to be sent to node2, that is, the first data, evenly among the m accelerators D1 based on the data volume of each of the multiple copies of data stored in the m accelerators D1 and the final accelerator identifier tDevID of the accelerator D to which the data is finally sent.
- that is, the processor plans how the first data to be sent to node2 should be distributed over the m accelerators D1, thereby obtaining the first distribution strategy.
- Step 630: The m accelerators D1 adjust the distribution of the first data among themselves, according to the first distribution strategy, through their interconnected links, so that after the adjustment the difference between the amounts of first data held by any two accelerators is less than or equal to a preset threshold.
- the distribution of the first data in the m accelerators D1 is adjusted according to the first distribution strategy so that, after adjustment, the differences between the amounts of first data in the m accelerators D1 are small; for example, the amounts are the same or similar.
- suppose the data stored in the m accelerators D1 of node1 and marked with the m final accelerator identifiers tDevID of the m accelerators D2 has data volumes a1, a2, ..., am; then after adjustment the data volume of the first data in each of the m accelerators D1 is equal or close to (a1+a2+...+am)/m.
- Step 641: Each of the m accelerators D1 sends its share of the adjusted first data to the corresponding accelerator D2 through its own network card.
- each of the m accelerators D1 is marked with a communication accelerator identifier, which can be recorded as cDevID; the communication accelerator identifier cDevID indicates the accelerator D2 with which that accelerator D1 communicates through its network card. Therefore, each of the m accelerators D1 sends the data in the adjusted first data to the accelerator D2 indicated by its communication accelerator identifier cDevID through its own network card. In practical applications, the accelerator D1 can mark the data in the adjusted first data with the communication accelerator identifier cDevID.
- the communication accelerator identifier cDevIDi of the accelerator D1i is D2i.
- Step 651: The m accelerators D2 adjust the distribution of the received first data among themselves through their interconnected links, so that each of the m accelerators D2 holds the data in the first data whose marked final accelerator identifier tDevID indicates itself.
- based on each piece of data sent by the m accelerators D1 and the final accelerator identifier tDevID marked on it, the m accelerators D2 in node2 transmit, through their interconnected links, any data that is not yet on the accelerator D2 indicated by its marked final accelerator identifier tDevID to the indicated accelerator D2. This adjusts the distribution of the first data among the m accelerators D2 so that, after adjustment, each of the m accelerators D2 stores the data in the first data marked with its own final accelerator identifier tDevID.
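The intra-node adjustment of step 651 can be sketched as follows. This is a minimal Python illustration, not part of the patent: the accelerator names, payloads, and the dict-based model of the interconnect are all assumptions.

```python
# Illustrative sketch: redistribute received data inside node2 so that each
# accelerator D2 ends up holding exactly the items whose final accelerator
# identifier tDevID names it. Names and payloads are hypothetical.
def redistribute_in_node2(received):
    """received: dict mapping accelerator id -> list of (tDevID, payload)."""
    final = {dev: [] for dev in received}
    for dev, items in received.items():
        for tdev, payload in items:
            # items already on the right accelerator stay; others move over
            # the intra-node interconnect (modeled here as a list append)
            final[tdev].append(payload)
    return final

received = {
    "D20": [("D20", "x0"), ("D21", "x1")],
    "D21": [("D20", "x2"), ("D21", "x3")],
}
print(redistribute_in_node2(received))
# {'D20': ['x0', 'x2'], 'D21': ['x1', 'x3']}
```

After the call, every payload sits under the accelerator its tDevID named, regardless of which accelerator received it from node1.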
- the m accelerators D1 transmit the data in the first data adjusted in step 630, together with its corresponding data identifier (used to distinguish different pieces of data, for example numbers and/or letters), to the communicating accelerator D2 through their own network cards.
- each of the m accelerators D1 obtains information indicating the correspondence between data identifiers dataID and final accelerator identifiers tDevID (for the convenience of description and distinction, called correspondence information) and sends it to the corresponding accelerator D2, so that the corresponding accelerator D2 determines, based on the correspondence information, the final accelerator identifier tDevID marked on the data sent by the corresponding accelerator D1.
- alternatively, the m accelerators D1 transmit the data in the first data adjusted in step 630, together with its corresponding final accelerator identifier tDevID, to the accelerator D2 each communicates with through its own network card, so that each accelerator D2 in node2 receives the data sent by the corresponding accelerator D1 along with its marked final accelerator identifier tDevID.
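The two delivery variants above (tDevID sent inline, versus a separate dataID-to-tDevID correspondence sent alongside the payloads) can be illustrated with a small sketch; the wire format and all names here are hypothetical, not from the patent.

```python
# Sketch of the correspondence-information variant: the receiver gets
# (dataID, payload) pairs plus a separate dataID -> tDevID mapping, and
# resolves each payload's final accelerator from that mapping.
def resolve_tdevid(pairs, correspondence):
    """pairs: list of (dataID, payload); correspondence: dict dataID -> tDevID.
    Returns (payload, tDevID) tuples, as if tDevID had been sent inline."""
    return [(payload, correspondence[data_id]) for data_id, payload in pairs]

pairs = [("d1", b"chunk-a"), ("d2", b"chunk-b")]
correspondence = {"d1": "D21", "d2": "D20"}
print(resolve_tdevid(pairs, correspondence))
# [(b'chunk-a', 'D21'), (b'chunk-b', 'D20')]
```

Either variant leaves the receiver with the same information: each payload paired with the final accelerator identifier it must reach.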
- in this way, the data that node1 needs to send to node2 through the network cards is evenly distributed among the m interconnected accelerators D1 in node1, so that the amounts of data each accelerator D1 in node1 sends to node2 through its network card are similar or the same, avoiding the situation where the communication volume of one accelerator D1 in node1 is so large that other nodes must wait.
- node2 changes the data distribution through the interconnected channels between the m internal accelerators D2, so that each accelerator D2 in node2 obtains the data that node1 needs to send to itself.
- step 630 may also include the following content:
- the m accelerators D1 send the stored data to an accelerator D1 other than themselves indicated by their marked final accelerator identifier tDevID through their own interconnected links.
- based on the embodiment shown in FIG. 6, in the embodiment of the present invention, while step 641 is executed, at least the following step is also performed:
- Step 642: Each of the m accelerators D2 sends second data to the corresponding accelerator D1 through its own network card; wherein the second data includes the data, stored in the accelerators D2, whose marked final accelerator identifier tDevID indicates an accelerator D1.
- the second data can be understood as the data in the m accelerators D2, processed by node2 according to the aforementioned steps 610 to 630, that is marked with final accelerator identifiers tDevID indicating accelerators D1.
- the accelerator D1 may simultaneously receive the second data and the final accelerator identifier tDevID marked with the second data.
- the accelerator D1 may first receive the second data and its corresponding data identifier dataID, and then receive the correspondence information between the data identifier dataID and the final accelerator identifier tDevID; based on this correspondence information, the final accelerator identifier tDevID corresponding to each piece of received second data can be obtained.
- in the embodiment of the present invention, while step 651 is executed, at least the following step is also performed:
- Step 652: Through their interconnected links, the m accelerators D1 transmit any received data that is not yet on the accelerator D1 indicated by its marked final accelerator identifier tDevID to the indicated accelerator D1.
- the first distribution strategy includes target adjustment strategies (denoted as SFDIS) corresponding to each of the m accelerators D1.
- the target adjustment strategy SFDISi corresponding to the accelerator D1i indicates the distribution, over the m accelerators D1, of each piece of data in the first data stored in the accelerator D1i (referred to as target data for the convenience of description and distinction).
- the target adjustment strategy may include multiple arrays (referred to as target arrays for the convenience of description and distinction). Each target array is represented as [dataID, nDevID, fdatasize], where dataID is the data identifier of target data stored in D1i, nDevID identifies the accelerator D1 to which it needs to be sent (referred to as the intra-node accelerator identifier for the convenience of description and distinction), and fdatasize is the amount of data to be sent (referred to as the target internal transfer data volume for the convenience of description and distinction).
- [dataID, nDevID, fdatasize] indicates that the amount of data in the target data indicated by dataID that needs to be sent to the accelerator D1 indicated by nDevID is fdatasize.
- a data identifier dataID can correspond to multiple arrays, and nDevID in each array represents a different intra-node accelerator identifier.
- the accelerator D1 indicated by the intra-node accelerator identifier nDevID in the target adjustment strategy can be the accelerator D1i itself, in which case that part of the first data stored in the accelerator D1i does not need to be sent to any accelerator D1 other than D1i.
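For illustration only (all identifiers and sizes are made up, not from the patent), a target adjustment strategy might be modeled as a list of such triples; note that the same dataID may appear in more than one triple, each naming a different intra-node accelerator:

```python
# Hypothetical target adjustment strategy SFDIS for one accelerator:
# a list of [dataID, nDevID, fdatasize] triples.
target_adjustment = [
    ["d7", "D11", 64],   # send 64 units of data d7 to accelerator D11
    ["d7", "D12", 32],   # send another 32 units of d7 to accelerator D12
    ["d9", "D13", 128],  # send 128 units of data d9 to accelerator D13
]

# total amount of data d7 moved off this accelerator
moved_d7 = sum(f for d, n, f in target_adjustment if d == "d7")
print(moved_d7)  # 96
```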
- the embodiment of the present invention provides two implementations of step 620 in FIG. 6 .
- FIG. 7a is a flowchart of step 620 in FIG. 6. As shown in FIG. 7a, based on the embodiment shown in FIG. 6, in the embodiment of the present invention, step 620 may specifically include the following steps:
- Step 6211: The processor determines the first outbound data volume corresponding to each of the m accelerators D1 based on the data volume of each of the multiple copies of data stored in the m accelerators D1 and the marked final accelerator identifiers tDevID; wherein the first outbound data volume indicates the sum of the data volumes of the first data stored in the corresponding accelerator D1.
- where i = 1, 2, ..., m.
- the first outbound data amount of D00 is a0+b0+c0+d0.
- the first outbound data amount of D01 is a1+b1+c1+d1.
- the first outbound data amount of D02 is a2+b2+c2+d2.
- the first outbound data amount of D03 is a3+b3+c3+d3.
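The first outbound data volumes of this example are simple row sums. The following sketch assumes a tabular layout and made-up numbers purely for illustration:

```python
# A minimal sketch (assumed layout, not from the patent): each row gives the
# data volumes one accelerator D1 stores for the m final accelerators D2; the
# first outbound data volume is that row's sum.
volumes = {                      # accelerator -> volumes a, b, c, d (text's notation)
    "D00": [3, 1, 4, 1],         # a0, b0, c0, d0
    "D01": [5, 9, 2, 6],         # a1, b1, c1, d1
    "D02": [5, 3, 5, 8],         # a2, b2, c2, d2
    "D03": [9, 7, 9, 3],         # a3, b3, c3, d3
}
first_outbound = {dev: sum(row) for dev, row in volumes.items()}
print(first_outbound)  # {'D00': 9, 'D01': 22, 'D02': 21, 'D03': 28}
```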
- Step 6212: The processor determines a first distribution strategy for the first data in the m accelerators D1 based on the first outbound data volumes corresponding to the m accelerators D1, the data volumes of the multiple copies of data, and the marked final accelerator identifiers tDevID.
- the processor determines the inter-card adjustment strategy SSDIS corresponding to each of the m accelerators D1 based on the first outbound data volume corresponding to each of the m accelerators D1.
- the inter-card adjustment strategy SSDISi corresponding to the accelerator D1i indicates the amount of data in the first data stored in the accelerator D1i that the accelerator D1i needs to send to other accelerators D1 other than itself.
- the inter-card adjustment strategy SSDISi includes several arrays (referred to as decision arrays for convenience and distinction), and each decision array is represented by [nDevID, sdatasize].
- nDevID represents the intra-node accelerator identifier, used to indicate an accelerator D1 other than the accelerator D1i.
- sdatasize represents the amount of data that needs to be sent (for the convenience of description and distinction, referred to as the inter-card transfer data volume).
- [nDevID, sdatasize] indicates that the amount of data in the first data stored in the accelerator D1i sent to the accelerator D1 indicated by nDevID is sdatasize.
- the inter-card adjustment strategy SSDISi may be empty, that is, no data in the first data stored in the corresponding accelerator D1 needs to be sent to accelerators D1 other than that accelerator.
- each of the m accelerators D1 corresponds to a first outbound data volume, for a total of m first outbound data volumes.
- the m first outbound data volumes are analyzed to determine the network communication data volume required of each of the m accelerators D1. Then, for the accelerator D1i among the m accelerators D1, the network communication data volume is subtracted from the first outbound data volume corresponding to the accelerator D1i to obtain a first difference amount FDIS: a positive value indicates the amount of data that needs to be reduced, and a negative value indicates the amount of data that needs to be increased.
- the first difference amount of the accelerator D1i can be expressed as FDISi. After the first difference amounts FDIS corresponding to the m accelerators D1 are obtained, m first difference amounts FDIS in total, the m accelerators D1 are paired and their data adjusted based on the m first difference amounts FDIS, so that the amount of data each of the m accelerators D1 finally transmits through its network card is the same as or similar to the network communication data volume; this yields the inter-card adjustment strategy SSDIS corresponding to each of the m accelerators D1.
- the network communication data volume is generally the average of the m first outbound data volumes, that is, the data volume of the first data divided by m.
- the following describes in detail how to determine the inter-card adjustment strategy corresponding to each of the m accelerators D1 based on the network communication data volumes corresponding to the m accelerators D1 and the m first difference amounts FDIS (referred to as step A for ease of description and distinction).
- the first difference quantities FDIS corresponding to the m accelerators D1 are represented in a set.
- the set formed by the m first difference quantities FDIS is called the first set, and the first set includes m first difference quantities FDIS: FDIS1, FDIS2, ..., FDISm.
- step A specifically includes the following contents:
- Step A01: determine whether there are two FDIS in the first set whose sum is greater than 0 and less than or equal to a preset threshold. If yes, execute step A02; if not, execute step A05.
- Step A02: select from the first set two FDIS whose sum is greater than 0 and less than or equal to the preset threshold; the FDIS greater than 0 is recorded as >0:FDIS, and the FDIS less than 0 is recorded as <0:FDIS.
- Step A03: determine the decision array [nDevID, sdatasize] corresponding to the accelerator D1 corresponding to >0:FDIS, where nDevID indicates the accelerator D1 corresponding to <0:FDIS and sdatasize is the absolute value of <0:FDIS.
- Step A04: delete the two FDIS from the first set and execute step A01.
- Step A05: select the largest FDIS greater than 0 from the first set and record it as >0max:FDIS, and select the FDIS that is less than 0 and has the smallest absolute value and record it as <0min:FDIS.
- Step A06: determine whether the sum of >0max:FDIS and <0min:FDIS is greater than the preset threshold. If yes, execute step A07; if not, execute step A09.
- Step A07: determine the decision array [nDevID, sdatasize] corresponding to the accelerator D1 corresponding to >0max:FDIS, where nDevID indicates the accelerator D1 corresponding to <0min:FDIS and sdatasize is the absolute value of <0min:FDIS.
- Step A08: update >0max:FDIS in the first set to the result of >0max:FDIS + <0min:FDIS, delete <0min:FDIS, and execute step A01.
- Step A09: for each accelerator D1i among the m accelerators D1, if the accelerator D1i corresponds to one or more decision arrays [nDevID, sdatasize], all decision arrays [nDevID, sdatasize] corresponding to the accelerator D1i are collected to obtain the inter-card adjustment strategy SSDISi; if the accelerator D1i does not correspond to any decision array [nDevID, sdatasize], the inter-card adjustment strategy SSDISi is empty.
- in other words, two first difference amounts FDIS can be selected from the first set. If their sum is small (no greater than the preset threshold) and the absolute value of the first difference amount >0:FDIS greater than 0 is greater than the absolute value of the first difference amount <0:FDIS less than 0, the decision array [nDevID, sdatasize] corresponding to the accelerator D1 corresponding to >0:FDIS is determined (see A03); if their sum is large (greater than the preset threshold), the decision array [nDevID, sdatasize] corresponding to the accelerator D1 corresponding to >0max:FDIS is determined (see A07), and the first set is then updated in the manner of step A08.
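Steps A01 to A09 can be sketched as the following pairing routine. This is a hedged illustration, not the patent's implementation: the dict representation, the order in which candidate pairs are examined, and the sample numbers are all assumptions.

```python
# Sketch of steps A01-A09: pair positive first-difference amounts FDIS
# (accelerators with too much outbound data) with negative ones (too little)
# to build per-accelerator decision arrays [nDevID, sdatasize].
def pair_adjust(fdis, threshold):
    """fdis: dict accelerator -> first difference amount FDIS.
    Returns dict accelerator -> list of [nDevID, sdatasize] decision arrays."""
    fdis = dict(fdis)                         # work on a copy (the "first set")
    decisions = {dev: [] for dev in fdis}
    while True:
        # A01/A02: look for a positive/negative pair whose sum is in (0, threshold]
        pair = next(
            ((p, n) for p in fdis for n in fdis
             if fdis[p] > 0 and fdis[n] < 0
             and 0 < fdis[p] + fdis[n] <= threshold),
            None)
        if pair:                              # A03/A04: record and delete both
            p, n = pair
            decisions[p].append([n, abs(fdis[n])])
            del fdis[p], fdis[n]
            continue
        pos = [d for d in fdis if fdis[d] > 0]
        neg = [d for d in fdis if fdis[d] < 0]
        if not pos or not neg:                # nothing left to pair: A09
            return decisions
        p = max(pos, key=lambda d: fdis[d])   # A05: >0max:FDIS
        n = max(neg, key=lambda d: fdis[d])   # <0min:FDIS (smallest |value|)
        if fdis[p] + fdis[n] > threshold:     # A06 yes -> A07/A08
            decisions[p].append([n, abs(fdis[n])])
            fdis[p] += fdis[n]
            del fdis[n]
        else:                                 # A06 no -> A09: stop
            return decisions

fdis = {"A": 9, "B": -4, "C": -2, "D": -3}
print(pair_adjust(fdis, 1))
# {'A': [['C', 2], ['D', 3]], 'B': [], 'C': [], 'D': []}
```

In the sample run, accelerator A sheds 2 units toward C and 3 toward D; the residual surplus of 4 exactly cancels B's deficit, so the loop stops at A09.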
- the target adjustment strategy SFDISi of accelerator D1i is obtained.
- the target data is the data marked with a final accelerator identifier tDevID indicating an accelerator D2.
- the data volume of each target data stored in the accelerator D1i is expressed in a set form.
- this set is called the target data volume set.
- the inter-card transfer data volumes sdatasize in all decision arrays of the inter-card adjustment strategy SSDISi are likewise expressed in set form.
- this set is called the inter-card transfer data volume sdatasize set.
- a data volume in the target data volume set is called a target data volume.
- a data volume in the inter-card transfer data volume sdatasize set is called a target inter-card transfer data volume Gsdatasize.
- Strategy 1: if a target data volume is greater than or equal to a target inter-card transfer data volume Gsdatasize and the difference between them is small, the target inter-card transfer data volume Gsdatasize is used as the target internal transfer data volume and associated with the target data corresponding to the target data volume.
- the target array [dataID, nDevID, fdatasize] corresponding to strategy 1 can thus be determined: the data indicated by dataID is the data corresponding to the target data volume, nDevID is the nDevID in the decision array [nDevID, sdatasize] where the target inter-card transfer data volume Gsdatasize is located, and fdatasize is the target inter-card transfer data volume Gsdatasize.
- the set update method corresponding to strategy 1 is to delete the target data volume from the target data volume set and to delete the target inter-card transfer data volume Gsdatasize from the inter-card transfer data volume sdatasize set.
- Strategy 2: if a target data volume is greater than or equal to the sum of X target inter-card transfer data volumes Gsdatasize and the difference is small, the X target inter-card transfer data volumes Gsdatasize are respectively used as target internal transfer data volumes and associated with the target data corresponding to the target data volume.
- the X target arrays [dataID, nDevID, fdatasize] corresponding to strategy 2 can thus be determined.
- the X fdatasize values in the X target arrays correspond one to one to the X target inter-card transfer data volumes Gsdatasize.
- the data indicated by dataID is the data corresponding to the target data volume.
- nDevID is the nDevID in the decision array [nDevID, sdatasize] where the target inter-card transfer data volume Gsdatasize corresponding to fdatasize is located.
- the set update method corresponding to strategy 2 is to delete the target data volume from the target data volume set and to delete the X target inter-card transfer data volumes Gsdatasize from the inter-card transfer data volume sdatasize set.
- Strategy 3: if a target inter-card transfer data volume Gsdatasize is less than or equal to the sum of Y target data volumes and the difference is small, the target inter-card transfer data volume Gsdatasize is divided according to the sizes of the Y target data volumes to obtain Y target internal transfer data volumes, which are associated with the Y target data corresponding to the Y target data volumes.
- the Y target arrays [dataID, nDevID, fdatasize] corresponding to strategy 3 can thus be determined.
- the Y fdatasize values in the Y target arrays are obtained by dividing the target inter-card transfer data volume Gsdatasize.
- the data indicated by each dataID is the data corresponding to the respective target data volume.
- nDevID is the nDevID in the decision array [nDevID, sdatasize] where the target inter-card transfer data volume Gsdatasize is located.
- the set update method corresponding to strategy 3 is to delete the Y target data volumes from the target data volume set and to delete the target inter-card transfer data volume Gsdatasize from the inter-card transfer data volume sdatasize set.
- Strategy 4: if a target data volume is larger than a target inter-card transfer data volume Gsdatasize and the difference is large, the target inter-card transfer data volume Gsdatasize is associated with the target data corresponding to the target data volume.
- the target array [dataID, nDevID, fdatasize] corresponding to strategy 4 can thus be determined: the data indicated by dataID is the data corresponding to the target data volume, nDevID is the nDevID in the decision array [nDevID, sdatasize] where the target inter-card transfer data volume Gsdatasize is located, and fdatasize is the target inter-card transfer data volume Gsdatasize.
- the set update method corresponding to strategy 4 is to update the target data volume in the target data volume set to the target data volume minus the target inter-card transfer data volume Gsdatasize, and to delete the target inter-card transfer data volume Gsdatasize from the inter-card transfer data volume sdatasize set.
- Strategy 5: if a target inter-card transfer data volume Gsdatasize is larger than a target data volume and the difference is large, the target inter-card transfer data volume Gsdatasize is divided according to the target data volume to obtain a target internal transfer data volume equal to the target data volume, which is associated with the data corresponding to the target data volume.
- the target array [dataID, nDevID, fdatasize] corresponding to strategy 5 can thus be determined: the data indicated by dataID is the data corresponding to the target data volume, nDevID is the nDevID in the decision array [nDevID, sdatasize] where the target inter-card transfer data volume Gsdatasize is located, and fdatasize is the target data volume.
- the set update method corresponding to strategy 5 is to delete the target data volume from the target data volume set, and to update the target inter-card transfer data volume Gsdatasize in the inter-card transfer data volume sdatasize set to Gsdatasize minus the target data volume.
- the above five strategies can be selected and combined at will. For example: first determine whether the situation described in strategy 1 exists in the target data volume set and the inter-card transfer data volume sdatasize set; if so, select a target data volume and a target inter-card transfer data volume Gsdatasize that satisfy strategy 1, obtain the corresponding target array [dataID, nDevID, fdatasize], and then update the two sets according to the set update method of strategy 1. Repeat until the situation described in strategy 1 no longer exists in the two sets.
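The strategy-1 loop just described might be sketched as follows. This is a greedy matching under an assumed "small difference" tolerance `eps`; the function name, data layout, and sample numbers are all illustrative, not from the patent.

```python
# Sketch of the strategy-1 loop: repeatedly match a target data volume with an
# inter-card transfer volume Gsdatasize that it meets or slightly exceeds,
# emit the target array, and remove both from their sets.
def apply_strategy1(targets, transfers, eps):
    """targets: dict dataID -> target data volume.
    transfers: list of [nDevID, sdatasize] decision arrays.
    eps: maximum allowed difference for a 'small' gap."""
    result = []  # target arrays [dataID, nDevID, fdatasize]
    matched = True
    while matched:
        matched = False
        for data_id, vol in list(targets.items()):
            for arr in transfers:
                n_dev, sdatasize = arr
                if 0 <= vol - sdatasize <= eps:      # strategy 1 condition
                    result.append([data_id, n_dev, sdatasize])
                    del targets[data_id]             # set update: drop both
                    transfers.remove(arr)
                    matched = True
                    break
            if matched:
                break
    return result

targets = {"d1": 10, "d2": 7}
transfers = [["D11", 9], ["D12", 7]]
print(apply_strategy1(targets, transfers, 1))
# [['d1', 'D11', 9], ['d2', 'D12', 7]]
```

The remaining strategies would extend this loop with multi-way sums (strategies 2 and 3) and partial matches that update rather than delete set entries (strategies 4 and 5).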
- the target array [dataID, nDevID, fdatasize] of the target data can be determined, the data indicated by dataID is the target data, nDevID is the accelerator D1 where the target data is located, and fdatasize is the data size of the target data.
- the target array [dataID, nDevID, fdatasize] of the target data can also be determined, the data indicated by dataID is the target data, nDevID is the accelerator D1 where the target data is located, and fdatasize is the result of subtracting the data size of the target data from the sdatasize in the target array where the dataID indicating the target data is located.
- any one or more strategies from strategy 1 to strategy 5 can be selected to determine the target adjustment strategy SFDISi.
- for the accelerator D1i, in order to reduce the communication cost among the m accelerators D1, it is necessary to ensure as far as possible that target data is sent to the accelerator D2 indicated by its marked final accelerator identifier tDevID by the accelerator D1 that communicates with that accelerator D2 through its network card.
- to this end, the accelerator D1i determines, for each target data, the accelerator D2 indicated by its marked final accelerator identifier tDevID and the accelerator D1 that communicates with that accelerator D2 through its network card (for the convenience of description and distinction, denoted GD1). If the inter-card adjustment strategy SSDISi has no nDevID indicating GD1, the data volume of the target data is taken as the available data volume, and the correspondence between the target data and the available data volume is recorded through an array (for the convenience of description and distinction, called the available data volume array).
- the available data amount array is expressed as [dataID, adatasize], where the data indicated by dataID is the target data, and adatasize indicates the available data amount.
- if the inter-card adjustment strategy SSDISi does have an nDevID indicating GD1 (for the convenience of description and distinction, denoted GnDevID), it is determined whether the data volume of the target data is greater than or equal to the sdatasize in the decision array where GnDevID is located.
- if so, the target array [dataID, nDevID, fdatasize] is determined, where the data indicated by dataID is the target data, nDevID indicates the accelerator D1 where the target data is located, and fdatasize is the sdatasize in the decision array where GnDevID is located. The difference between the data volume of the target data and that sdatasize can also be determined and taken as the available data volume adatasize.
- if not, the target array [dataID, nDevID, fdatasize] is determined, where the data indicated by dataID is the target data, nDevID indicates the accelerator D1 where the target data is located, and fdatasize is the data volume of the target data. The difference between the sdatasize in the decision array where GnDevID is located and the data volume of the target data (for the convenience of description and distinction, called the supplementary data volume) is then determined, and the correspondence between GnDevID and the supplementary data volume is recorded through an array (for the convenience of description and distinction, called the supplementary data volume array), represented as [nDevID, rdatasize], where nDevID is the nDevID in the decision array where that sdatasize is located and rdatasize represents the supplementary data volume.
- in this way, a set formed by the available data volumes adatasize (for the convenience of description and distinction, called the available data volume set) and a set formed by the supplementary data volumes rdatasize (called the supplementary data volume set) are obtained.
- the available data volume set and the supplementary data volume set are processed, in the same manner as the target data volume set and the inter-card transfer data volume sdatasize set above, to determine several target arrays [dataID, nDevID, fdatasize]; the only differences are that the target data volume is replaced by the target available data volume and the target inter-card transfer data volume Gsdatasize is replaced by the target supplementary data volume.
- the target array [dataID, nDevID, fdatasize] of the target data can be determined, the data indicated by dataID is the target data, nDevID is the accelerator D1 where the target data is located, and fdatasize is the data volume of the target data.
- the target array [dataID, nDevID, fdatasize] of the target data can also be determined, the data indicated by dataID is the target data, nDevID is the accelerator D1 where the target data is located, and fdatasize is the result of subtracting the data volume of the target data from the sdatasize in the target array where the dataID indicating the target data is located.
- FIG. 7b is a flowchart of step 620 in FIG. 6. As shown in FIG. 7b, based on the embodiment shown in FIG. 6, in the embodiment of the present invention, step 620 may specifically include the following steps:
- Step 6221: The processor determines the second outbound data volume and the initial communication information corresponding to each of the m accelerators D1 based on the data volume of each of the multiple copies of data and the marked final accelerator identifier tDevID; wherein the second outbound data volume indicates the data volume of the data in the first data whose marked final accelerator identifier tDevID is the communication accelerator identifier cDevID of the corresponding accelerator D1, the initial communication information indicates the data volume of such data stored by each of the other accelerators D1, and the communication accelerator identifier cDevID indicates the accelerator D2 with which the corresponding accelerator D1 communicates through the network card.
- the initial communication information OCD1i of the accelerator D1i includes multiple arrays (referred to as initial communication arrays for ease of description and distinction). Each initial communication array is expressed as [nDevID, tdatasize], where tdatasize represents the amount of data that is stored in the accelerator D1 indicated by nDevID and whose marked final accelerator identifier tDevID equals the communication accelerator identifier cDevIDi of the accelerator D1i.
- the second outbound data volumes and initial communication information corresponding to the second accelerator D12, ..., and the m-th accelerator D1m are determined similarly and are not described in detail.
- for example, assume the communication accelerator identifier cDevID of accelerator D00 indicates accelerator D10.
- the second outbound data volume corresponding to D00 is a0+a1+a2+a3
- the initial communication information OCD00 corresponding to D00 includes [D01, a1], [D02, a2], [D03, a3].
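The example above can be reproduced with a small sketch; the tabular layout and the numeric values are assumptions made purely for illustration, not from the patent.

```python
# Assumed layout: tagged[i][j] is the data volume accelerator D0i stores that
# is marked for the accelerator D2 reached through D0j's network card
# (identifier names are hypothetical).
devs = ["D00", "D01", "D02", "D03"]
tagged = {
    "D00": {"D00": 3, "D01": 1, "D02": 4, "D03": 1},   # a0, b0, c0, d0
    "D01": {"D00": 5, "D01": 9, "D02": 2, "D03": 6},   # a1, b1, c1, d1
    "D02": {"D00": 5, "D01": 3, "D02": 5, "D03": 8},   # a2, b2, c2, d2
    "D03": {"D00": 9, "D01": 7, "D02": 9, "D03": 3},   # a3, b3, c3, d3
}

# second outbound data volume of D00: everything in the node tagged for D00's
# communication partner (a0 + a1 + a2 + a3 in the text's notation)
second_outbound = {d: sum(tagged[o][d] for o in devs) for d in devs}

# initial communication information OCD00: how much of that data the *other*
# accelerators hold, as [nDevID, tdatasize] arrays
ocd00 = [[o, tagged[o]["D00"]] for o in devs if o != "D00"]
print(second_outbound["D00"], ocd00)
# 22 [['D01', 5], ['D02', 5], ['D03', 9]]
```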
- Step 6223 The processor determines a first distribution strategy for the first data in the m accelerators D1 based on the second outbound data volumes corresponding to each of the m accelerators D1 and the initial communication information.
- each of the m accelerators D1 corresponds to a second outbound data volume, m second outbound data volumes in total. The m second outbound data volumes are analyzed to determine the network communication data volume required of each of the m accelerators D1. Then, for the accelerator D1i among the m accelerators D1, the network communication data volume is subtracted from the second outbound data volume corresponding to the accelerator D1i to obtain a second difference amount SDIS: a positive value indicates the amount of data that needs to be reduced, and a negative value indicates the amount of data that needs to be increased. The second difference amount of the accelerator D1i can be expressed as SDISi. After the second difference amounts SDIS corresponding to the m accelerators D1 are obtained, m second difference amounts SDIS in total, the m accelerators D1 are paired and their data adjusted based on the m second difference amounts SDIS, so that the amount of data each of the m accelerators D1 finally transmits through the network is the same as or similar to the network communication data volume, and the inter-card adjustment strategy SSDIS corresponding to each of the m accelerators D1 is obtained.
- the second difference amounts SDIS corresponding to the m accelerators D1 are represented in a set.
- the set formed by the m second difference amounts SDIS is called a first set.
- the first set includes m second difference amounts SDIS: SDIS1, SDIS2, ..., SDISm.
- Step B can refer to the description of step A above, the only difference being that FDIS is replaced by SDIS.
- the target adjustment strategies SFDIS corresponding to the m accelerators D1 can be determined based on the inter-card adjustment strategies SSDIS corresponding to the m accelerators D1. The determination of the target adjustment strategies SFDIS corresponding to the m accelerators D1 is described in detail below.
- if the inter-card adjustment strategy SSDIS corresponding to an accelerator D1 is empty, then for each target data stored in that accelerator D1, the target array [dataID, nDevID, fdatasize] of the target data is determined, where the data indicated by dataID is the target data, nDevID indicates the accelerator D1 that communicates, through its network card, with the accelerator D2 indicated by the final accelerator identifier tDevID marked on the target data, and fdatasize is the data volume of the target data.
- for the accelerator D1 corresponding to a second difference amount >0:SDIS greater than 0 among the m second difference amounts, denoted D1j, the final accelerator identifier tDevID that is the same as the communication accelerator identifier cDevID of the accelerator D1j is used as the target final accelerator identifier GtDevID. Based on the multiple arrays [nDevID, tdatasize] in the initial communication information OCD1j, several target arrays [dataID, nDevID, fdatasize] of the target data marked with the target final accelerator identifier GtDevID and stored by the m accelerators D1 are determined. The details are as follows:
- sdatasize in the decision array is used as the required data volume, and the data volume of the target data marked with the target final accelerator identifier GtDevID stored by the accelerator D1 indicated by nDevID is used as the initial data volume.
- the data indicated by dataID is the target data of the target final accelerator identifier GtDevID stored in the accelerator D1.
- nDevID is the accelerator D1 where the target data is located, and fdatasize is the initial data volume.
- the difference between the initial data volume and the required data volume is used as the supplementary data volume rdatasize, and the supplementary data volume array [dataID, rdatasize] is recorded.
- the data indicated by dataID is the target data of the target final accelerator identifier GtDevID stored in the accelerator D1.
- the target array [dataID, nDevID, fdatasize] corresponding to the accelerator D1 indicated by nDevID in the array [nDevID, sdatasize] is recorded, where the data indicated by dataID is the target data marked with the target final accelerator identifier GtDevID stored in the accelerator D1, nDevID indicates the accelerator D1 where the target data is located, and fdatasize is the required data volume.
- the difference between the initial data volume and the required data volume is used as the available data volume adatasize, and the available data volume array [dataID, adatasize] is recorded, and the data indicated by dataID is the target data of the target terminal accelerator identifier GtDevID stored in the accelerator D1.
- the data volume of the target data marked with the target final accelerator identifier GtDevID stored by the accelerator D1 is taken as the available data volume.
- a supplementary data set formed by several supplementary data volumes rdatasize and an available data set formed by several available data volumes adatasize can be obtained. Then, following the processing described above for the required data set and the intra-node data set, the available data set and the supplementary data set are processed to determine several target arrays [dataID, nDevID, fdatasize].
- after the accelerator D1 corresponding to each second difference amount SDIS greater than 0 among the m second difference amounts has been processed, further, for every piece of target data not indicated by any dataID in the target arrays [dataID, nDevID, fdatasize] obtained so far, the target array [dataID, nDevID, fdatasize] of that target data is determined, where the data indicated by dataID is the target data, nDevID indicates the accelerator D1 that communicates through its network card with the accelerator D2 indicated by the final accelerator identifier tDevID marked on the target data, and fdatasize is the data size of the target data.
- for all target arrays [dataID, nDevID, fdatasize]: when the sum of the fdatasize values in the target arrays that share the same dataID differs from the data size of the target data indicated by that dataID, an additional target array [dataID, nDevID, fdatasize] of the target data is determined, where the data indicated by dataID is the target data, nDevID indicates the accelerator D1 that communicates through its network card with the accelerator D2 indicated by the final accelerator identifier tDevID marked on the target data, and fdatasize is the data size of the target data minus the sum of the fdatasize values in the existing target arrays containing that dataID.
- the accelerator D1 obtains, according to its corresponding target adjustment strategy, a piece of data corresponding to each target array [dataID, nDevID, fdatasize]; the data volume of that piece is fdatasize, and the piece is marked with the nDevID of the corresponding target array. When fdatasize equals the data volume of the target data indicated by dataID, the piece is the whole target data indicated by dataID; when fdatasize differs from that data volume, the piece is part of the target data indicated by dataID.
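As a hedged illustration of how an accelerator could materialise the pieces described by its target arrays (the function name and the byte-slicing representation are assumptions of this sketch, not taken from the embodiment):

```python
def pieces_from_target_arrays(stored, target_arrays):
    """stored: dict mapping dataID -> bytes of the target data on this card.
    target_arrays: list of (dataID, nDevID, fdatasize) tuples.
    Returns pieces tagged with the intra-node accelerator id nDevID;
    a piece is the whole datum when fdatasize matches its size,
    otherwise a consecutive slice of it."""
    offsets = {}   # how much of each datum has already been consumed
    pieces = []
    for data_id, n_dev, fsize in target_arrays:
        start = offsets.get(data_id, 0)
        chunk = stored[data_id][start:start + fsize]
        offsets[data_id] = start + fsize
        pieces.append({"dataID": data_id, "nDevID": n_dev, "payload": chunk})
    return pieces
```

For example, a 10-byte datum split across two target arrays of sizes 6 and 4 yields two pieces tagged with different intra-node accelerator identifiers.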
- in step 630, data that is not in the accelerator D1 indicated by its marked intra-node accelerator identifier nDevID is transmitted to the indicated accelerator D1 through the interconnection links of the m accelerators D1, so that data marked with the same intra-node accelerator identifier nDevID converges to the accelerator D1 indicated by that identifier.
- FIG. 8a is a schematic diagram of an application scenario provided by an embodiment of the present application.
- the N nodes are associated with a management node, and the management node can manage the m accelerators D in each of the N nodes.
- N*m accelerators D can be globally numbered in sequence, and the accelerators are identified as D1, D2, ..., DN*m, where 1 to m represent the accelerators D in node node0, m+1 to 2m represent the accelerators D in node node1, and so on and so forth;
- N*m accelerators D can be globally numbered in sequence, and the accelerators are identified as D01, D02, ..., D0m, ..., D(N-1)m, where D01, D02, ..., D0m represent the accelerators D in node node0, and so on and so forth.
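The two numbering schemes above can be sketched as follows; this is a toy illustration, and the helper names and index conventions are assumptions:

```python
N, m = 4, 2  # example: 4 nodes with 2 accelerators each

def global_id(node, local):
    """Scheme 1: sequential global numbering D1, D2, ..., DN*m
    (node and local are 0-based, identifiers are 1-based)."""
    return f"D{node * m + local + 1}"

def per_node_id(node, local):
    """Scheme 2: per-node numbering D01, ..., D0m, D11, ..., and so on,
    where the first digit is the node and the second the card."""
    return f"D{node}{local + 1}"
```

Under scheme 1 with m=2, the first card of node1 is D3; under scheme 2 it is D11.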
- the management node can determine N*N*m copies of data, each of which is marked with a data identifier dataID, a final accelerator identifier tDevID, an initial accelerator identifier oDevID (indicating the accelerator where the data needs to be stored), and a label. After that, the management node sends the N*N*m copies of data and its marked final accelerator identifier tDevID, initial accelerator identifier oDevID, and label to the accelerator D indicated by the data marked oDevID for storage. It is worth noting that during the data processing process, the data always carries the marked data identifier dataID, final accelerator identifier tDevID, and initial accelerator identifier oDevID.
- the management node can also determine the communication information corresponding to each of the N*m accelerators D.
- the communication information includes N-1 communication accelerator identifiers, and the accelerators D indicated by the N-1 communication accelerator identifiers are located in different nodes node, and these nodes node are nodes other than the node node where the corresponding accelerator D is located.
- the m accelerators D in node nodei each communicate, through their network cards, with one accelerator D in each of the other N-1 nodes. It is worth noting that, for any given other node, the m accelerators D in node nodei each communicate with a different accelerator D of that node.
- the management node can send the communication information corresponding to each accelerator D to the corresponding accelerator D.
- each of the N*m accelerators D stores its own communication information.
- each of the N nodes node has m accelerators D, and each accelerator D stores the first model and the second model.
- N*m copies of data can be obtained.
- Each of the N*m copies of data is marked with a label, a final accelerator identifier tDevID and an initial accelerator identifier oDevID.
- the final accelerator identifiers tDevID marked on m of the N*m copies of data indicate accelerators D of the node where the data is located;
- the final accelerator identifiers tDevID marked on the other m*(N-1) copies of data indicate accelerators D in other nodes.
- the label indicates the actual output result that the second model needs to achieve.
- the communication method provided by the embodiment of the present invention can be applied to ultra-large-scale large model training.
- the large model can be a model in the field of NLP (Natural Language Processing).
- a representation vector (embedding) is usually involved; for example, each word has one embedding, and correspondingly the data stored in the management node is the embedding. Since the capacity of the embeddings generally exceeds the storage space of an accelerator D, the full set of embeddings is deployed in the memory of the management node, and during training the management node pushes the part of the embeddings required for the current training step to the accelerators D in advance. Since the embeddings are sparse, they can be compressed; after compression, the communication method provided by the embodiment of the present invention can be used for communication, which optimizes the communication volume and improves overall throughput.
- the N*m accelerators D are respectively deployed with the encoding layer (the first model mentioned above) and the task layer (the second model mentioned above) of the embedding, wherein the encoding layer is used to extract higher-dimensional information of the embedding, and the task layer is used to implement tasks based on the output results of the encoding layer, for example, predicting the words represented by the embedding.
- each of the N*m accelerators D sends the encoding result (which may be called the first processed data) produced by passing the embedding through the encoding layer to the other N-1 accelerators D; each of the N*m accelerators D then inputs the encoding results it obtains into the task layer to obtain a task execution result; the task execution result (which may be called the second processed data) is returned along the original path. The N*m accelerators D can determine the error based on the label of each embedding and the task execution result, and can train the encoding layer and the task layer based on the error. This cycle is repeated until the N*m accelerators D each obtain a trained encoding layer and task layer.
- N*m accelerators D each have a knowledge learning model of a specific expert (the first model mentioned above) and a fusion model of multiple expert knowledge (the second model mentioned above).
- the fusion model is used to implement a task based on the output results of the knowledge learning models of different experts to obtain a task result.
- the task can be object recognition, speech recognition, fault prediction, etc.
- each of the N*m accelerators D passes the sample (i.e., the data stored in the management node) through the knowledge learning model to obtain a learning result (which can be called the first processed data); then, each of the N*m accelerators D sends the learning result to the other N-1 accelerators D, and each of the N*m accelerators D inputs the obtained learning result into the fusion model to obtain the task execution result; then, the task execution result (which can be called the second processed data) is returned along the original path.
- N*m accelerators D can determine the error based on the label of each sample and the task execution result, and can train the knowledge learning model and the fusion model based on the error. The cycle is repeated, and finally the N*m accelerators D each obtain the trained knowledge learning model and the fusion model.
- FIG8b is a flow chart of a communication method provided by an embodiment of the present application in the application scenario shown in FIG8a. As shown in FIG8b, the method may specifically include the following steps:
- Step 801 The processor of the management node determines the originating accelerator identifier oDevID, the final accelerator identifier tDevID, and the label marked on each of the N*N*m pieces of data, as well as the communication information corresponding to each of the N*m accelerators D; the communication information includes (N-1) communication accelerator identifiers cDevID.
- Step 802 The processor of the management node sends N*N*m copies of data to the accelerator D indicated by its marked originating accelerator identifier oDevID.
- the N*N*m copies of data carry the marked originating accelerator identifier oDevID, the final accelerator identifier tDevID, and a tag.
- Step 803 The processor of the management node sends each piece of communication information to its corresponding accelerator D.
- Step 804 Each of the N*m accelerators D performs a first processing on the received N*m pieces of data using the first model stored in the accelerator D to obtain N*m pieces of first processed data.
- Step 805 the N*m accelerators D each determine a target array [dataID, nDevID, fdatasize] corresponding to each of the N*m first processed data.
- for any node among the N-1 nodes other than node nodei (for convenience of description, called the target node Gnode): based on the data volume of each piece of first processed data, stored in the m accelerators D of node nodei, whose marked final accelerator identifier indicates an accelerator D in Gnode, the distribution across the m accelerators D of node nodei of all first processed data whose marked final accelerator identifier indicates an accelerator D in Gnode is adjusted, and several target arrays [dataID, nDevID, fdatasize] corresponding to each such piece of first processed data are obtained.
- the target array [dataID, nDevID, fdatasize] of the first processed data is determined.
- Step 806 Each of the N*m accelerators D obtains a plurality of pieces of first processed data adapted to the target arrays [dataID, nDevID, fdatasize], based on the target arrays corresponding to its N*m pieces of first processed data and the corresponding communication information; each piece of first processed data is marked with an originating accelerator identifier oDevID, a final accelerator identifier tDevID, a communication accelerator identifier cDevID, and an intra-node accelerator identifier nDevID, and the accelerators D indicated by cDevID and tDevID are located in the same node.
- for data stored in the accelerator Dj whose marked final accelerator identifier tDevID indicates an accelerator D in node nodei, the data does not need to be sent through the network card. Therefore, the intra-node accelerator identifier nDevID and the final accelerator identifier tDevID marked on such data are the same. Subsequently, the data marked with the intra-node accelerator identifier nDevID can be sent to the accelerator D indicated by that identifier through the interconnection links of the m accelerators D in node nodei.
- the communication accelerator identifier cDevID of a piece of data is determined from the N-1 communication accelerator identifiers cDevID in the communication information corresponding to the accelerator D, such that the accelerator D indicated by the communication accelerator identifier cDevID and the accelerator D indicated by the final accelerator identifier tDevID are located in the same node. It is worth noting that if the accelerator D indicated by the final accelerator identifier tDevID marked on the piece of data is an accelerator D of the node where the current accelerator D is located, the communication accelerator identifier cDevID is empty.
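A minimal sketch of the cDevID selection rule just described, assuming a `node_of` lookup from accelerator identifier to node (both helper names are hypothetical):

```python
def choose_cdev(tdev, own_node, comm_ids, node_of):
    """Pick, from this card's N-1 communication accelerator ids, the one
    whose accelerator sits in the same node as the final target tDevID.
    Returns None (empty) when the target is in this card's own node."""
    if node_of(tdev) == own_node:
        return None                      # stays intra-node, no NIC hop
    for c in comm_ids:
        if node_of(c) == node_of(tdev):
            return c
    raise ValueError("no communication peer in the target node")
```

For instance, under a D&lt;node&gt;&lt;card&gt; naming scheme, data on a node-0 card bound for D12 would be tagged with that card's node-1 peer, while data bound for D01 gets an empty cDevID.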
- Step 807 The m accelerators D deployed in each of the N nodes transmit, through their interconnection links, the first processed data that is not in the accelerator D indicated by its marked intra-node accelerator identifier nDevID to the indicated accelerator D, so that data marked with the same intra-node accelerator identifier nDevID converges to the accelerator D indicated by that identifier.
- after the adjustment, the m accelerators D deployed in a node hold the same or similar amounts of data marked with the same communication accelerator identifier cDevID; that is, the difference between the data volumes, held by the m accelerators D, of the data whose marked final accelerator identifier indicates an accelerator D in the j-th node is less than or equal to the preset threshold.
- the jth node represents any node other than the i-th node in the N nodes.
- if the intra-node accelerator identifier nDevID marked on a piece of data indicates the accelerator D where the data is located, the data does not need to be transmitted and remains in that accelerator D.
- Step 808 The m accelerators D deployed in each of the N nodes send, through their network cards, the first processed data together with its marked originating accelerator identifier oDevID and final accelerator identifier tDevID to the accelerator D indicated by the communication accelerator identifier cDevID marked on the data.
- Step 809 Each of the N*m accelerators D receives, through its own network card, the first processed data sent by the accelerators D indicated by its corresponding N-1 communication accelerator identifiers cDevID, together with the originating accelerator identifier oDevID and the final accelerator identifier tDevID marked on that data.
- Step 810 The m accelerators D deployed in each of the N nodes transmit, through their interconnection links, the first processed data that is not in the accelerator D indicated by its marked final accelerator identifier tDevID to the indicated accelerator D.
- Step 811 Each of the N*m accelerators D performs a second process on the obtained N*m pieces of first processed data based on a second model stored in the accelerator D to obtain N*m pieces of processed second processed data;
- Step 812 Each of the N*m accelerators D sends the N*m pieces of second processed data to the accelerators D indicated by their marked originating accelerator identifiers oDevID.
- N*m pieces of second processed data are communicated according to the method shown in the aforementioned steps 804 to 809, with the only difference being that the first processed data is replaced by the second processed data.
- Step 813 The N*m accelerators D each update the first model and the second model based on the second processed data and labels corresponding to the N*m pieces of data, and execute step 803.
- the N*m accelerators D process the N*m pieces of data stored in themselves to obtain N*m pieces of model output data. The N*m accelerators D then communicate the N*m pieces of model output data according to the method provided above, so that model output data marked with the same final accelerator identifier is aggregated to the accelerator D indicated by that identifier. After an accelerator D processes the aggregated model output data, it obtains model output data again, which is returned along the original path to implement the model update. This is repeated to implement model training.
- Figure 9a is a schematic diagram of the application scenario provided by the embodiment of the present application.
- Node0 includes two accelerators D00 and D01
- node1 includes two accelerators D10 and D11.
- D00 and D10 communicate through the network
- D01 and D11 communicate through the network
- the information marked on the data stored in the accelerators D00, D01, D10, and D11 includes a data identifier dataID, a final accelerator identifier tDevID, and an originating accelerator identifier oDevID.
- the data identifier dataID is used to distinguish different data. It can be a data address, such as the starting address and length of the data, or a data number, such as numbers and/or letters.
- The relevant information stored by the accelerators D00, D01, D10, and D11 is shown in Table 1 below:
- the first distribution strategy may be determined using the method shown in FIG. 7a .
- the first set formed by the first outbound data volumes is {14, 3};
- for the first outbound data volume 14, the first difference amount FDIS is 5.5, the difference between the first outbound data volume 14 and the network communication data volume 8.5.
- for the first outbound data volume 3, the first difference amount FDIS is -5.5, the difference between the first outbound data volume 3 and the network communication data volume 8.5.
- for D00, the target arrays [D003, D00, 8.5], [D001, D00, 1], and [D002, D01, 2] can be determined.
- for D01, the corresponding target arrays include [D011, D00, 5], [D012, D01, 7], [D013, D01, 2], and [D014, D01, 1].
- based on the target arrays, the data is determined, and the data is marked with the final accelerator identifier tDevID, the originating accelerator identifier oDevID, the intra-node accelerator identifier nDevID, and the communication accelerator identifier cDevID.
- the data indicated by D003 can be divided: one piece is recorded as D003a, and the other as D003b.
- For node1, if the method shown in FIG. 7a is adopted, the processing is similar to that of node0 and is not repeated. For node1, the first outbound data volume of D10 is 10 and that of D11 is 10, so the first set formed by the first outbound data volumes is {10, 10}; since the first outbound data volumes corresponding to D10 and D11 are the same, the inter-card adjustment strategies corresponding to D10 and D11 are empty, and under the method shown in FIG. 7a no adjustment is required. To minimize the number of data transmissions, the method shown in FIG. 7b can be adopted instead.
- the second outbound data volumes corresponding to D10 and D11 are first calculated.
- the accelerator D communicating with it is D01
- the second set formed by the second outbound data volumes is {9, 11}.
- for the second outbound data volume 9, the second difference amount SDIS is -1, the difference between the second outbound data volume 9 and the network communication data volume 10.
- for the second outbound data volume 11, the second difference amount SDIS is 1, the difference between the second outbound data volume 11 and the network communication data volume 10.
- the second difference amounts corresponding to D10 and D11 are small; therefore, the inter-card adjustment strategies of D10 and D11 are both empty.
- the target arrays corresponding to D11 can be determined as [D111, D10, 2], [D112, D11, 8], [D113, D10, 7], [D114, D11, 1].
- based on the target arrays, the data is determined, and the data is marked with the final accelerator identifier tDevID, the originating accelerator identifier oDevID, the intra-node accelerator identifier nDevID, and the communication accelerator identifier cDevID.
- the data stored in the accelerators D00, D01, D10, and D11 and the intra-node accelerator identifiers nDevID marked on them are shown in FIG. 9b.
- Step 1 D00 and D01 in node0 transmit the data that is not in the accelerator D indicated by its marked intra-node accelerator identifier nDevID to the indicated accelerator D.
- Step 2 D00 in node0 and D10 in node1 communicate through the network card, and D01 in node0 and D11 in node1 communicate through the network card.
- D00, D01, D10, and D11 each send data to the accelerator D indicated by the communication accelerator identifier cDevID marked by the data.
- the accelerator D indicated by the final accelerator identifier tDevID marked on the data and the accelerator D indicated by the marked communication accelerator identifier cDevID are located on the same node.
- Step 3 D10 and D11 in node 1 communicate with each other.
- the communication result is shown in FIG9e.
- processors in the embodiments of the present application may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof.
- a general-purpose processor may be a microprocessor or any conventional processor.
- the method steps in the embodiments of the present application can be implemented by hardware or by a processor executing software instructions.
- the software instructions can be composed of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, mobile hard disks, CD-ROMs, or any other form of storage medium known in the art.
- An exemplary storage medium is coupled to a processor so that the processor can read information from the storage medium and write information to the storage medium.
- the storage medium can also be a component of the processor.
- the processor and the storage medium can be located in an ASIC.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
- the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, integrating one or more available media.
- the available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state drive (SSD)), etc.
Abstract
A communication method, system, and server are disclosed. The method is applied to m interconnected first accelerators in a first node, where the m first accelerators communicate, in one-to-one correspondence and through network cards, with m interconnected second accelerators in a second node. The method includes: the m first accelerators adjust, through their interconnection links, the distribution of first data across the m first accelerators, so that after the adjustment the difference between the data volumes of the first data held by each of them is less than or equal to a preset threshold, the first data being the data, stored in the m first accelerators, whose marked final accelerator identifiers respectively indicate the m second accelerators; each of the m first accelerators then sends, through its own network card, the first data it holds after the adjustment to the corresponding second accelerator. The data to be sent over the network within a node is thus evenly divided among the different accelerators, which prevents an excessive communication volume on a single accelerator from causing other nodes to wait.
Description
This application claims priority to Chinese patent application No. 202211284784.6, entitled "Communication method, system and server", filed with the China National Intellectual Property Administration on October 20, 2022, which is incorporated herein by reference in its entirety.

This application relates to the field of communication technologies, and in particular to a communication method, system, and server.

With the development of algorithm technology, model training demands ever more computing power, and multiple accelerators are usually used to complete a computing task. Under model parallelism, each accelerator stores part of the model, and collective communication is needed to exchange data during model training.

At present, collective communication is implemented in a hierarchical manner, and the specific principle is as follows:

For any accelerator among the interconnected accelerators of any node, where that accelerator communicates with accelerators of other nodes through a network card: first, the data that needs to be sent to the accelerators of the other nodes with which this accelerator communicates is aggregated to this accelerator through the interconnection links among the accelerators in the node; second, the aggregated data is sent through the accelerator's own network card to the accelerators of the other nodes with which it communicates.

However, with this collective communication approach, the communication time depends on the last accelerator to complete its transfer; since the amounts of data that the accelerators of different nodes transmit over the network differ, the communication time may increase.
Summary
Embodiments of this application provide a communication method, system, and server, which can evenly divide the data to be sent from a node to other nodes over the network among the different accelerators within the node, preventing an excessive communication volume on one accelerator of the node from causing other nodes to wait, and ensuring network communication efficiency.

In a first aspect, an embodiment of this application provides a communication method, applied to a first node, where m interconnected first accelerators in the first node are in one-to-one correspondence with m interconnected second accelerators in a second node, each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed on it, multiple pieces of data stored in the m first accelerators are each marked with a final accelerator identifier, and m is a positive integer greater than or equal to 2. The method includes:

the m first accelerators adjust, through their interconnection links, the distribution of first data across the m first accelerators, so that after the adjustment the difference between the data volumes of the first data held by each of them is less than or equal to a preset threshold, where the first data is the data, among the multiple pieces of data, whose marked final accelerator identifiers respectively indicate the m second accelerators;

each of the m first accelerators sends, through its own network card, the first data it holds after the adjustment to the corresponding second accelerator, so that the m second accelerators adjust, through their interconnection links, the distribution of the received first data across the m second accelerators, such that each of the m second accelerators holds the data, in the first data, whose marked final accelerator identifier indicates itself.

In this solution, the data to be sent from a node to other nodes over the network can be evenly divided among the different accelerators within the node, which prevents an excessive communication volume on one accelerator of the node from causing other nodes to wait and ensures network communication efficiency. Afterwards, the node changes the data distribution through the interconnection channels among its m internal accelerators, so that each accelerator in the node obtains the data that the other nodes need to send to it.
In a possible implementation, the method further includes: each of the m first accelerators receives, through its own network card, second data sent by the corresponding second accelerator, where the second data includes data that is not in the first accelerator indicated by its marked final accelerator identifier; the m first accelerators transmit, through their interconnection links, the data that is not in the first accelerator indicated by its marked final accelerator identifier to the indicated first accelerator.

In this solution, when a node receives data sent by other nodes, it can change the distribution of the data among its m accelerators based on the interconnection channels among them, so that each accelerator in the node obtains the data that the other nodes need to send to it.

In a possible implementation, the multiple final accelerator identifiers marked on the multiple pieces of data include final accelerator identifiers indicating the first accelerators, and the method further includes: the m first accelerators each transmit, through their interconnection links, the data that is not in the first accelerator indicated by its marked final accelerator identifier to the indicated first accelerator.

In this solution, based on the interconnection channels among the m internal accelerators, data whose final accelerator identifier indicates an accelerator within the node is transmitted to the corresponding accelerator, so that each accelerator in the node obtains the data that the other accelerators in the node need to send to it.
In a possible implementation, the method further includes: determining a first distribution strategy of the first data across the m first accelerators based on the data volume of each of the multiple pieces of data stored in the m first accelerators and the marked final accelerator identifiers; the m first accelerators adjust the distribution of the first data across the m first accelerators according to the first distribution strategy through their interconnection links.
Optionally, determining the first distribution strategy of the first data across the m first accelerators based on the data volume of each of the multiple pieces of data stored in the m first accelerators and the marked final accelerator identifiers includes: determining a first outbound data volume corresponding to each of the m first accelerators, where the first outbound data volume indicates the sum of the data volumes of the first data stored in the corresponding first accelerator; and determining the first distribution strategy of the first data across the m first accelerators based on the first outbound data volume corresponding to each of the m first accelerators, the data volume of each of the multiple pieces of data stored in the m first accelerators, and the marked final accelerator identifiers.

Optionally, determining the first distribution strategy of the first data across the m first accelerators based on the data volume of each of the multiple pieces of data stored in the m first accelerators and the marked final accelerator identifiers includes: determining a second outbound data volume corresponding to each of the m first accelerators based on the data volume of each of the multiple pieces of data stored in the m first accelerators and the marked final accelerator identifiers, where the second outbound data volume indicates the data volume of the data, in the first data, whose marked final accelerator identifier is the communication accelerator identifier, and the communication accelerator identifier indicates the second accelerator with which the corresponding first accelerator communicates through its network card; and determining the first distribution strategy of the first data across the m first accelerators based on the second outbound data volume corresponding to each of the m first accelerators, the data volume of each of the multiple pieces of data stored in the m first accelerators, and the marked final accelerator identifiers.
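As a hedged sketch of the two outbound-volume definitions above (the helper names, the item representation, and the identifier-to-node mapping are assumptions of this sketch):

```python
def first_outbound(card_items, own_node, node_of):
    """First outbound data volume: total size of the data on this card
    whose marked final accelerator lies outside the card's own node."""
    return sum(size for tdev, size in card_items if node_of(tdev) != own_node)

def second_outbound(card_items, cdev_ids):
    """Second outbound data volume: size of the data whose marked final
    accelerator identifier is one of this card's communication
    accelerator identifiers (its direct network-card peers)."""
    return sum(size for tdev, size in card_items if tdev in cdev_ids)
```

The first definition counts everything that must cross the network; the second counts only what is already addressed to the card's own peer and thus needs no intra-node rebalancing before sending.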
In a possible implementation, the first distribution strategy indicates the communication between the first data stored in each of the m first accelerators and the other first accelerators.
In a second aspect, an embodiment of this application provides a communication system, including a first node and a second node, where m interconnected first accelerators in the first node are in one-to-one correspondence with m interconnected second accelerators in the second node, each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed on it, multiple pieces of data stored in the m first accelerators are each marked with a final accelerator identifier, m is a positive integer greater than or equal to 2, and the first node is configured to perform the method of the first aspect.

In a third aspect, an embodiment of this application provides a server, including: at least one memory for storing a program;

and m interconnected first accelerators in one-to-one correspondence with m interconnected second accelerators in a second node, where each of the m first accelerators communicates with its corresponding second accelerator through a network card deployed on it, multiple pieces of data stored in the m first accelerators are each marked with a final accelerator identifier, the m first accelerators execute the program stored in the at least one memory to implement the method of the first aspect, and m is a positive integer greater than or equal to 2.

In a feasible implementation, the server includes a processor configured to determine a first distribution strategy of the first data across the m first accelerators based on the data volume of each of the multiple pieces of data stored in the m first accelerators and the marked final accelerator identifiers; correspondingly, the m first accelerators adjust the distribution of the first data across the m first accelerators according to the first distribution strategy through their interconnection links.

In a fourth aspect, an embodiment of this application provides a computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.

In a fifth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.
FIG. 1 is a system architecture diagram of a communication system provided by an embodiment of this application;
FIG. 2a is a first schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 2b is a second schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 2c is a third schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 3 is a first schematic diagram of an application scenario provided by an embodiment of this application;
FIG. 4 is a schematic diagram of an existing communication solution in the application scenario shown in FIG. 3;
FIG. 5a is a first schematic diagram of a data transmission path provided by an embodiment of the present invention in the application scenario shown in FIG. 3;
FIG. 5b is a second schematic diagram of a data transmission path provided by an embodiment of the present invention in the application scenario shown in FIG. 3;
FIG. 6 is a first schematic flowchart of a communication method provided by an embodiment of this application;
FIG. 7a is a first schematic flowchart of step 620 in FIG. 6;
FIG. 7b is a second schematic flowchart of step 620 in FIG. 6;
FIG. 8a is a second schematic diagram of an application scenario provided by an embodiment of this application;
FIG. 8b is a schematic flowchart of a communication method provided by an embodiment of this application in the application scenario shown in FIG. 8a;
FIG. 9a is a third schematic diagram of an application scenario provided by an embodiment of this application;
FIG. 9b is a schematic diagram of the intra-node accelerator identifiers marked on data in the application scenario shown in FIG. 9a;
FIG. 9c is a schematic diagram after intra-node data transmission in the application scenario shown in FIG. 9a;
FIG. 9d is a schematic diagram after inter-node data transmission following FIG. 9c;
FIG. 9e is a schematic diagram after intra-node data transmission following FIG. 9d.
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.

In the description of the embodiments of this application, words such as "exemplary", "for example", or "for instance" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary", "for example", or "for instance" in the embodiments of this application should not be construed as being preferred or more advantageous over other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete manner.

In the description of the embodiments of this application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, B alone, or both A and B. In addition, unless otherwise stated, "multiple" means two or more; for example, multiple systems means two or more systems, and multiple terminals means two or more terminals.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
FIG. 1 is a system architecture diagram of a communication system provided by an embodiment of the present invention. As shown in FIG. 1, the communication system includes N nodes, denoted node0, node1, ..., nodeN-1. FIG. 1 shows four nodes node0, node1, node2, and node3, i.e., N=4.

The nodes communicate with each other through a network. The network may be a wired network or a wireless network. For example, the wired network may be a cable network, an optical fiber network, a digital data network (DDN), etc., and the wireless network may be a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), an InfiniBand (IB) network, an RDMA over Converged Ethernet (RoCE) network, etc., or any combination thereof. It can be understood that the network may use any known network communication protocol to implement communication between different client layers and gateways; the network communication protocol may be any of various wired or wireless communication protocols, such as Ethernet, universal serial bus (USB), FireWire, global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), new radio (NR), Bluetooth, or wireless fidelity (Wi-Fi).
FIG. 2a to FIG. 2c are schematic structural diagrams of electronic devices provided by embodiments of the present invention. As shown in FIG. 2a to FIG. 2c, in a feasible implementation, one electronic device may include one node or multiple nodes. As shown in FIG. 2a and FIG. 2b, one electronic device includes one node node0. As shown in FIG. 2c, one server includes two nodes node0 and node1.

In one example, the electronic device involved in this solution may be a physical device such as a server or a computer. Exemplary embodiments of the electronic device involved in this solution include, but are not limited to, electronic devices running iOS, Android, Windows, Harmony OS, or other operating systems. The embodiments of this application do not specifically limit the type of the electronic device.

Further, as shown in FIG. 2a and FIG. 2c, node0 includes a processor, such as a CPU (central processing unit), and m accelerators (devices), which may be abbreviated as accelerators D. The accelerators D are interconnected and communicate through interconnection links; each of the m accelerators D is provided with a network card and communicates with other nodes through the network (see the description above, not repeated here) via its network card. Here, an accelerator D is a device used for model training or model computation; the embodiments of the present invention are not intended to limit the type or structure of the model. An accelerator D may include one or more chips, such as a GPU (graphics processing unit), an MIC (many integrated core) coprocessor, or an NPU (neural-network processing unit).
The m accelerators D may number two or more; FIG. 2a shows two interconnected accelerators D, and FIG. 2b shows four accelerators D.
In one example, as shown in FIG. 2a and FIG. 2c, the network card provided on each accelerator D is connected to a switch, and the multiple accelerators D implement network communication through the switches connected to their network cards.

For the structure of the other nodes in the communication system, refer to the description of node0 above, which is not repeated here. In the embodiments of the present invention, the number of accelerators D in each of the N nodes is the same, namely m. From the current state of development, the mainstream architecture has the same number of accelerators D in each of the N nodes, and the embodiments of the present invention are mainly described around this case; however, considering that architectures with different numbers of accelerators D per node may appear in the future, the present invention does not limit the specific relationship between the numbers of accelerators D in the N nodes.

In the embodiments of the present invention, for any node among the N nodes, any accelerator D in that node communicates through its network card with N-1 accelerators D of the other N-1 nodes (one in each node). In addition, the node stores data that needs to be sent to the N*m accelerators D in the N nodes; specifically, the node stores N*m pieces of data, and each piece of data is marked with the identifier of the accelerator D to which it needs to be sent. FIG. 3 is a schematic diagram of an application scenario provided by an embodiment of the present invention. As shown in FIG. 3, there are two nodes node0 and node1; node0 includes four accelerators D denoted D00, D01, D02, and D03, and node1 includes four accelerators D denoted D10, D11, D12, and D13. The accelerator D00 in node0 stores data a0, b0, c0, and d0 that need to be sent to the accelerators D10, D11, D12, and D13 in node1, as well as data (not shown) that need to be sent to the accelerators D00, D01, D02, and D03 in node0; the accelerator with which D00 communicates through its network card is D10, denoted NC:D10. The accelerators D01, D02, and D03 are similar and are not repeated. The accelerator D10 in node1 stores data (not shown) that need to be sent to the accelerators D00, D01, D02, and D03 in node0, as well as data (not shown) that need to be sent to the accelerators D10, D11, D12, and D13 in node1; the accelerator with which D10 communicates through its network card is D00, denoted NC:D00. The accelerators D11, D12, and D13 are similar and are not repeated.
In the related art, to reduce the scale of network communication, communication is mainly implemented in two steps. Step 1: for any accelerator D of the N nodes, the data that needs to be sent to the N-1 accelerators D with which that accelerator D communicates through its network card is aggregated to that accelerator D through the interconnection links among the m accelerators D of the node where it is located. Step 2: for any accelerator D of the N nodes, the data aggregated in step 1 is sent through its own network card to the accelerators D of the other nodes to which the data needs to be sent. For ease of understanding, a detailed description is given below with reference to the application scenario shown in FIG. 3. FIG. 4 is an example diagram of the transmission path of the data stored in node0 that is destined for the accelerator D10 in the application scenario shown in FIG. 3. It should be noted that the transmission paths of the data destined for the accelerators D01, D02, D03, D11, D12, and D13 in FIG. 3 are similar, differing only in the communication peers and the data sent.

As shown in FIG. 4, in step 1, the accelerator D00 in node0 communicates with the accelerator D10, and the data that needs to be sent to D10 through the network card is aggregated to D00; at this point, D00 stores the data a0, a1, a2, and a3, destined for D10, from the four accelerators D00, D01, D02, and D03 of node0. The accelerators D01, D02, D03, D10, D11, D12, and D13 are similar (not shown), differing only in the communication peers and the stored data, and are not repeated. In step 2, the accelerator D00 in node0 sends the stored data a0, a1, a2, and a3 destined for D10 to D10 through its network card, so that D10 obtains a0, a1, a2, and a3. The accelerators D01, D02, D03, D10, D11, D12, and D13 are similar (not shown), differing only in the communication peers and the data sent, and are not repeated.

However, in the technical solution of the related art described above, since the N*m accelerators D of the N nodes communicate in parallel, the completion time of the communication depends on the last accelerator D to complete its transfer. Because the amounts of data transmitted by different accelerators D through their network cards differ, many nodes may wait for the last node to complete its network transmission, wasting a large amount of bandwidth and computing power.
为了解决上述技术问题,本发明实施例提出了一种负载均衡的通信方法,对于N个节点node中任一节点node,将需要通过网卡发往另一个节点node的数据均分在节点内的互连的m个加速器D上,使得该节点node内的每个加速器D通过网卡向另一个节点node发送的数据的数据量相似或相同,避免了节点node中的某个加速器D的通信量过大导致其他的节点等待。
在具体实现时,主要分成3步完成,step1,N个节点node中每个节点node的m个加速器D通过互连的链路进行数据传递,使得节点node内的每个加速器D通过网卡向其所在节点之外的任一节点node发送的数据的数据量相似或相同;step2,N个节点的N*m个加速器D之间通过网卡进行网络通信,此时,由于节点node内的每个加速器D通过网卡向相同节点发送的数据的数据量相似或相同,从而避免了节点node中的某个加速器D的通信量过大导致其他的节点等待;step3,N个节点中的部分或全部节点node中的m个加速器D之间通过互连的链路进行数据传递,使得N*m个加速器D各自得到自身所在节点node之外的N-1个节点需要发往自身的数据。
为了便于理解本发明实施例提供的方案和相关技术方案中的区别,下面结合图3示出的应用场景进行详细的描述。图5a和图5b为图3所示的应用场景下node0存储的发往加速器D10的数据的传输路径的示例图。
图5a为图3所示场景下node0存储的发往加速器D10的数据的传输路径的示例图一。
如图5a所示,对于节点node0中的加速器D01,存储的发往加速器D10的数据a1分为两部分a11和a12,a11需要发往加速器D00,a12需要发往加速器D01。加速器D00通过网卡与D10通信,D01通过网卡与D11通信,D02通过网卡与D12通信,D03通过网卡与D13通信。则数据传输的具体过程如下:
step1,对于节点node0中的加速器D00,将需要通过网卡发往D10的数据a11、a2、a3汇聚到加速器D00上,此时,加速器D00上存储有节点node0中4个加速器D00、D01、D02、D03存储的需要发往D10的数据a0、a11、a2、a3。
step2,对于节点node0中的加速器D00,将存储的需要发往D10的数据a0、a11、a2、a3通过网卡发往加速器D10,对于节点node0中的加速器D01,将存储的需要发往D10的数据a12通过网卡发往加速器D11。
step3,对于节点node1中的加速器D11,将存储的需要发往D10的数据a12通过互连的链路发往加速器D10,使得节点node1中的加速器D10存储有node0中需要发往D10的数据a0、a1(a1=a11+a12)、a2、a3。
图5b为图3所示场景下node0存储的发往加速器D10的数据的传输路径的示例图二。
如图5b所示,对于节点node0中的加速器D01,其存储有发往加速器D10的数据a1和需要发往加速器D12的数据c1,其中数据c1需要首先发往加速器D00。加速器D00通过网卡与D10通信,D01通过网卡与D11通信,D02通过网卡与D12通信,D03通过网卡与D13通信,则数据传输的具体过程如下:
step1,对于节点node0中的加速器D00,加速器D01、加速器D02和加速器D03与加速器D00通信,将需要发往D10的数据a1、a2、a3以及需要发往D12的数据c1,汇聚到加速器D00上。
step2,对于节点node0中的加速器D00,将存储的需要发往D10的数据a0、a1、a2、a3和需要发往D12的数据c1通过网卡发往加速器D10。
step3,对于节点node1中的加速器D10,其接收到node0中需要发往D10的数据a0、a1、a2、a3和需要发往D12的数据c1,之后,将存储的需要发往D12的数据c1通过互连的链路发往加速器D12。
下文将详细描述图1所示的通信系统中的节点间的通信方法的过程。每个节点各自的通信方法相同,下面以一个节点为例进行说明,为了便于描述和区别,可以将该节点记为节点node1,node1也可以称为第一节点。考虑到节点之间通信需要通过网卡,而本发明实施例中,只存在一次网卡通信,因此,只需要考虑两个节点之间的通信平衡即可,node1和node1之外的任一节点的通信方法相同,下面以node1与自身之外的一个节点的通信为例进行描述,为了便于描述和区别,将node1之外的一个节点称为node2,node2也可以称为第二节点。该通信方法可以由node1中的CPU执行,也可以由node1中的每个加速器D执行,从当前的发展来看,目前主流的执行主体为CPU,本发明实施例主要围绕执行主体为CPU来说明本发明实施例提供的方法。另外,为了便于描述和区别,将node1中的加速器D称为加速器D1,将node2的加速器D称为加速器D2。node1中的m个加速器D1存储有多份数据,每份数据标记有信息,标记的信息至少包括最终需要到达的加速器D的标识(为了便于描述和区别,称为终加速器标识,可以表示为tDevID)。终加速器标识tDevID指示了数据最终需要到达的加速器D。对应的,多份数据的多个终加速器标识tDevID包括分别指示m个加速器D2的m个终加速器标识tDevID。在实际应用中,多份数据可以对应相同的终加速器标识tDevID。为了便于描述和区别,加速器D1可以记为D1,m个加速器D1记为D11、D12、…、D1m;加速器D2可以记为D2,m个加速器D2记为D21、D22、…、D2m。
图6为本发明实施例提供的节点的通信方法的流程图一。如图6所示,具体包括如下步骤。
步骤610、m个加速器D1向处理器发送自身存储的多份数据各自的数据量和标记的终加速器标识tDevID。
步骤620、处理器基于m个加速器D1存储的多份数据各自的数据量和标记的终加速器标识tDevID,确定第一数据在m个加速器D1分布的第一分布策略;其中,第一数据为多份数据中标记分别指示m个加速器D2的终加速器标识tDevID的数据。
这里,第一分布策略指示了第一数据在m个加速器D1的分布情况,第一数据为m个加速器D1存储的多份数据中标记指示m个加速器D2的m个终加速器标识tDevID的所有份数据,具体地来说指示了发往node2中m个加速器D2的所有份数据。
本发明实施例中,node1基于m个加速器D1存储的多份数据各自的数据量和标记的最终需要发往的加速器D的终加速器标识tDevID,以需要发往node2的全部数据即第一数据均匀分布在m个加速器D1为目的,调整需要发往node2的第一数据在m个加速器D1的分布,得到第一分布策略。
步骤630、m个加速器D1通过自身互连的链路,按照第一分布策略调整第一数据在m个加速器D1的分布,使得各自调整后自身具有的第一数据中的数据的数据量差异小于等于预设阈值。
本发明实施例,通过第一分布策略调整第一数据在m个加速器D1分布,使得m个加速器D1各自调整后自身具有的第一数据中的数据的数据量差异较小,比如相同,再比如相似。
举例来说,假设node1中m个加速器D1各自存储的标记指示m个加速器D2的m个终加速器标识tDevID的m份数据各自的数据量为a1、a2、…、am,m个加速器D1各自调整后自身具有的第一数据中的数据的数据量与(a1+a2+…+am)/m相同或相似。
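上述均衡目标可以用如下Python片段示意(其中的数值为便于说明而假设,并非来自上文):

```python
# 假设 node1 中 m 个加速器 D1 各自存储的、需发往 node2 的数据量 a1..am
a = [14, 3, 8, 7]      # 示例数值(假设)
m = len(a)

# 调整后每个加速器应持有的第一数据的数据量约为 (a1 + a2 + ... + am) / m
target = sum(a) / m
print(target)          # 8.0
```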
步骤641、m个加速器D1各自通过自身配置的网卡向对应的加速器D2,发送调整后自身具有的第一数据中的数据。
在实际应用中,m个加速器D1标记有通信加速器标识,可以记为cDevID,该通信加速器标识cDevID指示了通过网卡通信的加速器D2。因此,m个加速器D1各自通过自身配置的网卡,将调整后自身具有的第一数据中的数据发往通信加速器标识cDevID指示的加速器D2。在实际应用中,加速器D1可以为调整后自身具有的第一数据中的数据标记通信加速器标识cDevID。
举例来说,假设D1i和D2i对应,即D1i指示的加速器D1和D2i指示的加速器D2通过网卡通信,则加速器D1i的通信加速器标识cDevIDi为D2i。
步骤651、m个加速器D2通过自身互连的链路,调整接收到的第一数据在m个加速器D2的分布,使得m个加速器D2具有第一数据中标记终加速器标识tDevID指示自身的数据。
在实际应用中,node2中的m个加速器D2基于m个加速器D1发送的每份数据和其标记的终加速器标识tDevID,通过互连的链路,将未在标记的终加速器标识tDevID指示的加速器D2内的数据传输至指示的加速器D2,调整第一数据在m个加速器D2的分布,使得m个加速器D2各自在调整后存储有第一数据中标记指示自身的终加速器标识tDevID的数据。
可选地,m个加速器D1通过自身的网卡,将步骤630调整后自身具有的第一数据中的数据和其对应的数据标识(用于区别不同的数据,比如可以为数字和/或字母等)传输至通信的加速器D2。考虑到第一数据中的每份数据各自标记有数据标识(记为dataID)和终加速器标识tDevID,数据的数据标识dataID和终加速器标识tDevID存在对应关系,因此,m个加速器D1各自得到指示数据标识dataID和终加速器标识tDevID的对应关系的信息(为了便于描述和区别,可以称为对应信息),将该对应信息发送到对应的加速器D2,从而使得对应的加速器D2基于该对应信息,确定接收到的对应的加速器D1发送的数据标记的终加速器标识tDevID。
可选地,m个加速器D1通过自身的网卡,各自将步骤630调整后自身具有的第一数据中的数据和其对应的终加速器标识tDevID传输至与其通信的加速器D2,从而使得node2中的每个加速器D2接收到对应的加速器D1发送的数据和其标记的终加速器标识tDevID。
综上,本发明实施例中,将node1需要通过网卡发往node2的数据均分在node1内的互连的m个加速器D1上,使得该node1内的每个加速器D1通过网卡向node2发送的数据的数据量相似或相同,避免了node1中的某个加速器D1的通信量过大导致其他的节点等待。后续,node2再通过内部m个加速器D2之间的互连的通道,改变数据分布,使得node2中的每个加速器D2得到node1需要发往自身的数据。
值得注意的是,m个加速器D1中存储的多份数据标记的多个终加速器标识tDevID包括指示加速器D1的终加速器标识tDevID。因此,在调整第一数据在m个加速器D1的分布的过程中,m个加速器D1通过自身互连的链路,各自将存储的数据发往其标记的终加速器标识tDevID指示的自身之外的加速器D1。具体地,步骤630还可以包括如下内容:
m个加速器D1通过自身互连的链路,各自将存储的数据发往其标记的终加速器标识tDevID指示的自身之外的加速器D1。
在上述图6所示实施例中步骤610至步骤630的基础上,如图6所示,本发明实施例中,在执行步骤641的同时,至少还包括如下步骤:
步骤642、m个加速器D2各自通过自身的网卡向对应的加速器D1发送第二数据;其中,第二数据包括未在标记的终加速器标识tDevID指示的加速器D1内的数据。
需要说明的是,第二数据可以理解为node2按照前述步骤610至630处理后的m个加速器D2中的所有标记指示加速器D1的数据。
可选地,加速器D1可以同时接收第二数据和第二数据标记的终加速器标识tDevID。
可选地,加速器D1可以先接收第二数据和其对应的数据标识dataID,然后接收数据标识dataID和终加速器标识tDevID的对应关系的对应信息,基于该对应信息,可以得到接收第二数据和其对应的终加速器标识tDevID。
值得注意的是,若第二数据中不存在未在标记的终加速器标识tDevID指示的加速器D1内的数据,此时,node2和node1之间完成数据通信。
对应的,在执行步骤651的同时,至少还包括如下步骤:
步骤652、m个加速器D1通过自身互连的链路,将未在标记的终加速器标识tDevID指示的加速器D1内的数据传输至指示的加速器D1。
根据一种可行的实现方式,第一分布策略包括m个加速器D1各自对应的目标调整策略(记为SFDIS),对于m个加速器D1中的第i个加速器D1i,加速器D1i对应的目标调整策略SFDISi指示了加速器D1i中存储的第一数据中的每份数据(为了便于描述和区别,称为目标数据)各自在m个加速器D1的分布情况。在实际应用中,目标调整策略可以包括多个数组(为了便于描述和区别,称为目标数组),每个目标数组通过[dataID、nDevID、fdatasize]表示,其中,dataID表示D1i存储的目标数据的数据标识、nDevID表示需要发往的加速器D1的标识(为了便于描述和区别,可以称为节点内加速器标识)、fdatasize表示需要发送的数据的数据量(为了便于描述和区别,可以称为目标内传数据量)。对应的,[dataID、nDevID、fdatasize]表示dataID指示的目标数据中需要向nDevID指示的加速器D1发送的数据的数据量为fdatasize。值得注意的是,一个数据标识dataID可以对应多个数组,每个数组中的nDevID表示不同的节点内加速器标识。在实际应用中,目标调整策略中的节点内加速器标识nDevID指示的加速器D1可以为加速器D1i,则加速器D1i存储的第一数据中的该数据无需发往加速器D1i之外的其他的加速器D1。
在该实现方式的基础上,本发明实施例提供了图6中步骤620的两种实现方式。
实现方式1,图7a是图6中步骤620的流程示意图。如图7a所示,在上述图6所示实施例的基础上,本发明实施例中,步骤620,具体可以包括如下步骤:
步骤6211、处理器基于m个加速器D1存储的多份数据各自的数据量和标记的终加速器标识tDevID,确定m个加速器D1各自对应的第一外传数据量;其中,第一外传数据量指示了对应的加速器D1存储的第一数据中的数据的数据量之和。
举例来说,对于m个加速器D1中的加速器D1i,假设其存储的标记指示m个加速器D2的m个终加速器标识tDevID的m份目标数据的数据量为TDi1、TDi2、…、TDim,则加速器D1i对应的第一外传数据量ODATASIZEi=TDi1+TDi2+…+TDim。这里,i=1、2、…、m。
举例来说,如图3所示,对于node0,D00的第一外传数据量为a0+b0+c0+d0。D01的第一外传数据量为a1+b1+c1+d1。D02的第一外传数据量为a2+b2+c2+d2。D03的第一外传数据量为a3+b3+c3+d3。
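第一外传数据量的计算过程可以用如下Python片段示意(其中的数据结构和数值均为便于说明而假设):

```python
# 每个加速器D1存储的多份数据,按 {终加速器标识 tDevID: 数据量} 组织(假设)
node0 = {
    "D00": {"D10": 3, "D11": 5, "D12": 2, "D13": 4},  # 对应 a0,b0,c0,d0(示例数值)
    "D01": {"D10": 1, "D11": 2, "D12": 6, "D13": 3},  # 对应 a1,b1,c1,d1(示例数值)
}
remote = {"D10", "D11", "D12", "D13"}  # node1 中的 m 个加速器D2

# 第一外传数据量 = 标记指示对端节点加速器的所有数据的数据量之和
out_size = {dev: sum(v for t, v in data.items() if t in remote)
            for dev, data in node0.items()}
print(out_size)  # {'D00': 14, 'D01': 12}
```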
步骤6212、处理器基于m个加速器D1各自对应的第一外传数据量、多份数据各自的数据量和标记的终加速器标识tDevID,确定第一数据在m个加速器D1的第一分布策略。
具体地,处理器基于m个加速器D1各自对应的第一外传数据量,确定m个加速器D1各自对应的卡间调整策略SSDIS。加速器D1i对应的卡间调整策略SSDISi指示了加速器D1i需要向自身之外的其他的加速器D1,发送的加速器D1i存储的第一数据中的数据的数据量。在实际应用中,卡间调整策略SSDISi包括若干个数组(为了便于描述和区别,称为决策数组),每个决策数组通过[nDevID、sdatasize]表示。其中,nDevID表示节点内加速器标识,用于指示加速器D1i之外的其他的加速器D1,sdatasize表示需要发送的数据的数据量(为了便于描述和区别,可以称为卡间内传数据量)。[nDevID、sdatasize]表示向nDevID指示的加速器D1发送的加速器D1i存储的第一数据中的数据的数据量为sdatasize。在实际应用中,卡间调整策略SSDISi可以为空,即无需向对应的加速器D1之外的其他的加速器D1发送对应的加速器D1存储的第一数据中的数据。
在一种可行的实现方式中,m个加速器D1各自对应一个第一外传数据量总共m个第一外传数据量。对m个第一外传数据量进行分析,确定m个加速器D1各自需要的网络通信数据量;然后,对于m个加速器D1中的加速器D1i,将该加速器D1i对应的第一外传数据量减去网络通信数据量,得到第一差异量FDIS(若为正值则指示了需要减少的数据的数据量,若为负值,则指示了需要增加的数据的数据量),加速器D1i的第一差异量可以表示为FDISi;在得到m个加速器D1各自对应的第一差异量FDIS共m个第一差异量FDIS后,基于m个第一差异量FDIS,对m个加速器D1进行配对和数据调整,使得m个加速器D1最终通过网卡传输的数据的数据量和网络通信数据量相同或相似,得到m个加速器D1各自对应的卡间调整策略SSDIS。
在实际应用中,网络通信数据量一般为m个第一外传数据量的均值,也即第一数据的数据量除以m的值。下面对基于m个加速器D1对应的网络通信数据量和m个第一差异量FDIS,确定m个加速器D1各自对应的卡间调整策略(为了便于描述和区别,称为步骤A)进行详细描述。
在实际应用中,m个加速器D1各自对应的第一差异量FDIS以集合的方式表示,这里,为了便于描述和区别,将m个第一差异量FDIS形成的集合称为第一集合,第一集合包括m个第一差异量FDIS:FDIS1、FDIS2、……、FDISm。
可选地,步骤A具体包括如下内容:
步骤A01、判断第一集合中是否存在相加之和大于0且小于等于预设阈值的两个FDIS,如果是,执行步骤A02,如果否,执行步骤A05。
步骤A02、从第一集合中选择相加之和大于0且小于等于预设阈值的两个FDIS,大于0的FDIS记为>0:FDIS,小于0的FDIS记为<0:FDIS。
步骤A03、对于>0:FDIS,确定>0:FDIS对应的加速器D1对应的决策数组[nDevID、sdatasize],nDevID指示<0:FDIS对应的加速器D1,sdatasize为<0:FDIS的绝对值。
步骤A04、删除第一集合中的两个FDIS,执行步骤A01。
步骤A05、从第一集合中选择大于0的最大的FDIS记为>0max:FDIS,以及,小于0且绝对值最小的FDIS记为<0min:FDIS。
步骤A06、判断>0max:FDIS和<0min:FDIS之和是否大于预设阈值,如果是,执行步骤A07,如果否,执行步骤A09。
步骤A07、确定>0max:FDIS对应的加速器D1对应的决策数组[nDevID、sdatasize],nDevID指示<0min:FDIS对应的加速器D1,sdatasize为<0min:FDIS的绝对值。
步骤A08、将第一集合中大于0的>0max:FDIS更新为>0max:FDIS+<0min:FDIS的结果,删除<0min:FDIS,执行步骤A01。
步骤A09、对于m个加速器D1中的加速器D1i,若该加速器D1i对应有决策数组[nDevID、sdatasize],则统计该加速器D1i对应的所有的决策数组[nDevID、sdatasize],汇总后得到卡间调整策略SSDISi。若该加速器D1i未对应有决策数组[nDevID、sdatasize],则卡间调整策略SSDISi为空。
需要说明的是,上述方法仅仅作为示例,并不构成具体限定,在实际应用,可以从第一集合中随机选择两个第一差异量FDIS,若这两个第一差异量FDIS相加之和大于0且小于等于预设阈值,且大于0的第一差异量>0:FDIS的绝对值大于小于0的第一差异量<0:FDIS的绝对值,确定大于0的第一差异量>0:FDIS对应的加速器D1对应的决策数组[nDevID、sdatasize](参见A03);若这两个第一差异量FDIS相加之和大于预设阈值,确定大于0的第一差异量>0:FDIS对应的加速器D1对应的决策数组[nDevID、sdatasize](参见A07),然后,按照步骤A08的方式更新第一集合。
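上述步骤A01至A09的配对过程可以用如下Python片段示意(实现细节为示意性假设,其中将A01中"相加之和大于0"放宽为"大于等于0",以覆盖两个FDIS恰好互为相反数的情形):

```python
def pair_adjust(fdis, threshold):
    """步骤A01~A09的示意实现(假设):fdis为各加速器的第一差异量FDIS,
    返回每个加速器的决策数组列表 [(nDevID, sdatasize), ...]。"""
    s = {i: d for i, d in enumerate(fdis)}
    plans = {i: [] for i in range(len(fdis))}
    while any(d > 0 for d in s.values()):
        # A01/A02:寻找一正一负、相加之和接近0(不超过阈值)的两个FDIS
        pair = next(((i, j) for i in s for j in s
                     if s[i] > 0 > s[j] and 0 <= s[i] + s[j] <= threshold), None)
        if pair:
            i, j = pair
            plans[i].append((j, abs(s[j])))  # A03:正差异方向负差异方发送|FDIS|的数据
            del s[i], s[j]                   # A04:删除已配对的两个FDIS
            continue
        # A05:选大于0的最大FDIS与小于0且绝对值最小的FDIS
        i = max((k for k in s if s[k] > 0), key=lambda k: s[k])
        negs = [k for k in s if s[k] < 0]
        if not negs:
            break
        j = min(negs, key=lambda k: abs(s[k]))
        if s[i] + s[j] > threshold:          # A06/A07/A08:部分配对并更新集合
            plans[i].append((j, abs(s[j])))
            s[i] += s[j]
            del s[j]
        else:
            break                            # A09:剩余差异已在阈值内,结束
    return plans

print(pair_adjust([5.5, -5.5], 1.0))  # {0: [(1, 5.5)], 1: []}
```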
接下来对如何确定m个加速器D1各自对应的目标调整策略SFDIS进行描述。考虑到m个加速器D1中的每个加速器D1的目标调整策略SFDIS的确定方式相同,下面以确定加速器D1i的目标调整策略SFDISi为例进行描述。
根据一种可行的实现方式,基于加速器D1i的卡间调整策略SSDISi中的所有决策数组[nDevID、sdatasize],结合加速器D1i存储的每份目标数据的数据量,得到加速器D1i的目标调整策略SFDISi。目标数据为标记指示加速器D2的终加速器标识tDevID的数据。
在实际应用中,加速器D1i存储的每份目标数据的数据量以集合的方式表示,这里,为了便于描述和区别,将该集合称为目标数据量集合。卡间调整策略SSDISi中的所有决策数组中的卡间内传数据量sdatasize以集合的方式表示,这里,为了便于描述和区别,将该集合称为卡间内传数据量sdatasize集合。
下面给出5种分配策略。为了便于描述和区别,将目标数据量集合中的数据量称为目标数据量,将卡间内传数据量称为目标卡间内传数据量Gsdatasize。
策略1,目标数据量大于等于目标卡间内传数据量Gsdatasize,且差值较小,则将目标卡间内传数据量Gsdatasize作为目标内传数据量关联到目标数据量对应的目标数据。
对应的,可以确定策略1对应的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为目标数据量对应的数据,nDevID为目标卡间内传数据量Gsdatasize所在的决策数组[nDevID、sdatasize]中的nDevID,fdatasize为目标卡间内传数据量Gsdatasize。
对应的,策略1对应的集合更新方式为删除目标数据量集合和卡间内传数据量sdatasize集合中的目标数据量和目标卡间内传数据量Gsdatasize。
策略2,目标数据量大于等于X个目标卡间内传数据量Gsdatasize之和,且差值较小,则将X个目标卡间内传数据量Gsdatasize分别作为目标内传数据量关联到目标数据量对应的目标数据。
对应的,可以确定策略2对应的X个目标数组[dataID、nDevID、fdatasize],X个目标数组中的X个fdatasize分别与X个目标卡间内传数据量Gsdatasize一一对应,dataID指示的数据为目标数据量对应的数据,nDevID为fdatasize对应的目标卡间内传数据量Gsdatasize所在的决策数组[nDevID、sdatasize]中的nDevID。
对应的,策略2对应的集合更新方式为删除目标数据量集合和卡间内传数据量sdatasize集合中的目标数据量和X个目标卡间内传数据量Gsdatasize。
策略3,目标卡间内传数据量Gsdatasize小于等于Y个目标数据量之和,且差值较小,则按照目标数据量的大小,将该目标卡间内传数据量Gsdatasize划分后得到Y个目标内传数据量,Y个目标内传数据量关联到Y个目标数据量对应的Y个目标数据。
对应的,可以确定策略3对应的Y个目标数组[dataID、nDevID、fdatasize],Y个目标数组中的Y个fdatasize分别为目标卡间内传数据量Gsdatasize划分得到,dataID指示的数据为目标数据量对应的数据,nDevID为目标卡间内传数据量Gsdatasize所在的决策数组[nDevID、sdatasize]中的nDevID。
对应的,策略3对应的集合更新方式为删除目标数据量集合和卡间内传数据量sdatasize集合中的Y个目标数据量和目标卡间内传数据量Gsdatasize。
策略4,目标数据量大于目标卡间内传数据量Gsdatasize,且相差较大,则将目标卡间内传数据量Gsdatasize关联到目标数据量对应的目标数据上。
对应的,可以确定策略4对应的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为目标数据量对应的数据,nDevID为目标卡间内传数据量Gsdatasize所在的决策数组[nDevID、sdatasize]中的nDevID,fdatasize为目标卡间内传数据量Gsdatasize。
对应的,策略4对应的集合更新方式为将目标数据量集合中的目标数据量更新为目标数据量减去目标卡间内传数据量Gsdatasize的结果,删除卡间内传数据量sdatasize集合中的目标卡间内传数据量Gsdatasize。
策略5,目标数据量小于目标卡间内传数据量Gsdatasize,且相差较大,则按照目标数据量对目标卡间内传数据量Gsdatasize进行划分,得到和目标数据量相同的目标内传数据量,将其关联到目标数据量对应的数据上。
对应的,可以确定策略5对应的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为目标数据量对应的数据,nDevID为目标卡间内传数据量Gsdatasize所在的决策数组[nDevID、sdatasize]中的nDevID,fdatasize为目标数据量。
对应的,策略5对应的集合更新方式为删除目标数据量集合中的目标数据量,将卡间内传数据量sdatasize集合中的目标卡间内传数据量Gsdatasize更新为目标卡间内传数据量Gsdatasize减去目标数据量的结果。
上述5个策略可以随意选择组合。示例地,首先,判断目标数据量集合和卡间内传数据量sdatasize集合中是否存在策略1描述的情况,若存在,则选择满足策略1的目标数据量和目标卡间内传数据量Gsdatasize,得到策略1对应的目标数组[dataID、nDevID、fdatasize],之后,按照策略1对应的集合更新方式更新目标数据量集合和卡间内传数据量sdatasize集合。循环反复,直到目标数据量集合和卡间内传数据量sdatasize集合中不存在策略1描述的情况。
接着,判断目标数据量集合和卡间内传数据量sdatasize集合中是否存在策略2描述的情况,若存在,则选择满足策略2的目标数据量和X个目标卡间内传数据量Gsdatasize,得到策略2对应的X个目标数组[dataID、nDevID、fdatasize],之后,按照策略2对应的集合更新方式更新目标数据量集合和卡间内传数据量sdatasize集合。循环反复,直到目标数据量集合和卡间内传数据量sdatasize集合中不存在策略2描述的情况。
接着,判断目标数据量集合和卡间内传数据量sdatasize集合中是否存在策略3描述的情况,若存在,则选择满足策略3的Y个目标数据量和目标卡间内传数据量Gsdatasize,得到策略3对应的Y个目标数组[dataID、nDevID、fdatasize],之后,按照策略3对应的集合更新方式更新目标数据量集合和卡间内传数据量sdatasize集合。循环反复,直到目标数据量集合和卡间内传数据量sdatasize集合中不存在策略3描述的情况。
接着,判断目标数据量集合和卡间内传数据量sdatasize集合中是否存在策略4描述的情况,若存在,则选择满足策略4的目标数据量和目标卡间内传数据量Gsdatasize,得到策略4对应的目标数组[dataID、nDevID、fdatasize],之后,按照策略4对应的集合更新方式更新目标数据量集合和卡间内传数据量sdatasize集合。循环反复,直到目标数据量集合和卡间内传数据量sdatasize集合中不存在策略4描述的情况。
接着,判断目标数据量集合和卡间内传数据量sdatasize集合中是否存在策略5描述的情况,若存在,则选择满足策略5的目标数据量和目标卡间内传数据量Gsdatasize,得到策略5对应的目标数组[dataID、nDevID、fdatasize],之后,按照策略5对应的集合更新方式更新目标数据量集合和卡间内传数据量sdatasize集合。循环反复,直到目标数据量集合和卡间内传数据量sdatasize集合中不存在策略5描述的情况。
最后,对于所有的目标数组[dataID、nDevID、fdatasize]中dataID没有指示的目标数据,可以确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为该目标数据的数据量。对于所有的目标数组[dataID、nDevID、fdatasize]中的相同dataID所在的目标数组中的fdatasize之和,与该dataID指示的目标数据的数据量不同时,还可以确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为该目标数据的数据量减去指示该目标数据的dataID所在的其他目标数组中的fdatasize之和的结果。
最后,统计所有的目标数组[dataID、nDevID、fdatasize],得到加速器D1i的目标调整策略SFDISi。
需要说明的是,上述方案仅仅作为示例,并不构成具体限定,在实际应用,可以选择策略1至策略5的任意一个或多个策略,确定目标调整策略SFDISi。
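上述策略4和策略5描述的按数据量切分的贪心匹配,可以用如下Python片段示意(数据结构为假设,仅示意切分逻辑,未覆盖策略1至策略3的就近匹配):

```python
def build_sfdis(datas, ssdis):
    """将加速器D1i存储的目标数据按卡间调整策略切分为目标数组
    [dataID, nDevID, fdatasize] 的示意实现(假设)。
    datas: {dataID: 目标数据量}, ssdis: [(nDevID, sdatasize), ...]。"""
    arrays, remain = [], dict(datas)
    for ndev, need in ssdis:
        for did in list(remain):
            if need <= 0:
                break
            take = min(remain[did], need)   # 策略4/策略5:按较小者切分
            arrays.append([did, ndev, take])
            remain[did] -= take
            need -= take
            if remain[did] == 0:
                del remain[did]
    # 未被外发的剩余部分保留在本加速器(nDevID指示自身,此处记为"self")
    for did, left in remain.items():
        arrays.append([did, "self", left])
    return arrays

print(build_sfdis({"a": 10, "b": 4}, [("D01", 5.5)]))
# [['a', 'D01', 5.5], ['a', 'self', 4.5], ['b', 'self', 4]]
```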
下面介绍另一种可行的实现方式,对于加速器D1i,为了减少m个加速器D1之间的通信成本,需要尽可能保障目标数据发往其标记的终加速器标识tDevID指示的加速器D2通过网卡通信的加速器D1。
可选地,对于加速器D1i存储的每份目标数据,确定该份目标数据标记的终加速器标识tDevID指示的加速器D2,通过网卡通信的加速器D1(为了便于描述和区别,记为GD1),若卡间调整策略SSDISi不存在指示GD1的nDevID,则该目标数据的数据量作为可用数据量,并通过数组(为了便于描述和区别,称为可用数据量数组)记录该目标数据和可用数据量的对应关系,可用数据量数组表示为[dataID、adatasize],dataID指示的数据为该目标数据,adatasize表示可用数据量;若卡间调整策略SSDISi存在指示GD1的nDevID(为了便于描述和区别,记为GnDevID),则判断该目标数据的数据量是否大于等于GnDevID所在的决策数组中的sdatasize。
如果,该目标数据的数据量大于等于GnDevID所在数组中的sdatasize,确定目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为GnDevID所在决策数组中的sdatasize。另外,若大于,进一步地,还可以确定该目标数据的数据量和GnDevID所在的决策数组中的sdatasize之间的差值,该差值为可用数据量adatasize。
若该目标数据的数据量小于GnDevID所在的决策数组中的sdatasize,确定目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为该目标数据的数据量。进一步地,确定该目标数据的数据量和GnDevID所在数组中的sdatasize之间的差值(为了便于描述和区别,称为补充数据量),并通过数组(为了便于描述和区别,称为补充数据量数组)记录GnDevID和补充数据量之间的对应关系,补充数据量数组表示为[nDevID、rdatasize],nDevID为GnDevID,rdatasize表示补充数据量。
对于加速器D1i,将存储的每份目标数据按照上述方式处理完成后,得到可用数据量adatasize形成的集合(为了便于描述和区别,称为可用数据量集合)和补充数据量rdatasize形成的集合(为了便于描述和区别,称为补充数据量集合)。
然后,按照上述对目标数据量集合和卡间内传数据量sdatasize集合的处理方式,处理可用数据量集合和补充数据量集合,确定若干个目标数组[dataID、nDevID、fdatasize]。这里,相对于目标数据量集合和卡间内传数据量sdatasize集合的处理过程来说,区别仅仅在于将目标数据量替换为目标可用数据量,目标卡间内传数据量Gsdatasize替换为目标补充数据量。
另外,对于所有的目标数组[dataID、nDevID、fdatasize]中dataID没有指示的加速器D1i存储的目标数据,可以确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为该目标数据的数据量。对于所有的目标数组[dataID、nDevID、fdatasize]中的相同dataID所在的目标数组中的fdatasize之和,与该dataID指示的目标数据的数据量不同时,还可以确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为该目标数据的数据量减去指示该目标数据的dataID所在的其他目标数组中的fdatasize之和的结果。
最后,统计所有的目标数组[dataID、nDevID、fdatasize],得到加速器D1i的目标调整策略SFDISi。
在实际应用中,如不存在补充数据量,统计所有的目标数组[dataID、nDevID、fdatasize],得到加速器D1i的目标调整策略SFDISi。
实现方式2,图7b是图6中步骤620的流程示意图。如图7b所示,在上述图6所示实施例的基础上,本发明实施例中,步骤620,具体可以包括如下步骤:
步骤6221、处理器基于多份数据各自的数据量和标记的需发往的终加速器标识tDevID,确定m个加速器D1各自对应的第二外传数据量和初始通信信息;其中,第二外传数据量指示了第一数据中标记的终加速器标识tDevID为通信加速器标识cDevID的数据的数据量,初始通信信息指示了其他的加速器D1各自存储的标记的终加速器标识tDevID为通信加速器标识的数据的数据量,通信加速器标识cDevID指示了对应的加速器D1通过网卡通信的加速器D2。
在实际应用中,加速器D1i的初始通信信息OCD1i包括多个数组(为了便于描述和区别,称为初始通信数组),每个初始通信数组表示为[nDevID,tdatasize],tdatasize表示nDevID指示的加速器D1存储的,标记的终加速器标识tDevID为加速器D1i的通信加速器标识cDevIDi的数据的数据量。
举例来说,对于m个加速器D1中的加速器D1i,假设其存储的标记指示m个加速器D2的m个终加速器标识tDevID的m份数据的数据量为TDi1、TDi2、…、TDim,各自标记的终加速器标识tDevID为D21、D22、…、D2m,假设加速器D2i和加速器D1i通信,则加速器D1i对应的通信加速器标识cDevIDi为加速器D2i;则第1个加速器D1的D11的对应的通信加速器标识cDevID1为D21,则第二外传数据量ODATASIZE1=TD11+TD21+…+TDm1,初始通信信息OCD11包括[nDevID=D12,tdatasize=TD21]、…、[nDevID=D1m,tdatasize=TDm1]。第2个加速器D12、…、第m个加速器D1m对应的第二外传数据量和初始通信信息类同,不再赘述。
举例来说,如图3所示,对于node0,D00和D10通信,通信加速器标识cDevID=D10,则D00对应的第二外传数据量为a0+a1+a2+a3,D00对应的初始通信信息OCD00包括[D01,a1],[D02,a2],[D03,a3]。
D01和D11通信,通信加速器标识cDevID=D11,则D01对应的第二外传数据量为b0+b1+b2+b3,D01对应的初始通信信息OCD01包括[D00,b0],[D02,b2],[D03,b3]。
D02和D12通信,通信加速器标识cDevID=D12,则D02对应的第二外传数据量为c0+c1+c2+c3,D02对应的初始通信信息OCD02包括[D00,c0],[D01,c1],[D03,c3]。
D03和D13通信,通信加速器标识cDevID=D13,则D03对应的第二外传数据量为d0+d1+d2+d3,D03对应的初始通信信息OCD03包括[D00,d0],[D01,d1],[D02,d2]。
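第二外传数据量和初始通信信息的计算可以用如下Python片段示意(其中的数据结构和数值均为便于说明而假设):

```python
# vols[i][j]: node0 中加速器 D0i 存储的、需发往 node1 中 D1j 的数据量(示例数值)
vols = [
    [5, 1, 2, 3],   # a0, b0, c0, d0
    [4, 6, 1, 2],   # a1, b1, c1, d1
    [2, 3, 7, 1],   # a2, b2, c2, d2
    [1, 2, 3, 8],   # a3, b3, c3, d3
]
m = len(vols)

# D0j 与 D1j 通过网卡通信,第二外传数据量为第 j 列之和
out2 = [sum(vols[i][j] for i in range(m)) for j in range(m)]

# 初始通信信息:其他加速器存储的、终加速器标识为本卡通信对端的数据量
ocd = {j: [(i, vols[i][j]) for i in range(m) if i != j] for j in range(m)}

print(out2)      # [12, 12, 13, 14]
print(ocd[0])    # [(1, 4), (2, 2), (3, 1)]
```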
步骤6223、处理器基于m个加速器D1各自对应的第二外传数据量和初始通信信息,确定第一数据在m个加速器D1的第一分布策略。
在一种可行的实现方式中,m个加速器D1各自对应一个第二外传数据量总共m个第二外传数据量,对m个第二外传数据量进行分析,确定m个加速器D1各自需要的网络通信数据量;然后,对于m个加速器D1中的每个加速器D1,将该加速器D1i对应的第二外传数据量减去网络通信数据量,得到第二差异量SDIS(若为正值则指示了需要减少的数据的数据量,若为负值,则指示了需要增加的数据的数据量),加速器D1i的第二差异量可以表示为SDISi;在得到m个加速器D1各自对应的第二差异量SDIS共m个第二差异量SDIS后,基于m个第二差异量SDIS,对m个加速器D1进行配对和数据调整,使得m个加速器D1最终通过网络传输的数据的数据量和网络通信数据量相同或相似,最终确定m个加速器D1各自对应的卡间调整策略SSDIS。卡间调整策略SSDIS的描述参见上文描述不再赘述。
m个加速器D1各自对应的第二差异量SDIS以集合的方式表示,这里,为了便于描述和区别,将m个第二差异量SDIS形成的集合称为第一集合,第一集合包括m个第二差异量SDIS:SDIS1、SDIS2、……、SDISm。
下面对基于m个加速器D1对应的网络通信数据量和m个第二差异量SDIS,确定m个加速器D1各自对应的卡间调整策略SSDIS(为了便于描述和区别,称为步骤B)进行详细描述。步骤B可以参见上文对步骤A的描述,区别仅仅在于将FDIS替换为SDIS。
在确定了m个加速器D1各自对应的卡间调整策略SSDIS后,即可基于m个加速器D1各自对应的卡间调整策略SSDIS确定m个加速器D1各自对应的目标调整策略SFDIS。下面对确定m个加速器D1各自对应的目标调整策略SFDIS进行详细描述。
需要说明的是,对于m个第二差异量SDIS中小于0的第二差异量<0:SDIS对应的加速器D1,该加速器D1对应的卡间调整策略SSDIS为空,则对于该加速器D1存储的每份目标数据,确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID指示该目标数据标记的终加速器标识tDevID指示的加速器D2通过网卡通信的加速器D1,fdatasize为该目标数据的数据量。
另外,对于m个第二差异量中大于0的第二差异量>0:SDIS对应的加速器D1,为了便于描述和区别,假设该加速器D1为D1j,与该加速器D1j的通信加速器标识cDevID相同的终加速器标识tDevID作为目标终加速器标识GtDevID;基于该加速器D1j对应的卡间调整策略SSDISj和初始通信信息OCD1j(多个数组[nDevID,tdatasize]),确定m个加速器D1存储的标记目标终加速器标识GtDevID的目标数据的若干个目标数组[dataID、nDevID、fdatasize]。具体如下:
具体地,对于卡间调整策略SSDISj中每个决策数组[nDevID、sdatasize],将该决策数组中的sdatasize作为需求数据量,nDevID指示的加速器D1存储的标记目标终加速器标识GtDevID的目标数据的数据量作为初始数据量。
判断需求数据量是否大于初始数据量,如果是,确定该决策数组[nDevID、sdatasize]中nDevID指示的加速器D1对应的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该加速器D1中存储的标记目标终加速器标识GtDevID的目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为初始数据量。并将初始数据量和需求数据量之间的差值作为补充数据量rdatasize,并记录补充数据量数组[dataID、rdatasize],dataID指示的数据为该加速器D1中存储的标记目标终加速器标识GtDevID的目标数据。
如果否,确定该决策数组[nDevID、sdatasize]中nDevID指示的加速器D1对应的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该加速器D1中存储的标记目标终加速器标识GtDevID的目标数据,nDevID为该目标数据所在的加速器D1,fdatasize为需求数据量。并将初始数据量和需求数据量之间的差值作为可用数据量adatasize,并记录可用数据量数组[dataID、adatasize],dataID指示的数据为该加速器D1中存储的标记目标终加速器标识GtDevID的目标数据。
另外,对于卡间调整策略SSDISj中所有决策数组[nDevID、sdatasize]中的nDevID没有指示的加速器D1,将该加速器D1存储的标记目标终加速器标识GtDevID的目标数据的数据量作为可用数据量。
在处理完卡间调整策略SSDISj中每个决策数组后,可以得到若干个补充数据量rdatasize形成的补充数据量集合和若干个可用数据量adatasize形成的可用数据量集合,然后,按照上述对目标数据量集合和卡间内传数据量sdatasize集合的处理方式,处理可用数据量集合和补充数据量集合,确定若干个目标数组[dataID、nDevID、fdatasize]。
由此,即确定了m个加速器D1存储的标记目标终加速器标识GtDevID的目标数据的若干个目标数组[dataID、nDevID、fdatasize]。
在处理完m个第二差异量中每个大于0的第二差异量>0:SDIS对应的加速器D1后,进一步地,对于所有的目标数组[dataID、nDevID、fdatasize]中的dataID没有指示的目标数据,确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID指示该目标数据标记的终加速器标识tDevID指示的加速器D2通过网卡通信的加速器D1,fdatasize为该目标数据的数据量。对于所有的目标数组[dataID、nDevID、fdatasize]中的相同dataID所在的目标数组中的fdatasize之和,和该dataID指示的目标数据的数据量不同时,确定该目标数据的目标数组[dataID、nDevID、fdatasize],dataID指示的数据为该目标数据,nDevID指示该目标数据标记的终加速器标识tDevID指示的加速器D2通过网卡通信的加速器D1,fdatasize为该目标数据的数据量减去dataID所在的目标数组中的fdatasize的结果。
最后,统计所有的目标数组[dataID、nDevID、fdatasize],得到m个加速器D1各自对应的目标调整策略SFDIS。
需要说明的是,上述方法仅仅作为示例,并不构成具体限定。
值得注意的是,在实际应用中,对于m个加速器D1中的每个加速器D1,该加速器D1按照对应的目标调整策略中的[dataID、nDevID、fdatasize],得到每个目标数组[dataID、nDevID、fdatasize]对应的一份数据,该数据的数据量为fdatasize,并为该数据标记对应的目标数组中的nDevID;当fdatasize和dataID指示的目标数据的数据量相同时,该数据为dataID指示的目标数据;当fdatasize和dataID指示的目标数据的数据量不同时,该数据为dataID指示的目标数据中的部分。对应的,在步骤630中,通过自身互连的链路,将未在标记的节点内加速器标识nDevID指示的加速器D1内的数据,传输至对应的加速器D1,使得标记相同的节点内加速器标识nDevID的数据汇聚到指示的加速器D1。
图8a是本申请实施例提供的应用场景的示意图二。如图8a所示,N个节点node具有管理节点,管理节点可以管理N个节点node中的m个加速器D,比如,可以对N*m个加速器D进行全局顺序编号,得到加速器标识为D1、D2、…、DN*m,这里,1至m表示节点node0中的加速器D,m+1至2m表示节点node1中的加速器D,以此类推不再赘述;再比如,可以对N*m个加速器D进行全局顺序编号,得到加速器标识为D01、D02、…、D0m、…、D(N-1)m,这里,D01、D02、…、D0m表示节点node0中的加速器D,以此类推不再赘述。具体地,该管理节点可以确定N*N*m份数据,每份数据标记有数据标识dataID、终加速器标识tDevID、始加速器标识oDevID(表示数据需要存储的加速器)、标签,之后,管理节点将N*N*m份数据和其标记的终加速器标识tDevID、始加速器标识oDevID和标签,发送至数据标记的oDevID指示的加速器D存储。值得注意的是,在数据处理过程中,数据一直带着标记的数据标识dataID、终加速器标识tDevID、始加速器标识oDevID。
另外,管理节点还可以确定N*m个加速器D各自对应的通信信息。其中,通信信息包括N-1个通信加速器标识,N-1个通信加速器标识指示的加速器D位于不同的节点node,且这些节点node为对应的加速器D所在的节点node之外的节点node。具体来说,对于N个节点node中的节点nodei,节点nodei中的m个加速器D各自与其他的N-1节点中的1个加速器D通过网卡通信,值得注意的是,节点nodei中的m个加速器D各自与相同节点中不同的加速器D通过网卡通信。之后,管理节点可以将每个加速器D各自对应的通信信息发送到对应的加速器D。N*m个加速器D各自存储有加速器D通信信息。
另外,N个节点node中每个节点node各自具有m个加速器D,每个加速器D各自存储有第一模型和第二模型,在管理节点发送数据后,可以得到N*m份数据,N*m份数据各自标记有标签、终加速器标识tDevID和始加速器标识oDevID,其中,N*m份数据中的m份数据标记的终加速器标识tDevID指示自身所在节点的加速器D,其他的m*(N-1)份数据标记的终加速器标识tDevID指示其他的节点中的加速器D,标签指示了第二模型需要达到的真实输出结果。
具体地,本发明实施例提供的通信方法可以应用于超大规模的大模型训练。
示例地,该大模型可以为NLP(Natural Language Processing,自然语言处理)领域的模型。在NLP领域,通常涉及到表征向量embedding,比如,一个词一个embedding,对应的,管理节点存储的数据为embedding。由于embedding的容量普遍超过了加速器D的存储空间,所以全量的embedding部署在管理节点的内存中,在训练过程中由该管理节点提前将当前训练所需要的部分embedding放到加速器D。由于embedding具有稀疏性,可以被压缩,压缩后可以使用本发明实施例提供的通信方法来进行通信,可以优化通信量,提高整体吞吐量。具体地,N*m个加速器D分别部署有embedding的编码层(上述第一模型)和任务层(上述第二模型),其中,编码层用于提取embedding的更高维的信息,任务层用于基于编码层的输出结果实现任务,比如,预测embedding表示的词。在实际应用中,N*m个加速器D各自将embedding通过编码层后的编码结果(可以称为第一处理数据)发送到其他的N-1个加速器D上,N*m个加速器D各自将得到的编码结果输入到任务层,得到任务执行结果;然后将任务执行结果(可以称为第二处理数据)原路返回,N*m个加速器D可以基于每个embedding各自的标签和任务执行结果确定误差,基于误差可以训练编码层和任务层,循环反复,最终N*m个加速器D各自得到训练好的编码层和任务层。
示例地,N*m个加速器D分别具有特定专家的知识学习模型(上述第一模型)和多个专家知识的融合模型(上述第二模型),融合模型用于基于不同专家的知识学习模型的输出结果实现任务,得到任务结果,该任务可以为物体识别,也可以为语音识别,还可以为故障预测等任务。在实际应用中,N*m个加速器D各自将样本(即管理节点存储的数据)通过知识学习模型后得到学习结果(可以称为第一处理数据);然后,N*m个加速器D各自将学习结果发送到其他的N-1个加速器D上,N*m个加速器D各自将得到的学习结果输入到融合模型,得到任务执行结果;然后将任务执行结果(可以称为第二处理数据)原路返回,N*m个加速器D可以基于每个样本各自的标签和任务执行结果确定误差,基于误差可以训练知识学习模型和融合模型,循环反复,最终N*m个加速器D各自得到训练好的知识学习模型和融合模型。
图8b是图8a所示的应用场景下的本申请实施例提供的通信方法的流程示意图。如图8b所示,具体可以包括如下步骤:
步骤801、管理节点的处理器确定N*N*m份数据各自标记的始加速器标识oDevID、终加速器标识tDevID、标签,以及,N*m个加速器D各自对应的通信信息,通信信息包括(N-1)个通信加速器标识cDevID。
步骤802、管理节点的处理器将N*N*m份数据发往其标记的始加速器标识oDevID指示的加速器D,N*N*m份数据携带标记的始加速器标识oDevID、终加速器标识tDevID、标签。
步骤803、管理节点的处理器将N*m个通信信息各自发往其对应的加速器D。
步骤804、N*m个加速器D各自通过自身存储的第一模型对接收到的N*m份数据进行第一处理,得到处理后的N*m份第一处理数据。
步骤805、N*m个加速器D各自确定N*m份第一处理数据各自对应的目标数组[dataID、nDevID、fdatasize]。
对于节点nodei中的第j个加速器Dj,对于节点nodei之外的N-1个节点node中的任意节点(为了便于描述,可以称为目标节点Gnode),基于节点nodei中m个加速器D存储的标记的终加速器标识指示Gnode中加速器D的每份第一处理数据的数据量,调整标记的终加速器标识指示Gnode中加速器D的所有第一处理数据在节点nodei中的m个加速器D的分布,得到标记的终加速器标识指示Gnode中加速器D的每份第一处理数据对应的若干个目标数组[dataID、nDevID、fdatasize]。在处理完节点nodei中m个加速器D存储的标记的终加速器标识指示节点nodei之外的N-1个节点node中的加速器D的第一处理数据后,对于加速器Dj中存储的标记的终加速器标识指示节点nodei中的加速器D的第一处理数据,确定该第一处理数据的目标数组[dataID、nDevID、fdatasize]。最终,得到加速器Dj存储的N*m份第一处理数据各自对应的目标数组[dataID、nDevID、fdatasize]。
步骤806、N*m个加速器D各自基于处理后的N*m份第一处理数据各自对应的目标数组[dataID、nDevID、fdatasize]和对应的通信信息,得到适配目标数组[dataID、nDevID、fdatasize]的多份第一处理数据;其中,每份第一处理数据标记有始加速器标识oDevID、终加速器标识tDevID、通信加速器标识cDevID、节点内加速器标识nDevID,cDevID和tDevID指示的加速器D位于同一节点内。
值得注意的是,对于节点nodei中的第j个加速器Dj,加速器Dj存储的标记的终加速器标识tDevID指示的加速器D为节点nodei内加速器D的数据,该数据无需通过网卡发送,因此,这些数据标记的节点内加速器标识nDevID和终加速器标识tDevID相同,后续,通过节点nodei中m个加速器D的互连链路将标记节点内加速器标识nDevID的数据发往该标识指示的加速器D即可。
需要说明的是,对于N*m个加速器D的每个加速器D中的适配目标数组[dataID、nDevID、fdatasize]的每份数据,基于该数据标记的终加速器标识tDevID,从该加速器D对应的通信信息中的N-1个通信加速器标识cDevID确定该数据的通信加速器标识cDevID,通信加速器标识cDevID和终加速器标识tDevID指示的加速器D所在的节点相同。值得注意的是,若该份标记的终加速器标识tDevID指示的加速器D为自身所在节点的加速器D,该通信加速器标识cDevID为空。
步骤807、N个节点各自部署的m个加速器D通过自身互连的链路,将未在标记的节点内加速器标识nDevID指示的加速器D内的第一处理数据传输至指示的加速器D,使得标记相同的节点内加速器标识nDevID的数据汇聚到该标识指示的加速器D。
需要说明的是,对于节点nodei,自身部署的m个加速器D各自在调整后,标记的相同的通信加速器标识cDevID的数据的数据量相同,各自具有的标记指示第j个节点中的加速器D的终加速器标识的所有份数据的数据量之间的差值小于等于预设阈值。第j个节点表示N个节点中第i个节点之外的任一节点。另外,若数据标记的节点内加速器标识nDevID为该数据所在的加速器D,则该数据无需传输,保留在加速器D中即可。
步骤808、N个节点各自部署的m个加速器D各自通过网卡,将第一处理数据和其标记的始加速器标识oDevID、终加速器标识tDevID,发往其标记的通信加速器标识cDevID指示的加速器D。
这里,无需处理通信加速器标识cDevID为空的数据。
步骤809、N*m个加速器D各自通过自身的网卡接收对应的N-1个通信加速器标识cDevID指示的加速器D发送的第一处理数据和其标记的始加速器标识oDevID、终加速器标识tDevID。
步骤810、N个节点各自部署的m个加速器D通过自身互连的链路,各自将未在标记的终加速器标识tDevID指示的加速器D的第一处理数据传输至指示的加速器D。
步骤811、N*m个加速器D各自基于自身存储的第二模型对得到的N*m份第一处理数据进行第二处理,得到处理后的N*m份第二处理数据;
步骤812、N*m个加速器D各自将N*m份第二处理数据发送到其标记的始加速器标识oDevID指示的加速器D。
这里,N*m份第二处理数据按照前述步骤804至809所示的方法进行通信,区别仅仅在于将第一处理数据替换成第二处理数据。
步骤813、N*m个加速器D各自基于N*m份数据各自对应的第二处理数据和标签,更新第一模型和第二模型,执行步骤804。
综上,N*m个加速器D会对自身存储的N*m份数据进行处理,得到处理后的N*m份模型输出数据,之后,N*m个加速器D之间进行按照前述提供的方法对处理后的N*m份模型输出数据进行通信后,使得标记相同的终加速器标识的模型输出数据汇聚到该标识指示的加速器D中,该加速器D对汇聚后的模型输出数据进行处理后,再次得到模型输出数据后原路返回,实现模型更新,如此反复,实现模型训练。
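上述step1至step3的整体流程可以用如下Python片段做一个最小化模拟(加速器编号、数据及其标记均为假设,仅示意三步数据流):

```python
# step1/step2/step3 整体流程的最小模拟(所有编号与数据均为假设)
def node_of(dev):            # 加速器所在节点,如 "n0_1" -> "n0"
    return dev.split("_")[0]

# 网卡通信的对端(即通信加速器标识 cDevID)
partner = {"n0_0": "n1_0", "n0_1": "n1_1",
           "n1_0": "n0_0", "n1_1": "n0_1"}

# 每份数据为 (dataID, tDevID, nDevID),初始分布由第一分布策略给出(此处手工指定)
devs = {
    "n0_0": [("d1", "n1_0", "n0_0")],            # 留在本卡,经网卡发往 n1_0
    "n0_1": [("d2", "n1_0", "n0_0"),             # step1 先汇聚到 n0_0
             ("d3", "n1_1", "n0_1")],
    "n1_0": [], "n1_1": [],
}

def move(should_move, dest):
    """将每个加速器上满足条件的数据移动到目标加速器"""
    for dev in list(devs):
        stay, go = [], []
        for it in devs[dev]:
            (go if should_move(dev, it) else stay).append(it)
        devs[dev] = stay
        for it in go:
            devs[dest(dev, it)].append(it)

move(lambda dev, it: it[2] != dev, lambda dev, it: it[2])         # step1: 按nDevID汇聚
move(lambda dev, it: node_of(it[1]) != node_of(dev),
     lambda dev, it: partner[dev])                                # step2: 网卡发往对端
move(lambda dev, it: it[1] != dev, lambda dev, it: it[1])         # step3: 按tDevID分发

# 所有数据均已到达其终加速器标识 tDevID 指示的加速器
assert all(it[1] == dev for dev, items in devs.items() for it in items)
print({d: [it[0] for it in v] for d, v in devs.items() if v})
# {'n1_0': ['d1', 'd2'], 'n1_1': ['d3']}
```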
下面结合图9a至图9e对本发明实施例的具体实现进行示例性的描述。
图9a是本申请实施例提供的应用场景的示意图三,如图9a所示,存在两个节点node0和node1,node0中包括2个加速器D00、D01,node1中包括2个加速器D10、D11。其中,D00与D10通过网络通信,D01与D11通过网络通信;加速器D00、D01、D10、D11标记的信息包括数据标识dataID、终加速器标识tDevID、始加速器标识oDevID,这里,数据标识dataID用于区别不同的数据,可以为数据地址,比如,数据的起始地址和长度,也可以为数据编号,比如,数字和/或字母。则加速器D00、D01、D10、D11存储的相关的信息如下表1所示:
表1
其中,对于node0来说,可以采用图7a所示的方法确定第一分布策略。
首先计算D00、D01各自的第一外传数据量。
对于D00,加速器D00存储的标记终加速器标识D10、D11的数据的数据量为:10、4,则D00对应的第一外传数据量为14=10+4。
对于D01,加速器D01存储的标记终加速器标识D10、D11的目标数据的数据量为:2、1,则D01对应的第一外传数据量为3=2+1。
则第一外传数据量形成的第一集合为{14、3};
然后,计算网络通信数据量为第一集合的均值8.5=(14+3)/2。
然后,计算D00、D01各自对应的第一差异量。
对于D00,第一差异量FDIS为第一外传数据量14和网络通信数据量8.5的差值5.5。
对于D01,第一差异量FDIS为第一外传数据量3和网络通信数据量8.5的差值-5.5。
由于两个第一差异量FDIS=5.5、-5.5可以匹配,基于此,可以确定D00对应的卡间调整策略包括D00对应的决策数组[nDevID=D01、sdatasize=5.5];D01对应的卡间调整策略为空。
接着,确定D00对应的目标数组:考虑到D01和D11通信,为了减少在节点传输的开销,需要将数据传输至其标记的终加速器标识tDevID指示的加速器通信的加速器,因此,确定D00存储的标记终加速器标识tDevID=D11的数据的数据量4,考虑到数据量4小于D00对应的第一差异量FDIS=5.5,因此,可以确定D00对应的目标数组[D004,D01,4],表示D00向D01发送的D004指示的数据的数据量为4。进一步的确定补充数据量rdatasize=1.5=5.5-4;标记终加速器标识tDevID=D10的数据的数据量10作为可用数据量adatasize;这里,只有一个可用数据量adatasize=10和一个补充数据量rdatasize=1.5,因此,可以确定D00对应的目标数组[D003,D01,1.5],表示D00向D01发送的D003指示的数据的数据量为1.5。另外,还可以确定目标数组[D003,D00,8.5]、[D001,D00,1],[D002,D01,2]。
对于D01,其对应的目标数组包括[D011、D00、5]、[D012、D01、7]、[D013、D01、2]、[D014、D01、1]。
之后,对于每个目标数组,确定该目标数组下的数据,并为该数据标记终加速器标识tDevID、始加速器标识oDevID、节点内加速器标识nDevID、通信加速器标识cDevID。
示例地,对于目标数组[D004,D01,4],D004指示的数据在标记终加速器标识tDevID=D11、始加速器标识oDevID=D00的基础上,还可以为D004指示的数据标记节点内加速器标识nDevID=D01,通信加速器标识cDevID=D11。
示例地,对于目标数组[D001,D00,1],D001指示的数据在标记终加速器标识tDevID=D00、始加速器标识oDevID=D00的基础上,还可以为D001指示的数据标记节点内加速器标识nDevID=D00,通信加速器标识cDevID为空,或者,为D00。
对于[D003,D01,1.5]和[D003,D00,8.5],可以对D003指示的数据划分,一份数据记为D003a,另一份数据记为D003b。D003a的数据量为8.5,标记的节点内加速器标识nDevID=D00、通信加速器标识cDevID=D10;D003b的数据量为1.5,标记的节点内加速器标识nDevID=D01、通信加速器标识cDevID=D11。D003a和D003b各自指示的数据还标记相同的终加速器标识tDevID=D10、始加速器标识oDevID=D00。
其他的目标数组类同,不再赘述。
其中,对于node1来说,若采用图7a所示的方法,处理过程和node0类同,不再赘述。则对于node1来说,D10的第一外传数据量为10,D11的第一外传数据量为10;则第一外传数据量形成的第一集合为{10、10},D10和D11各自对应的第一外传数据量相同,则D10和D11对应的卡间调整策略为空。考虑到采用图7a所示的方法无需调整,为了尽可能减少数据传输的次数,则可以采用图7b所示的方法。
则对于node1来说,首先计算D10、D11各自对应的第二外传数据量。
对于D10,通信加速器标识cDevID=D00,表示D10和D00通信,则D10和D11存储的标记终加速器标识tDevID=D00的数据的数据量为:7、2;则D10对应的第二外传数据量为9=7+2。另外,对于D10来说,与其通信的加速器D为D00,D11需要将自身存储的终加速器标识tDevID=D00的数据发送到D10,因此,D10对应的初始通信信息包括[D11,2],表示加速器D11向加速器D10传输的数据的数据量为2。
对于D11,通信加速器标识cDevID=D01,表示D11和D01通信,D10和D11存储的标记终加速器标识tDevID=D01的数据的数据量为:3、8;则D11对应的第二外传数据量为11=3+8。另外,对于D11来说,与其通信的加速器D为D01,D10需要将自身存储的终加速器标识tDevID=D01的数据发送到D11,因此,D11对应的初始通信信息包括[D10,3],表示加速器D10向加速器D11传输的目标数据的数据量为3。
则第二外传数据量形成的第二集合为{9、11}。
然后,计算网络通信数据量为第二集合的均值10=(9+11)/2。
然后,计算D10、D11各自对应的第二差异量SDIS。
对于D10,第二差异量SDIS为第二外传数据量9和网络通信数据量10的差值-1。
对于D11,第二差异量SDIS为第二外传数据量11和网络通信数据量10的差值1。
D10和D11各自对应的第二差异量的绝对值均较小(小于等于预设阈值),因此,D10和D11各自的卡间调整策略均为空。
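上述D10、D11的第二外传数据量与第二差异量的计算可以用如下脚本验证(数值来自上文示例):

```python
# node1 示例中的第二外传数据量:D10 为 7+2,D11 为 3+8
out2 = {"D10": 7 + 2, "D11": 3 + 8}

# 网络通信数据量为第二外传数据量的均值
mean = sum(out2.values()) / len(out2)          # 10.0

# 第二差异量 SDIS = 第二外传数据量 - 网络通信数据量
sdis = {k: v - mean for k, v in out2.items()}
print(sdis)   # {'D10': -1.0, 'D11': 1.0}
```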
接着,可以确定D10对应的目标数组:D01和D11通信,而D11对应的第二外传数据量指示了D10和D11存储的标记终加速器标识tDevID=D01的数据的数据量之和,由于需要按照第二外传数据量进行传输,因此,D10需要将存储的标记终加速器标识tDevID=D01的数据传输至D11,D10存储的标记终加速器标识tDevID=D01的数据的数据量为3,则可以确定D10对应的目标数组[D102,D11,3],表示D10向D11发送自身存储的D102指示的数据的数据量为3。另外,D10对应的目标数组还包括[D101、D10、7]、[D103、D10、1]、[D104、D11、2]。
按照相似的方式可以确定D11对应的目标数组[D111,D10,2]、[D112,D11,8]、[D113、D10、7]、[D114、D11、1]。
之后,对于每个目标数组,确定该目标数组下的数据,并为该数据标记终加速器标识tDevID、始加速器标识oDevID、节点内加速器标识nDevID、通信加速器标识cDevID。
示例地,对于[D102,D11,3],D102指示的数据在标记终加速器标识tDevID=D01、始加速器标识oDevID=D10的基础上,还可以为D102指示的数据标记节点内加速器标识nDevID=D11,通信加速器标识cDevID=D01。
示例地,对于[D103、D10、1],D103指示的数据在标记终加速器标识tDevID=D10、始加速器标识oDevID=D10的基础上,可以为D103指示的数据标记节点内加速器标识nDevID=D10,通信加速器标识cDevID为空或者D10。
其他类同,不再赘述。
则按照上述方式处理后,加速器D00、D01、D10、D11存储的数据的标记的节点内加速器nDevID标识如图9b所示。
具体地,处理后,加速器D00、D01、D10、D11存储的相关信息如下表2所示:
表2
下面对通信方案进行描述。
Step1:node0中的D00和D01将未在节点内加速器标识nDevID指示的加速器D内的数据传输至指示的加速器D。
具体地,node0中的D00和D01之间通信,D00将标记节点内加速器标识nDevID=D01的数据发往D01,保留标记节点内加速器标识nDevID=D00的数据,D01将标记节点内加速器标识nDevID=D00的数据发往D00,保留标记节点内加速器标识nDevID=D01的数据;node1中的D10和D11之间通信,D10将标记节点内加速器标识nDevID=D11的数据发往D11,保留标记节点内加速器标识nDevID=D10的数据,D11将标记节点内加速器标识nDevID=D10的数据发往D10,保留标记节点内加速器标识nDevID=D11的数据,得到的通信结果如图9c所示。
Step2:node0中的D00和node1中的D10之间通过网卡通信,node0中的D01和node1中的D11之间通过网卡通信,D00、D01、D10、D11各自将数据发往该数据标记的通信加速器标识cDevID指示的加速器D,该数据标记的终加速器标识tDevID和标记的通信加速器标识cDevID指示的加速器D位于同一节点。
具体地,node0中的D00和node1中的D10之间通信,D00保留标记终加速器标识tDevID=D00的数据,将标记终加速器标识tDevID=D10的数据发往其标记的通信加速器标识cDevID=D10指示的加速器D,D10保留标记终加速器标识tDevID=D10的数据,将标记终加速器标识tDevID=D00的数据发往其标记的通信加速器标识cDevID=D00指示的加速器D;node0中的D01和node1中的D11之间通信,D01保留标记终加速器标识tDevID=D01的数据,将标记终加速器标识tDevID=D10和D11的数据发往其标记的通信加速器标识cDevID=D11指示的加速器D,D11保留标记终加速器标识tDevID=D11的数据,将标记终加速器标识tDevID=D01的数据发往其标记的通信加速器标识cDevID=D01指示的加速器D,得到的通信结果如图9d所示。
Step3:node1中的D10和D11之间通信,D11保留标记终加速器标识tDevID=D11的数据,将标记终加速器标识tDevID=D10的数据发往D10,得到的通信结果如图9e所示。
可以理解的是,本申请的实施例中的处理器可以是中央处理单元(central processing unit,CPU),还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、晶体管逻辑器件,硬件部件或者其任意组合。通用处理器可以是微处理器,也可以是任何常规的处理器。
本申请的实施例中的方法步骤可以通过硬件的方式来实现,也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器(programmable rom,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者通过所述计算机可读存储介质进行传输。所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。
Claims (10)
- 一种通信方法,其特征在于,应用于第一节点,所述第一节点中互连的m个第一加速器与第二节点中互连的m个第二加速器一一对应,所述m个第一加速器各自和其对应的第二加速器通过自身部署的网卡通信,所述m个第一加速器存储的多份数据各自标记有终加速器标识,所述m为大于等于2的正整数,所述方法包括:所述m个第一加速器通过自身互连的链路,调整第一数据在所述m个第一加速器的分布,使得各自调整后自身具有的所述第一数据中的数据的数据量差异小于等于预设阈值;其中,所述第一数据为所述多份数据中标记分别指示m个第二加速器的终加速器标识的数据;所述m个第一加速器各自通过自身配置的网卡,将调整后自身具有的所述第一数据中的数据发往对应的第二加速器,以使m个第二加速器通过自身互连的链路,调整接收到的所述第一数据在m个第二加速器的分布,使得所述m个第二加速器具有所述第一数据中标记终加速器标识指示自身的数据。
- 根据权利要求1所述的方法,其特征在于,所述方法还包括:m个第一加速器各自通过自身的网卡接收对应的第二加速器发送的第二数据;其中,所述第二数据包括未在标记的终加速器标识指示的第一加速器内的数据;所述m个第一加速器通过自身互连的链路,将未在标记的终加速器标识指示的第一加速器内的数据传输至指示的第一加速器。
- 根据权利要求1所述的方法,其特征在于,所述多份数据标记的多个终加速器标识包括指示第一加速器的终加速器标识,所述方法还包括:所述m个第一加速器通过自身互连的链路,各自将未在标记的终加速器标识指示的第一加速器内的数据传输至指示的第一加速器。
- 根据权利要求1至3任一所述的方法,其特征在于,所述方法还包括:基于所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述第一数据在所述m个第一加速器的第一分布策略;所述m个第一加速器通过自身互连的链路,调整第一数据在所述m个第一加速器的分布,包括:所述m个第一加速器通过自身互连的链路,按照所述第一分布策略调整所述第一数据在所述m个第一加速器的分布。
- 根据权利要求4所述的方法,其特征在于,所述基于所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述第一数据在所述m个第一加速器的第一分布策略,包括:确定m个第一加速器各自对应的第一外传数据量;其中,所述第一外传数据量指示了对应的第一加速器存储的所述第一数据中的数据的数据量之和;基于所述m个第一加速器各自对应的第一外传数据量、所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述第一数据在所述m个第一加速器的第一分布策略。
- 根据权利要求4所述的方法,其特征在于,所述基于所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述第一数据在所述m个第一加速器的第一分布策略,包括:基于所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述m个第一加速器各自对应的第二外传数据量;其中,所述第二外传数据量指示了所述第一数据中标记的终加速器标识为通信加速器标识的数据的数据量,所述通信加速器标识指示了对应的第一加速器通过网卡通信的第二加速器;基于所述m个第一加速器各自对应的第二外传数据量、所述m个第一加速器存储的多份数据各自的数据量和标记的终加速器标识,确定所述第一数据在所述m个第一加速器的第一分布策略。
- 一种通信系统,其特征在于,包括第一节点和第二节点,所述第一节点中互连的m个第一加速器与所述第二节点中互连的m个第二加速器一一对应,所述m个第一加速器各自和其对应的第二加速器通过自身部署的网卡通信,所述m个第一加速器存储的多份数据各自标记有终加速器标识,所述m为大于等于2的正整数,所述第一节点用于执行如权利要求1-6任一所述的方法。
- 一种服务器,其特征在于,包括:至少一个存储器,用于存储程序;互连的m个第一加速器,与第二节点中互连的m个第二加速器一一对应,所述m个第一加速器中每个 第一加速器和其对应的第二加速器通过自身部署的网卡通信,所述m个第一加速器存储的多份数据各自标记有终加速器标识,所述m个第一加速器执行所述至少一个存储器存储的程序,实现如权利要求1-6任一所述的方法,所述m为大于等于2的正整数。
- 一种服务器,其特征在于,包括:至少一个存储器,用于存储程序;互连的m个第一加速器,与第二节点中互连的m个第二加速器一一对应,所述m个第一加速器中每个第一加速器和其对应的第二加速器通过自身部署的网卡通信,所述m个第一加速器存储的多份数据各自标记有终加速器标识,所述m个第一加速器执行所述至少一个存储器存储的程序,实现如权利要求1-3任一所述的方法,所述m为大于等于2的正整数;处理器,用于执行所述至少一个存储器存储的程序,实现如权利要求4-6任一所述的方法。
- 一种计算机存储介质,所述计算机存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行如权利要求1-6任一所述的方法。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202211284784.6 | 2022-10-20 | |
CN202211284784.6A(公开号 CN117955901A)(zh) | 2022-10-20 | 2022-10-20 | 通信方法、系统及服务器
Publications (1)
Publication Number | Publication Date
---|---
WO2024082670A1 (zh) | 2024-04-25
Family

ID=90736803

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2023/101734 | 通信方法、系统及服务器 | 2022-10-20 | 2023-06-21
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018055648A (ja) * | 2016-09-30 | 2018-04-05 | 株式会社日立製作所 | アクセラレーションシステム及びアクセラレーション方法 |
US20190042512A1 (en) * | 2017-08-04 | 2019-02-07 | Dell Products L.P. | Systems and methods for interconnecting gpu accelerated compute nodes of an information handling system |
US20200117990A1 (en) * | 2018-10-10 | 2020-04-16 | Korea Advanced Institute Of Science And Technology | High performance computing system for deep learning |
CN113296718A (zh) * | 2021-07-27 | 2021-08-24 | 阿里云计算有限公司 | 数据处理方法以及装置 |
CN113708979A (zh) * | 2021-09-29 | 2021-11-26 | 深圳市腾讯网域计算机网络有限公司 | 网络加速的方法和装置 |
CN114979000A (zh) * | 2022-01-21 | 2022-08-30 | 华为技术有限公司 | 一种数据传输系统、方法及相关设备 |
- 2022-10-20 CN CN202211284784.6A patent/CN117955901A/zh active Pending
- 2023-06-21 WO PCT/CN2023/101734 patent/WO2024082670A1/zh unknown
Also Published As
Publication number | Publication date |
---|---|
CN117955901A (zh) | 2024-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Client-edge-cloud hierarchical federated learning | |
WO2018099084A1 (zh) | Neural network model training method, apparatus, chip and system | |
WO2020168761A1 (zh) | Method and apparatus for training a model | |
WO2019148960A1 (zh) | Data analysis apparatus, system and method | |
CN112738820A (zh) | Dynamic deployment method and apparatus for a service function chain, and computer device | |
WO2021052374A1 (zh) | Network congestion control method, node, system and storage medium | |
CN112532409B (zh) | Network parameter configuration method and apparatus, computer device, and storage medium | |
CN107454019B (zh) | Dynamic bandwidth allocation method, apparatus, device and storage medium for a software-defined network | |
JP7451689B2 (ja) | Network congestion handling method, model updating method, and related apparatus | |
CN109379230B (zh) | Service function chain deployment method based on breadth-first search | |
US20220156633A1 (en) | System and method for adaptive compression in federated learning | |
Chen et al. | Latency minimization for mobile edge computing networks | |
US20230042747A1 (en) | Message Processing Method and Device, Storage Medium, and Electronic Device | |
US11888703B1 (en) | Machine learning algorithms for quality of service assurance in network traffic | |
CN109710612A (zh) | Vector index recall method and apparatus, electronic device, and storage medium | |
CN112054968A (zh) | Scheduling method and apparatus for large-scale time-sensitive networks, and electronic device | |
GB2572537A (en) | Generating or obtaining an updated neural network | |
US10474644B2 (en) | Systems and methods for optimizing selection of a replication data node in a distributed file system | |
CN113765825B (zh) | Planning method and system architecture for chained service flow scheduling | |
WO2024082670A1 (zh) | Communication method, system and server | |
WO2021238508A1 (zh) | Data processing method, apparatus and device | |
Jain et al. | Data-prediction model based on stepwise data regression method in wireless sensor networks | |
CN113301126A (zh) | Edge computing method suitable for heterogeneous networking gateways | |
CN116127400B (zh) | Sensitive data identification system and method based on heterogeneous computing, and storage medium | |
CN101453361A (zh) | Website request queue management method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23878667; Country of ref document: EP; Kind code of ref document: A1 |