CN116938727A - Scatter-reduce processing method and device, and readable storage medium


Info

Publication number: CN116938727A
Application number: CN202310682761.9A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: target, data, servers, overhead, node
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Inventors: 李丹 (Li Dan), 熊典 (Xiong Dian), 丛鹏宇 (Cong Pengyu)
Assignees: China Mobile Communications Corp Research Institute; Tsinghua University
Application filed by China Mobile Communications Corp Research Institute and Tsinghua University
Priority to CN202310682761.9A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/12 Discovery or management of network topologies
    • H04L 41/08 Configuration management of networks or network elements
    • H04L 41/0803 Configuration setting
    • H04L 41/0813 Configuration setting characterised by the conditions triggering a change of settings
    • H04L 41/082 Configuration setting characterised by the conditions triggering a change of settings, the condition being updates or upgrades of network functionality
    • H04L 41/0896 Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a scatter-reduce processing method and device, and a readable storage medium. The method includes: acquiring first data placement information and second data placement information in a tree network topology, where the first data placement information includes the first data information in each target server before a scatter-reduce operation is performed on a target node, and the second data placement information includes the second data information in each target server after the scatter-reduce operation is performed on the target node; in a case where the first data information in N1 target servers under a target child node is to be updated, relocating the first data information in the N1 target servers into N2 target servers under the target child node, where N2 is related to the convergence ratio of the target child node; obtaining updated first data placement information according to the updated first data information in the N2 target servers; and performing scatter-reduce processing according to the updated first data placement information, the second data placement information and a scatter-reduce algorithm.

Description

Scatter-reduce processing method and device, and readable storage medium
Technical Field
The present invention relates to the field of wireless communications technologies, and in particular to a scatter-reduce processing method and device, and a readable storage medium.
Background
Existing multi-machine communication frameworks rely on empirical parameters when selecting an AllReduce (full reduction) scheme. An AllReduce scheme selected on the basis of empirical parameters performs optimally only in simple networks (for example, networks in which all nodes are under the same switch, or fully connected networks), and has difficulty maintaining high performance in a relatively complex network topology (for example, a hierarchical tree network topology).
Disclosure of Invention
The invention aims to provide a scatter-reduce processing method and device, and a readable storage medium, so as to solve the problem that an AllReduce scheme selected in the existing manner has difficulty maintaining high performance under a tree network topology.
In order to achieve the above object, an embodiment of the present invention provides a scatter-reduce processing method, including:
acquiring first data placement information and second data placement information in a tree network topology, where the tree network topology includes at least two servers and at least two switches, the first data placement information includes the first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information includes the second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology that takes the target node as its root node;
in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocating the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, where N1 is the total number of servers included in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to a convergence ratio of the target child node, and the target child node includes at least one child node of the target node;
obtaining updated first data placement information according to the updated first data information in the N2 target servers; and
performing scatter-reduce processing according to the updated first data placement information, the second data placement information and a scatter-reduce algorithm.
Optionally, before the relocating of the first data information in the N1 target servers into the N2 target servers under the target child node, the method further includes:
determining e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, where d is the total number of child nodes contained in the target child node; and
determining N2 according to the number of target servers under the e child nodes.
Optionally, the method of the embodiment of the present invention further includes:
calculating a first overhead duration and a second overhead duration, where the first overhead duration is the duration required for the N1 servers under the target child node to transmit their data out of the target child node, and the second overhead duration is the duration required for the N2 servers to transmit their data out of the target child node; and
in a case where the second overhead duration is smaller than the first overhead duration, determining to update the first data information placed in the N1 target servers under the target child node.
Optionally, the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, including the number of communication steps and the delay incurred by starting one step of communication;
a bandwidth overhead parameter, including the total data amount transferred on one physical link and the overhead required to transmit a unit amount of data;
a computation overhead parameter, including the data amount of aggregation operations and the time required by the processing unit for a single operation;
a memory read-write overhead parameter, including the total data amount read from and written to memory and the time required to read or write a unit of data; and
a bandwidth contention parameter, including the fan-in number of a communication, the threshold fan-in number at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·Bε;
where T represents the first overhead duration or the second overhead duration, A represents the number of communication steps, α represents the delay incurred by starting one step of communication, B represents the total data amount transferred on one physical link, β represents the overhead required to transmit a unit amount of data, C represents the data amount of aggregation operations, γ represents the time required by the processing unit for a single operation, D represents the total data amount read from and written to memory, δ represents the time required to read or write a unit of data, w represents the fan-in number of the communication, w_t represents the threshold fan-in number at which bandwidth congestion occurs, and ε represents the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, in a case where the subtree under the target node has an asymmetric structure, the scatter-reduce algorithm is that data is sent directly from a first server to a second server, where the first server is the server in which the data is placed before the scatter-reduce operation is performed, and the second server is the server in which the data is placed after the scatter-reduce operation is performed.
The embodiment of the invention also provides a scatter-reduce processing device, including:
a first acquisition module, configured to acquire first data placement information and second data placement information in a tree network topology, where the tree network topology includes at least two servers and at least two switches, the first data placement information includes the first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information includes the second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology that takes the target node as its root node;
a relocation module, configured to, in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocate the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, where N1 is the total number of servers included in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to a convergence ratio of the target child node, and the target child node includes at least one child node of the target node;
a second acquisition module, configured to obtain updated first data placement information according to the updated first data information in the N2 target servers; and
a processing module, configured to perform scatter-reduce processing according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm.
Optionally, the device of the embodiment of the present invention further includes:
a first determining module, configured to determine, before the relocation module relocates the first data information in the N1 target servers into the N2 target servers under the target child node, e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, where d is the total number of child nodes contained in the target child node; and
a second determining module, configured to determine N2 according to the number of target servers under the e child nodes.
Optionally, the device of the embodiment of the present invention further includes:
a calculation module, configured to calculate a first overhead duration and a second overhead duration, where the first overhead duration is the duration required for the N1 servers under the target child node to transmit their data out of the target child node, and the second overhead duration is the duration required for the N2 servers to transmit their data out of the target child node; and
a third determining module, configured to determine, in a case where the second overhead duration is smaller than the first overhead duration, to update the first data information placed in the N1 target servers under the target child node.
Optionally, the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, including the number of communication steps and the delay incurred by starting one step of communication;
a bandwidth overhead parameter, including the total data amount transferred on one physical link and the overhead required to transmit a unit amount of data;
a computation overhead parameter, including the data amount of aggregation operations and the time required by the processing unit for a single operation;
a memory read-write overhead parameter, including the total data amount read from and written to memory and the time required to read or write a unit of data; and
a bandwidth contention parameter, including the fan-in number of a communication, the threshold fan-in number at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·Bε;
where T represents the first overhead duration or the second overhead duration, A represents the number of communication steps, α represents the delay incurred by starting one step of communication, B represents the total data amount transferred on one physical link, β represents the overhead required to transmit a unit amount of data, C represents the data amount of aggregation operations, γ represents the time required by the processing unit for a single operation, D represents the total data amount read from and written to memory, δ represents the time required to read or write a unit of data, w represents the fan-in number of the communication, w_t represents the threshold fan-in number at which bandwidth congestion occurs, and ε represents the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, in a case where the subtree under the target node has an asymmetric structure, the scatter-reduce algorithm is that data is sent directly from a first server to a second server, where the first server is the server in which the data is placed before the scatter-reduce operation is performed, and the second server is the server in which the data is placed after the scatter-reduce operation is performed.
The embodiment of the invention also provides a scatter-reduce processing device, including: a transceiver and a processor;
the processor is configured to: acquire first data placement information and second data placement information in a tree network topology, where the tree network topology includes at least two servers and at least two switches, the first data placement information includes the first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information includes the second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology that takes the target node as its root node; in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocate the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, where N1 is the total number of servers included in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to a convergence ratio of the target child node, and the target child node includes at least one child node of the target node; obtain updated first data placement information according to the updated first data information in the N2 target servers; and perform scatter-reduce processing according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm.
The embodiment of the invention also provides a scatter-reduce processing device, including: a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; where the processor, when executing the program or instructions, implements the steps of the scatter-reduce processing method described above.
The embodiment of the invention also provides a readable storage medium on which a program or instructions are stored, where the program or instructions, when executed by a processor, implement the steps of the scatter-reduce processing method described above.
The technical scheme of the invention has the following beneficial effects:
In the embodiment of the invention, first data placement information and second data placement information in a tree network topology are acquired; in a case where it is determined to update the first data information placed in N1 target servers under a target child node, the first data information in the N1 target servers is relocated into N2 target servers under the target child node, and the updated first data information in the N2 target servers is obtained; updated first data placement information is obtained according to the updated first data information in the N2 target servers; and scatter-reduce processing is performed according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm. Since N2 is related to the convergence ratio of the target child node, relocating the first data information from the N1 target servers into the N2 target servers under the target child node makes it possible to use the uplink bandwidth of the target child node effectively without causing bandwidth congestion. For example, if the uplink bandwidth of the target child node is 1 MB/s and the downlink bandwidth of each of its child nodes is 0.5 MB/s, the convergence ratio of the target child node is 0.5 and N2 is 2. Based on the scheme of the embodiment of the invention, an AllReduce scheme that is efficient and effectively avoids network congestion can be constructed for any tree network topology, so that the AllReduce scheme maintains high performance under tree network topologies.
Drawings
FIG. 1 is a schematic flow chart of a scatter-reduce processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a symmetric tree network topology according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an asymmetric tree network topology according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the data placement scheme for the network topology shown in FIG. 2;
FIG. 5 is a schematic diagram of the data placement scheme for the network topology shown in FIG. 3;
FIG. 6 is a schematic block diagram of a scatter-reduce processing device according to an embodiment of the present invention;
FIG. 7 is a block diagram of a scatter-reduce processing device according to an embodiment of the present invention;
FIG. 8 is a block diagram of a scatter-reduce processing device according to an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present application, it should be understood that the sequence numbers of the following processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In addition, the terms "system" and "network" are often used interchangeably herein.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, and B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
The following description is presented to enable one skilled in the art to better understand the embodiments of the present application.
Full reduction (AllReduce) is an operation that fully synchronizes data across multiple processes and machines. It is widely used in distributed systems and is an important factor affecting their performance. Distributed machine learning is a typical AllReduce application scenario, where the speed of data synchronization between computing nodes directly affects the overall performance of the system. Improving AllReduce performance therefore improves the efficiency of the distributed system. Typical AllReduce methods include the following (a sketch of the ring variant follows this list):
Ring AllReduce: the method uses a ring-like logical topology, where data flows unidirectionally between adjacent nodes on the ring. The method has the advantages of supporting any node number and avoiding the problem of bandwidth competition caused by a 'many-to-one' traffic mode. However, communication delays can have a serious impact on the performance of the method due to the excessive number of communication steps, depending on the chain length; in addition, the method only carries out merging calculation on two blocks of data at a time, so that the memory read-write expense is too high.
Parameter Server (PS): in this method, each node is responsible for collecting and merging a portion of the data. Each node communicates with all other nodes, forming a "many-to-many" traffic pattern. Its advantages are that it supports any number of nodes, has low memory overhead, and has a constant communication dependence chain length of 2. However, "many-to-many" communication can cause network bandwidth contention, which degrades network throughput and reduction performance.
Recursive Halving and Doubling: this method adopts a binary-tree logical topology and performs the reduction by halving the data amount and doubling the communication distance layer by layer. Its advantages are that bandwidth contention is avoided and the communication dependence chain length is exponentially shorter than that of Ring AllReduce. However, when the number of nodes is not an integer power of two, additional communication overhead is introduced, which limits the method's applicability; in addition, the method merges only two blocks of data at a time, introducing excessive memory read-write overhead.
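For illustration, the following Python sketch simulates the scatter-reduce phase of Ring AllReduce on toy data; the function name and data layout are assumptions made for this sketch, not part of the patent.

# Minimal simulation of the scatter-reduce phase of Ring AllReduce.
# Each of the n nodes starts with its own copy of n blocks; after n-1
# steps, node i holds the fully reduced block (i + 1) mod n.
def ring_scatter_reduce(node_data):
    """node_data[i][j] = block j held by node i (a number, for simplicity)."""
    n = len(node_data)
    for step in range(n - 1):
        # In step s, node i sends block (i - s) mod n to node (i + 1) mod n,
        # which accumulates it into its own copy of that block.
        for src in range(n):
            blk = (src - step) % n
            dst = (src + 1) % n
            node_data[dst][blk] += node_data[src][blk]
    return node_data

if __name__ == "__main__":
    data = [[10 * i + j for j in range(4)] for i in range(4)]  # 4 nodes, 4 blocks
    out = ring_scatter_reduce(data)
    for i in range(4):
        print("node", i, "fully owns block", (i + 1) % 4, "=", out[i][(i + 1) % 4])

The n - 1 sends per node are the "excessive number of communication steps" criticized above: the dependence chain grows linearly with the ring length.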
Common multi-machine communication libraries such as MPI and NCCL contain several implementations of AllReduce methods. In practice, they select a specific method for the AllReduce operation according to the number of nodes participating in the synchronization and the data size. However, the selection is based on empirical parameters that ignore the structural and hardware differences between clusters, and it performs poorly in today's environment of rapid hardware iteration.
As noted above, existing multi-machine communication frameworks rely on empirical parameters when selecting an AllReduce scheme. A scheme selected in this way performs optimally only in simple networks (for example, networks in which all nodes are under the same switch, or fully connected networks), and has difficulty maintaining high performance in a relatively complex network topology (for example, a hierarchical tree network topology).
In addition, the generation and selection means of conventional AllReduce methods (schemes) have shortcomings. Specifically:
With the rapid development of networks, network bandwidth is approaching memory bandwidth, so the memory access overhead introduced by the reduction computation is no longer negligible. Existing selection means do not take this into account.
As network scale grows, the number of nodes participating in AllReduce increases, which creates serious network contention. Network contention leads to bandwidth congestion, so that the actual bandwidth on a congested link is lower than the maximum bandwidth it theoretically supports. Existing selection means do not fully consider this problem.
In view of the problems of the existing AllReduce methods and the defects of the existing selection means, the invention provides a scatter-reduce processing method that is generally applicable in common tree-network cluster environments and can select the optimal logical topology according to the specific environment, thereby improving AllReduce performance.
As shown in fig. 1, an embodiment of the present invention provides a scatter-reduce processing method, including:
step 101: first data placement information and second data placement information in a tree network topology are obtained, the tree network topology comprises at least two servers and at least two switches, the first data placement information comprises first data information placed in each target server before a scatter reduction operation is performed on a target node, the second data placement information comprises second data information placed in each target server after the scatter reduction operation is performed on the target node, the target node is the switch, and the target server is a server in a subtree taking the target node as a root node in the tree network topology.
In a tree network topology, the lowest level (the leaf nodes) consists of servers, and all intermediate nodes above them are switches. In the embodiment of the invention, the data placement scheme is generated layer by layer from the bottom up. Assume the tree network topology contains N servers in total; the data on each server is then split into N blocks. Since the data placement scheme is generated recursively from the bottom up, the embodiment of the present invention describes the generation scheme under a single node only (i.e., the target node described above).
Assume the target node is denoted A and has c child nodes directly connected to it, denoted C_i (i ∈ {0, 1, …, c-1}). These child nodes may be switches or servers. For the subtree with C_i as its root node, the number of servers it contains (i.e., the number of leaf nodes) is denoted n_i. Before the scatter-reduce operation is performed on A, each of C_0, C_1, …, C_{c-1} must already have completed its own scatter-reduce (a recursive requirement), namely: for any subtree C_i, each of its n_i servers has completed the reduction of N/n_i data blocks. Note that if C_i is itself a server (a leaf node, so n_i = 1), its scatter-reduce is trivially complete. This initial data placement state of the stepwise scatter-reduce at node A is the placement state corresponding to the first data placement information. Subsequently, after the scatter-reduce on A is complete, the subtree with A as its root node contains n = Σ n_i servers in total, each of which has completed the reduction of N/n data blocks; this final placement state of the step is the placement state corresponding to the second data placement information.
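The bookkeeping of these placement states can be illustrated with a small Python sketch; the nested-tuple tree encoding and the function names are assumptions made for illustration, not an interface prescribed by the patent.

# Sketch of the bottom-up placement bookkeeping described above. A leaf is a
# server id (an int); an inner node (switch) is a tuple of children. This
# only tracks how many of the N blocks each server has reduced before and
# after the scatter-reduce at each switch.
def count_servers(node):
    return 1 if isinstance(node, int) else sum(count_servers(c) for c in node)

def placement_states(node, total_blocks):
    """Yield (switch, blocks_per_server_before, blocks_per_server_after)."""
    if isinstance(node, int):
        return
    for child in node:
        yield from placement_states(child, total_blocks)  # recurse bottom-up
    n = count_servers(node)
    before = [total_blocks // count_servers(c) for c in node]  # per child subtree
    after = total_blocks // n
    yield node, before, after

if __name__ == "__main__":
    # fig. 2 style topology: sw2 over sw0 (servers 0-2) and sw1 (servers 3-5)
    tree = ((0, 1, 2), (3, 4, 5))
    N = count_servers(tree)  # the data on each server is split into N blocks
    for sw, before, after in placement_states(tree, N):
        print("switch over", sw, ": before =", before, ", after =", after)

For the fig. 2 topology this reproduces the counts above: each server under sw0 or sw1 holds N/3 = 2 reduced blocks before the operation at sw2, and N/6 = 1 block afterwards.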
Fig. 2 shows a symmetric tree network topology and fig. 3 shows an asymmetric tree network topology, where sw2 denotes the target node and sw0 and sw1 denote child nodes of the target node. In fig. 2, there are three servers identified as 0, 1 and 2 under sw0, and three servers identified as 3, 4 and 5 under sw1. In fig. 3, there are three servers identified as 0, 1 and 2 under sw0, and four servers identified as 3, 4, 5 and 6 under sw1.
The data placement scheme of the network topology shown in fig. 2 is described below in connection with fig. 4.
As shown in fig. 4, in the initial state the six servers identified as 0, 1, 2, 3, 4 and 5 store the six data blocks a, b, c, d, e and f, respectively. Before the scatter-reduce operation is performed on sw2, the server identified as 0 holds the reduced data block a of the three servers 0, 1 and 2, i.e. [0-2].a, and likewise the block [0-2].b; the server identified as 1 holds [0-2].c and [0-2].d; and the server identified as 2 holds [0-2].e and [0-2].f.
Similarly, before the scatter-reduce operation is performed on sw2, the server identified as 3 holds the blocks [3-5].a and [3-5].b of the three servers 3, 4 and 5; the server identified as 4 holds [3-5].c and [3-5].d; and the server identified as 5 holds [3-5].e and [3-5].f.
After the scatter-reduce operation is performed on sw2, the server identified as 0 holds the data block a reduced over all six servers 0, 1, 2, 3, 4 and 5, i.e. [0-5].a; the server identified as 1 holds [0-5].c; the server identified as 2 holds [0-5].e; the server identified as 3 holds [0-5].b; the server identified as 4 holds [0-5].d; and the server identified as 5 holds [0-5].f.
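As a quick check of this walkthrough (a minimal Python sketch; the mapping below is transcribed from fig. 4, not produced by the patent's algorithm), the end state is a valid scatter-reduce result exactly when each of the six blocks is fully reduced on exactly one server:

# End state of fig. 4: server id -> block it fully owns after scatter-reduce.
final_placement = {0: "a", 1: "c", 2: "e", 3: "b", 4: "d", 5: "f"}
# Each of the six blocks a-f must be owned by exactly one of the six servers.
assert sorted(final_placement.values()) == list("abcdef")
print("fig. 4 end state is a valid scatter-reduce placement")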
The data placement scheme of the network topology shown in fig. 3 is described below in connection with fig. 5.
As shown in fig. 5, in the initial state the seven servers identified as 0, 1, 2, 3, 4, 5 and 6 store the seven data blocks a, b, c, d, e, f and g, respectively. Before the scatter-reduce operation is performed on sw2, the server identified as 0 holds the reduced data blocks [0-2].a, [0-2].b and [0-2].c of the three servers 0, 1 and 2; the server identified as 1 holds [0-2].d and [0-2].e; and the server identified as 2 holds [0-2].f and [0-2].g.
Similarly, before the scatter-reduce operation is performed on sw2, the server identified as 3 holds the blocks [3-6].a and [3-6].b of the four servers 3, 4, 5 and 6; the server identified as 4 holds [3-6].c and [3-6].d; the server identified as 5 holds [3-6].e and [3-6].f; and the server identified as 6 holds [3-6].g.
After the scatter-reduce operation is performed on sw2, the server identified as 0 holds the data block a reduced over all seven servers, i.e. [0-6].a; the server identified as 1 holds [0-6].d; the server identified as 2 holds [0-6].f; the server identified as 3 holds [0-6].b; the server identified as 4 holds [0-6].c; the server identified as 5 holds [0-6].e; and the server identified as 6 holds [0-6].g.
Step 102: in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocating the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers; N1 is the total number of servers contained in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to the convergence ratio of the target child node, and the target child node includes at least one child node of the target node.
Here, relocating the first data information from the N1 target servers into the N2 target servers under the target child node reduces the number of nodes participating in upper-layer communication.
Step 103: obtaining updated first data placement information according to the updated first data information in the N2 target servers.
Specifically, the updated first data placement information is obtained according to the updated first data information in the N2 target servers under each target child node.
Step 104: performing scatter-reduce processing according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm.
In the embodiment of the invention, the full reduction is completed by first performing a scatter-reduce (Scatter-Reduce) and then an all-gather (AllGather), and the two operations are exactly symmetrical. It therefore suffices to construct a scatter-reduce solution and then reverse it to obtain the all-gather solution, which yields the complete AllReduce solution.
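This symmetry can be made concrete with a minimal Python sketch: if a scatter-reduce schedule is modelled as an ordered list of (source, destination, block) transfers, the all-gather schedule is the same list reversed with source and destination swapped. The schedule representation is an assumption made for illustration.

# Sketch of deriving the all-gather schedule by mirroring the scatter-reduce
# schedule: where a block was reduced toward dst, the finished result now
# flows back along the same link in the opposite direction.
def allgather_from_scatter_reduce(sr_schedule):
    return [(dst, src, block) for (src, dst, block) in reversed(sr_schedule)]

if __name__ == "__main__":
    # Toy scatter-reduce schedule for two servers and two blocks.
    sr = [(0, 1, "b"), (1, 0, "a")]
    print(allgather_from_scatter_reduce(sr))  # [(0, 1, 'a'), (1, 0, 'b')]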
In the embodiment of the invention, first data placement information and second data placement information in a tree network topology are acquired; in a case where it is determined to update the first data information placed in N1 target servers under a target child node, the first data information in the N1 target servers is relocated into N2 target servers under the target child node, and the updated first data information in the N2 target servers is obtained; updated first data placement information is obtained according to the updated first data information in the N2 target servers; and scatter-reduce processing is performed according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm. Since N2 is related to the convergence ratio of the target child node, relocating the first data information from the N1 target servers into the N2 target servers under the target child node makes it possible to use the uplink bandwidth of the target child node effectively without causing bandwidth congestion; for example, if the uplink bandwidth of the target child node is 1 MB/s and the downlink bandwidth of each of its child nodes is 0.5 MB/s, the convergence ratio of the target child node is 0.5 and N2 is 2. Based on this scheme, an AllReduce scheme that is efficient and effectively avoids network congestion can be constructed for any tree network topology, so that the AllReduce scheme maintains high performance under tree network topologies.
Optionally, before the relocating of the first data information in the N1 target servers into the N2 target servers under the target child node, the method further includes:
determining e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, where d is the total number of child nodes contained in the target child node; and
determining N2 according to the number of target servers under the e child nodes.
Optionally, e is the reciprocal of the convergence ratio of the target child node.
In one embodiment of the present invention, assume that one target child node of the target node is C_a, that C_a has d child nodes D_0, D_1, …, D_{d-1}, and that these child nodes have m_0, m_1, …, m_{d-1} servers (leaf nodes) under them respectively, so that n_a = m_0 + m_1 + … + m_{d-1}. In the initial data placement state, the subtree of C_a contains n_a servers in total, each of which has already reduced N/n_a data blocks. After the data placement state is updated, the data is rearranged onto e child nodes of C_a, namely D_0, D_1, …, D_{e-1}, under which there are p = m_0 + m_1 + … + m_{e-1} servers (leaf nodes) in total. Here e is determined according to the convergence ratio of C_a and is generally equal to the reciprocal of the convergence ratio of switch C_a, which ensures that the e child nodes can fully utilize the uplink bandwidth of switch C_a. Each of these p servers will then have reduced N/p data blocks. From this point on, the p servers represent the n_a servers under C_a in communications with servers outside C_a.
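A minimal Python sketch of this choice of e and of N2 (= p) follows, assuming the convergence ratio is given as the per-child downlink bandwidth divided by the uplink bandwidth of switch C_a, so that its reciprocal is the number of children needed to saturate the uplink; the names are illustrative.

import math

def choose_relocation_targets(convergence_ratio, servers_per_child):
    """servers_per_child[i] = m_i, the server count under child D_i."""
    d = len(servers_per_child)
    e = min(d, max(1, math.ceil(1.0 / convergence_ratio)))  # e = 1/ratio, clamped
    n2 = sum(servers_per_child[:e])  # p = m_0 + ... + m_{e-1}
    return e, n2

if __name__ == "__main__":
    # Example from the summary: uplink 1 MB/s, per-child downlink 0.5 MB/s
    # -> convergence ratio 0.5, e = 2; with one server per child, N2 = 2.
    print(choose_relocation_targets(0.5, [1, 1, 1]))  # (2, 2)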
Optionally, the method of the embodiment of the present invention further includes:
calculating a first overhead duration and a second overhead duration, where the first overhead duration is the duration required for the N1 servers under the target child node to transmit their data out of the target child node, and the second overhead duration is the duration required for the N2 servers to transmit their data out of the target child node; and
in a case where the second overhead duration is smaller than the first overhead duration, determining to update the first data information placed in the N1 target servers under the target child node.
Optionally, the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, including the number of communication steps and the delay incurred by starting one step of communication;
a bandwidth overhead parameter, including the total data amount transferred on one physical link and the overhead required to transmit a unit amount of data;
a computation overhead parameter, including the data amount of aggregation operations and the time required by the processing unit for a single operation;
a memory read-write overhead parameter, including the total data amount read from and written to memory and the time required to read or write a unit of data; and
a bandwidth contention parameter, including the fan-in number of a communication, the threshold fan-in number at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·Bε;
where T represents the first overhead duration or the second overhead duration, A represents the number of communication steps, α represents the delay incurred by starting one step of communication, B represents the total data amount transferred on one physical link, β represents the overhead required to transmit a unit amount of data, C represents the data amount of aggregation operations (such as addition or maximum operations), γ represents the time required by the processing unit for a single operation, D represents the total data amount read from and written to memory, δ represents the time required to read or write a unit of data, w represents the fan-in number of the communication, w_t represents the threshold fan-in number at which bandwidth congestion occurs, and ε represents the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
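The formula transcribes directly into a small Python helper, shown below together with the relocation decision of comparing the two overhead durations; the function and parameter names, and all numeric values, are illustrative assumptions.

def overhead_duration(A, B, C, D, w, alpha, beta, gamma, delta, epsilon, w_t):
    """T = A*alpha + B*beta + C*gamma + D*delta + max(w - w_t, 0)*B*epsilon."""
    # A: communication steps; B: data moved on one physical link;
    # C: data amount of aggregation operations; D: memory traffic;
    # w: fan-in of the communication; w_t: fan-in threshold for congestion.
    return (A * alpha + B * beta + C * gamma + D * delta
            + max(w - w_t, 0) * B * epsilon)

if __name__ == "__main__":
    fitted = dict(alpha=5e-6, beta=1e-9, gamma=2e-10, delta=5e-10,
                  epsilon=1e-10, w_t=4)  # made-up fitted parameter values
    # First overhead duration (N1 servers sending out, high fan-in) versus
    # second overhead duration (N2 servers after relocation, low fan-in):
    t1 = overhead_duration(A=3, B=6e8, C=6e8, D=1.2e9, w=6, **fitted)
    t2 = overhead_duration(A=4, B=6e8, C=6e8, D=1.2e9, w=2, **fitted)
    print(t1, t2, "relocate" if t2 < t1 else "keep")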
In an embodiment of the present invention, the above formula may be described as an overhead model, which improves on the existing overhead model. The conventional overhead model contains only three terms: communication delay, bandwidth overhead and computation overhead. With the rapid development of data centers and network technologies, memory access overhead has become increasingly significant, and the negative effects of communication congestion have also become increasingly significant; neither can be represented in the traditional model, so the traditional model cannot guide the design and selection of AllReduce schemes in modern clusters. The overhead model in the embodiment of the invention adds memory read-write overhead and bandwidth contention factors (e.g., w_t and ε), so the model can accurately estimate the task time of a networked distributed system. For a given cluster, the six undetermined parameters in the formula need to be determined: α, β, γ, δ, ε and w_t. Several Co-localized PS timing tests are run with different numbers of participating nodes, and the test results are then substituted into the formula to fit the specific values of the six undetermined parameters.
In the embodiment of the invention, for an actual deployment environment, an automatic test of the environment needs to be performed to collect environment information, so as to obtain the overhead model of the network system; the first overhead duration and the second overhead duration are then calculated based on the overhead model.
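One possible fitting procedure is sketched below in Python: with w_t fixed, the model is linear in (α, β, γ, δ, ε), so a grid search over w_t combined with linear least squares suffices. This is an assumption about how the fit could be done; the patent does not specify the procedure.

import numpy as np

def fit_overhead_model(samples):
    """samples: list of dicts with keys A, B, C, D, w, T (measured duration)."""
    A = np.array([s["A"] for s in samples], float)
    B = np.array([s["B"] for s in samples], float)
    C = np.array([s["C"] for s in samples], float)
    D = np.array([s["D"] for s in samples], float)
    w = np.array([s["w"] for s in samples], float)
    T = np.array([s["T"] for s in samples], float)
    best = None
    for w_t in range(0, int(w.max()) + 1):          # candidate fan-in thresholds
        X = np.stack([A, B, C, D, np.maximum(w - w_t, 0) * B], axis=1)
        coef, _, _, _ = np.linalg.lstsq(X, T, rcond=None)
        resid = np.linalg.norm(X @ coef - T)
        if best is None or resid < best[0]:
            best = (resid, w_t, coef)               # keep the best-fitting w_t
    _, w_t, (alpha, beta, gamma, delta, epsilon) = best
    return dict(alpha=alpha, beta=beta, gamma=gamma,
                delta=delta, epsilon=epsilon, w_t=w_t)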
In the embodiment of the invention, after the updated first data placement information and the second data placement information are determined, the algorithm used to go from the first data placement information to the second data placement information is selected. If all subtrees under the target child node are symmetric (as shown in fig. 2), a scatter-reduce algorithm commonly used in the industry may be employed, such as the Ring, Co-localized PS or Recursive Halving algorithm. Specifically, the overhead model of the embodiment of the invention can be used to calculate the overhead duration required by each algorithm, and the scatter-reduce algorithm with the shortest overhead duration is then selected. If the subtree under the target child node has an asymmetric structure (as shown in fig. 3), the scatter-reduce algorithm is that data is sent directly from a first server to a second server, where the first server is the server in which the data is placed before the scatter-reduce operation is performed, and the second server is the server in which the data is placed after the scatter-reduce operation is performed.
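The selection step then reduces to evaluating the overhead model for each candidate and keeping the minimum, for example (names and numbers are illustrative):

def pick_scatter_reduce(candidate_costs):
    """candidate_costs: dict mapping algorithm name -> predicted duration."""
    return min(candidate_costs, key=candidate_costs.get)

if __name__ == "__main__":
    predicted = {
        "ring": 0.84,                 # durations as produced by the overhead
        "co_localized_ps": 0.61,      # model for this subtree (numbers here
        "recursive_halving": 0.73,    # are made up)
    }
    print(pick_scatter_reduce(predicted))  # co_localized_ps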
Through a generalized definition of the tree reduction logical topology, an irregular extension design of the tree reduction logical topology, automatic detection of the actual physical environment and an accurate modeling description of that environment, the scheme of the embodiment of the invention can efficiently complete the AllReduce operation on most clusters, without being limited by physical parameters such as the cluster's node count, bandwidth or delay, and therefore has good universality.
The application scenario and embodiment of the invention are as follows:
An important application of distributed systems today is distributed machine learning. Data parallelism is an important distributed method, and AllReduce is a main scheme supporting data parallelism, so it has a significant impact on the performance of distributed machine learning. Taking the common Open MPI application as an example:
1. Implement the proposed scheme in Open MPI and compile the modified Open MPI version.
2. On a given cluster, provide the network topology model to the modified Open MPI, run real-machine tests, and automatically parameterize the overhead model.
3. Train the model using the modified Open MPI.
In this way, performance optimization is achieved without modifying the training code and is transparent to the user.
The scheme of the embodiment of the invention is a reduction-method generation scheme for tree network topologies: based on a novel overhead model, it automatically generates and selects a preferred logical topology, supports any number of nodes, reduces memory read-write overhead by adjusting the tree width, and avoids bandwidth contention through the relocation operations.
As shown in fig. 6, the embodiment of the present invention further provides a scatter-reduce processing device 600, including:
a first acquisition module 601, configured to acquire first data placement information and second data placement information in a tree network topology, where the tree network topology includes at least two servers and at least two switches, the first data placement information includes the first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information includes the second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology that takes the target node as its root node;
a relocation module 602, configured to, in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocate the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, where N1 is the total number of servers included in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to a convergence ratio of the target child node, and the target child node includes at least one child node of the target node;
a second acquisition module 603, configured to obtain updated first data placement information according to the updated first data information in the N2 target servers; and
a processing module 604, configured to perform scatter-reduce processing according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm.
Optionally, the device of the embodiment of the present invention further includes:
a first determining module, configured to determine, before the relocation module relocates the first data information in the N1 target servers into the N2 target servers under the target child node, e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, where d is the total number of child nodes contained in the target child node; and
a second determining module, configured to determine N2 according to the number of target servers under the e child nodes.
Optionally, the device of the embodiment of the present invention further includes:
a calculation module, configured to calculate a first overhead duration and a second overhead duration, where the first overhead duration is the duration required for the N1 servers under the target child node to transmit their data out of the target child node, and the second overhead duration is the duration required for the N2 servers to transmit their data out of the target child node; and
a third determining module, configured to determine, in a case where the second overhead duration is smaller than the first overhead duration, to update the first data information placed in the N1 target servers under the target child node.
Optionally, the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, including the number of communication steps and the delay incurred by starting one step of communication;
a bandwidth overhead parameter, including the total data amount transferred on one physical link and the overhead required to transmit a unit amount of data;
a computation overhead parameter, including the data amount of aggregation operations and the time required by the processing unit for a single operation;
a memory read-write overhead parameter, including the total data amount read from and written to memory and the time required to read or write a unit of data; and
a bandwidth contention parameter, including the fan-in number of a communication, the threshold fan-in number at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·Bε;
where T represents the first overhead duration or the second overhead duration, A represents the number of communication steps, α represents the delay incurred by starting one step of communication, B represents the total data amount transferred on one physical link, β represents the overhead required to transmit a unit amount of data, C represents the data amount of aggregation operations, γ represents the time required by the processing unit for a single operation, D represents the total data amount read from and written to memory, δ represents the time required to read or write a unit of data, w represents the fan-in number of the communication, w_t represents the threshold fan-in number at which bandwidth congestion occurs, and ε represents the linear ratio between the increase in time consumption caused by network contention and the fan-in number.
Optionally, in a case where the subtree under the target node has an asymmetric structure, the scatter-reduce algorithm is that data is sent directly from a first server to a second server, where the first server is the server in which the data is placed before the scatter-reduce operation is performed, and the second server is the server in which the data is placed after the scatter-reduce operation is performed.
It should be noted that this device corresponds to the above method embodiment; all implementations of the method embodiment are applicable to this device embodiment and achieve the same technical effects, which are not repeated here.
As shown in fig. 7, the embodiment of the present invention further provides a scatter-reduce processing device, including: a transceiver 720 and a processor 710;
the processor 710 is configured to: acquire first data placement information and second data placement information in a tree network topology, where the tree network topology includes at least two servers and at least two switches, the first data placement information includes the first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information includes the second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology that takes the target node as its root node; in a case where it is determined to update the first data information placed in N1 target servers under a target child node, relocate the first data information in the N1 target servers into N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, where N1 is the total number of servers included in the subtree of the tree network topology that takes the target child node as its root node, N2 is related to a convergence ratio of the target child node, and the target child node includes at least one child node of the target node; obtain updated first data placement information according to the updated first data information in the N2 target servers; and perform scatter-reduce processing according to the updated first data placement information, the second data placement information and the scatter-reduce algorithm.
It should be noted that this device, too, corresponds to the above method embodiment; all implementations of the method embodiment are applicable to this device embodiment and achieve the same technical effects, which are not repeated here.
As shown in fig. 8, an embodiment of the present invention further provides a scatter-reduce processing device, comprising: a transceiver 810, a processor 800, a memory 820, and a program or instructions stored on the memory 820 and executable on the processor 800; the processor 800, when executing the program or instructions, implements the steps of the scatter-reduce processing method described above.
The transceiver 810 is configured to receive and transmit data under the control of the processor 800.
In fig. 8, the bus architecture may comprise any number of interconnected buses and bridges linking together one or more processors, represented by the processor 800, and various memory circuits, represented by the memory 820. The bus architecture may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface provides an interface. The transceiver 810 may comprise a number of elements, i.e., a transmitter and a receiver, providing means for communicating with various other apparatus over a transmission medium. For different user devices, a user interface 830 may also be provided that is capable of connecting the externally or internally required devices, including but not limited to a keypad, a display, a speaker, a microphone, and a joystick.
The processor 800 is responsible for managing the bus architecture and general processing, and the memory 820 may store data used by the processor 800 in performing operations.
The readable storage medium of an embodiment of the present invention stores a program or instructions which, when executed by a processor, implement the steps of the scatter-reduce processing method described above and achieve the same technical effects, which are not repeated here.
The processor is the processor in the scatter-reduce processing device described in the foregoing embodiments. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is further noted that the terminals described in this specification include, but are not limited to, smartphones, tablets, and the like, and that many of the functional components described are referred to as modules in order to emphasize more particularly the independence of their implementation.
In embodiments of the invention, the modules may be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose of the module.
Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Likewise, operational data may be identified within modules and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices.
Where a module can be implemented in software, taking into account the level of existing hardware technology, one skilled in the art may, where cost is not a concern, build corresponding hardware circuitry to achieve the corresponding functions, the hardware circuitry including conventional Very Large Scale Integration (VLSI) circuits or gate arrays and existing semiconductors such as logic chips and transistors or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices, or the like.
While the exemplary embodiments described above are described with reference to the drawings, many different forms and embodiments are possible without departing from the spirit and teachings of the present invention; therefore, the present invention should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will convey the scope of the invention to those skilled in the art. In the drawings, the sizes of elements and their relative sizes may be exaggerated for clarity. The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Unless otherwise indicated, a range of values includes the upper and lower limits of the range and any subranges therebetween.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (15)

1. A scatter-reduce processing method, comprising:
acquiring first data placement information and second data placement information in a tree network topology, wherein the tree network topology comprises at least two servers and at least two switches, the first data placement information comprises first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information comprises second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology rooted at the target node;
relocating, in the case that it is determined that the first data information placed in N1 target servers under a target child node is to be updated, the first data information in the N1 target servers onto N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, wherein N1 is the total number of servers contained in the subtree of the tree network topology rooted at the target child node, N2 is related to a convergence ratio of the target child node, and the target child node comprises at least one child node of the target node;
obtaining updated first data placement information according to the updated first data information in the N2 target servers;
and performing scatter-reduce processing according to the updated first data placement information, the second data placement information, and a scatter-reduce algorithm.
2. The method of claim 1, wherein before the relocating of the first data information in the N1 target servers onto the N2 target servers under the target child node, the method further comprises:
determining e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, wherein d is the total number of child nodes contained in the target child node;
and determining the N2 according to the number of target servers under the e child nodes.
3. The method according to claim 1 or 2, further comprising:
calculating a first overhead duration and a second overhead duration, wherein the first overhead duration is the duration required for the N1 servers under the target child node to transmit the data they hold out of the target child node, and the second overhead duration is the duration required for the N2 servers to transmit the data they hold out of the target child node;
and determining, in the case that the second overhead duration is smaller than the first overhead duration, that the first data information placed in the N1 target servers under the target child node is to be updated.
4. The method according to claim 3, wherein the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, comprising the number of communication steps and the delay incurred by starting one communication step;
a bandwidth overhead parameter, comprising the total data volume transferred on one physical link and the overhead required to transmit a unit of data;
a computation overhead parameter, comprising the data volume of the aggregation operations and the time required for a single operation of the processing unit;
a memory read-write overhead parameter, comprising the total data volume read from and written to memory and the time required to read or write a unit of data;
a bandwidth contention parameter, comprising the fan-in of the communication, the fan-in threshold at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in.
5. The method of claim 4, wherein the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·B·ε;
wherein T denotes the first overhead duration or the second overhead duration, A denotes the number of communication steps, α denotes the delay incurred by starting one communication step, B denotes the total data volume transferred on one physical link, β denotes the overhead required to transmit a unit of data, C denotes the data volume of the aggregation operations, γ denotes the time required for a single operation of the processing unit, D denotes the total data volume read from and written to memory, δ denotes the time required to read or write a unit of data, w denotes the fan-in of the communication, w_t denotes the fan-in threshold at which bandwidth congestion occurs, and ε denotes the linear ratio between the increase in time consumption caused by network contention and the fan-in.
6. The method of claim 1, wherein, in the case that the subtree under the target node has an asymmetric structure, the scatter-reduce algorithm consists of data being sent directly from a first server to a second server, the first server being the server on which the data is placed before the scatter-reduce operation is performed, and the second server being the server on which the data is placed after the scatter-reduce operation is performed.
7. A scatter-reduce processing apparatus, comprising:
a first obtaining module, configured to obtain first data placement information and second data placement information in a tree network topology, wherein the tree network topology comprises at least two servers and at least two switches, the first data placement information comprises first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information comprises second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology rooted at the target node;
a relocation module, configured to relocate, in the case that it is determined that the first data information placed in the N1 target servers under the target child node is to be updated, the first data information in the N1 target servers onto N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, wherein N1 is the total number of servers contained in the subtree of the tree network topology rooted at the target child node, N2 is related to a convergence ratio of the target child node, and the target child node comprises at least one child node of the target node;
a second obtaining module, configured to obtain updated first data placement information according to the updated first data information in the N2 target servers;
and a processing module, configured to perform scatter-reduce processing according to the updated first data placement information, the second data placement information, and the scatter-reduce algorithm.
8. The apparatus according to claim 7, further comprising:
a first determining module, configured to determine, before the relocation module relocates the first data information in the N1 target servers onto the N2 target servers under the target child node, e child nodes among the d child nodes of the target child node according to the convergence ratio of the target child node, wherein d is the total number of child nodes contained in the target child node;
and a second determining module, configured to determine the N2 according to the number of target servers under the e child nodes.
9. The apparatus according to claim 7 or 8, further comprising:
the computing module is used for computing a first overhead time length and a second overhead time length, wherein the first overhead time length is the time length required by the N1 servers under the target child node to transmit the data owned by each server out of the target child node, and the second overhead time length is the time length required by the N2 servers to transmit the data owned by each server out of the target child node;
and a third determining module, configured to determine, in the case that the second overhead duration is smaller than the first overhead duration, that the first data information placed in the N1 target servers under the target child node is to be updated.
10. The apparatus of claim 9, wherein the first overhead duration and the second overhead duration are each related to the following parameters:
a communication delay parameter, comprising the number of communication steps and the delay incurred by starting one communication step;
a bandwidth overhead parameter, comprising the total data volume transferred on one physical link and the overhead required to transmit a unit of data;
a computation overhead parameter, comprising the data volume of the aggregation operations and the time required for a single operation of the processing unit;
a memory read-write overhead parameter, comprising the total data volume read from and written to memory and the time required to read or write a unit of data;
a bandwidth contention parameter, comprising the fan-in of the communication, the fan-in threshold at which bandwidth congestion occurs, and the linear ratio between the increase in time consumption caused by network contention and the fan-in.
11. The apparatus of claim 10, wherein the first overhead duration and the second overhead duration each satisfy the following formula:
T = Aα + Bβ + Cγ + Dδ + max(w - w_t, 0)·B·ε;
wherein T denotes the first overhead duration or the second overhead duration, A denotes the number of communication steps, α denotes the delay incurred by starting one communication step, B denotes the total data volume transferred on one physical link, β denotes the overhead required to transmit a unit of data, C denotes the data volume of the aggregation operations, γ denotes the time required for a single operation of the processing unit, D denotes the total data volume read from and written to memory, δ denotes the time required to read or write a unit of data, w denotes the fan-in of the communication, w_t denotes the fan-in threshold at which bandwidth congestion occurs, and ε denotes the linear ratio between the increase in time consumption caused by network contention and the fan-in.
12. The apparatus of claim 7, wherein, in the case that the subtree under the target node has an asymmetric structure, the scatter-reduce algorithm consists of data being sent directly from a first server to a second server, the first server being the server on which the data is placed before the scatter-reduce operation is performed, and the second server being the server on which the data is placed after the scatter-reduce operation is performed.
13. A scatter-reduce processing apparatus, comprising: a transceiver and a processor;
the processor is configured to: obtain first data placement information and second data placement information in a tree network topology, wherein the tree network topology comprises at least two servers and at least two switches, the first data placement information comprises first data information placed in each target server before a scatter-reduce operation is performed on a target node, the second data placement information comprises second data information placed in each target server after the scatter-reduce operation is performed on the target node, the target node is a switch, and a target server is a server in the subtree of the tree network topology rooted at the target node; in the case that it is determined that the first data information placed in N1 target servers under a target child node is to be updated, relocate the first data information in the N1 target servers onto N2 target servers under the target child node, to obtain updated first data information in the N2 target servers, wherein N1 is the total number of servers contained in the subtree of the tree network topology rooted at the target child node, N2 is related to a convergence ratio of the target child node, and the target child node comprises at least one child node of the target node; obtain updated first data placement information according to the updated first data information in the N2 target servers; and perform scatter-reduce processing according to the updated first data placement information, the second data placement information, and a scatter-reduce algorithm.
14. A scatter-reduce processing apparatus, comprising: a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; wherein the program or instructions, when executed by the processor, implement the steps of the scatter-reduce processing method according to any one of claims 1 to 6.
15. A readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the scatter-reduce processing method according to any one of claims 1 to 6.
CN202310682761.9A 2023-06-09 2023-06-09 Scattered protocol processing method and device and readable storage medium Pending CN116938727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310682761.9A CN116938727A (en) 2023-06-09 2023-06-09 Scattered protocol processing method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN116938727A true CN116938727A (en) 2023-10-24

Family

ID=88386980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310682761.9A Pending CN116938727A (en) 2023-06-09 2023-06-09 Scattered protocol processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN116938727A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110296137A1 (en) * 2010-05-28 2011-12-01 International Business Machines Corporation Performing A Deterministic Reduction Operation In A Parallel Computer
US20130018947A1 (en) * 2011-07-13 2013-01-17 International Business Machines Corporation Performing Collective Operations In A Distributed Processing System
CN111314023A (en) * 2020-02-18 2020-06-19 中国电子科技集团公司第五十四研究所 Synchronization method of tree network topology information
US20200311016A1 (en) * 2019-04-01 2020-10-01 International Business Machines Corporation Method for Flexible, Fast All-Reduce on Arbitrary Tree Topology



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination