CN115086437A - Gradient aggregation acceleration method and device based on clustering and XDP technology - Google Patents
- Publication number
- CN115086437A (application number CN202210676787.8A)
- Authority
- CN
- China
- Prior art keywords
- gradient
- computing node
- cluster
- cluster head
- aggregation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/16—Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
- H04L69/164—Adaptation or special uses of UDP protocol
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Abstract
The invention provides a gradient aggregation acceleration method and device based on clustering and XDP technology. The method comprises the following steps: determining cluster head computing nodes for pre-aggregation from a plurality of computing nodes, and clustering all the computing nodes around the cluster head computing nodes; deploying and configuring aggregation programs on all cluster head computing nodes and the server, and deploying gradient receiving programs on all computing nodes; sending the local gradient data of each non-cluster-head computing node to its designated cluster head computing node for gradient pre-aggregation; and sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation. The technical scheme of the invention effectively relieves the bandwidth bottleneck at the server, avoids the delay caused by excessive network hops, and ultimately improves the speed and performance of gradient aggregation.
Description
Technical Field
The invention belongs to the field of network Quality of Service (QoS), and particularly relates to a gradient aggregation acceleration method and device based on clustering and XDP technologies.
Background
In recent years, the revival of deep neural networks has brought breakthroughs to many application fields, from computer vision and natural language processing to recommendation systems. To achieve better performance, ever larger data sets are used to train deeper and more complex deep neural networks, causing the required computation to double every 3-4 months. Deploying a model training task in parallel on multiple machines, known as distributed DNN training, is therefore a practical way to scale computational resources and reduce training time.
To support distributed model training, there are two classic data-parallel system architectures: Parameter Server (PS) and All-Reduce (AR). In the PS architecture, the machines in the cluster are divided into two types: workers and servers. During each training iteration, the workers send their local gradients to the server for aggregation and then fetch the aggregated gradient data from the server. In the existing PS architecture, the gradient traffic between the server and the workers quickly exhausts the server's bandwidth, which in some real training jobs makes the actual training time as much as 8.7 times the ideal.
To avoid the communication bottleneck of PS, researchers have proposed the AR architecture, in which gradient aggregation is performed in a decentralized manner. Ring all-reduce is the most popular AR algorithm; its operation decomposes into a scatter phase and a gather phase. After the scatter phase, each worker has aggregated one portion of the gradient. After the gather phase, each worker has broadcast its aggregated portion to the others, so that global aggregation is achieved. While AR's ring-based aggregation solves the scalability problem, it requires more network hops to complete aggregation as more workers join distributed training. In some practical training scenarios, AR takes more than 1.5 times the optimal training time.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a gradient aggregation acceleration method and device based on clustering and XDP technologies, which realize acceleration of gradient aggregation of distributed model training.
The technical scheme of the invention is as follows:
in a first aspect, the invention provides a gradient aggregation acceleration method based on clustering and XDP technologies, which is executed by a controller in a distributed model training platform, wherein the platform further comprises a server and a plurality of computing nodes; the method comprises the following steps:
S1, determining cluster head computing nodes for pre-aggregation from the plurality of computing nodes, and clustering all the computing nodes around the cluster head computing nodes;
S2, deploying and configuring aggregation programs on all cluster head computing nodes and the server, and deploying gradient receiving programs on all computing nodes;
S3, sending the local gradient data of each non-cluster-head computing node to its designated cluster head computing node for gradient pre-aggregation, a non-cluster-head computing node being any computing node that is not a cluster head computing node;
S4, sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and S5, sending the aggregated gradient data back to the cluster head computing nodes through the server's aggregation program, so that each cluster head computing node distributes the gradient data to the other computing nodes in its cluster.
Optionally, after the S5, the method further includes:
S6, all computing nodes monitor the reception status of the receiving program; when packet loss occurs, they detect the loss and start the packet-loss retransmission program.
Optionally, the S1 includes:
s11, selecting a cluster head computing node for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
s12, according to the forwarding flow requirement of the local gradient and the actual idle resources of each cluster head computing node, assigning a unique cluster head computing node for a non-cluster head computing node, and sending the gradient data of the non-cluster head computing node to the corresponding cluster head computing node for pre-aggregation.
Optionally, the S12 specifically includes:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and clustering all the computing nodes under these constraints to form a dominating set;
the dominating set is composed of the cluster head computing nodes.
Optionally, clustering all the computing nodes according to the constraint conditions and forming a dominating set includes:
forming an undirected graph from the actual link conditions of all the computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
Optionally, the S2 specifically includes:
s21, deploying and configuring an aggregation program based on the XDP in all cluster head computing nodes and servers;
S22, deploying XDP-based gradient receiving programs on all the computing nodes, wherein the reception status of each gradient receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification program is triggered to notify the user-mode program to fetch the received complete gradient data.
Optionally, the implementation step of the notification program includes:
the user-mode program creates a server listening on a UDP port; the receiving program rewrites the content of the last received gradient-fragment IP packet into a UDP packet, sets the corresponding destination address and destination port, and passes the packet up to the kernel protocol stack, thereby informing the user program that gradient reception is complete.
Optionally, the S3 specifically includes:
after each round of model training ends, the non-cluster-head computing node divides its gradient data into several fragments, encapsulates each fragment in an IP packet, and sends the packets to the corresponding cluster head computing node for gradient pre-aggregation.
Optionally, the S5 specifically includes:
when the server finishes aggregating the same fragment from all cluster heads, it sends the aggregated gradient fragment to all cluster head computing nodes in an XDP-based multicast mode;
and each cluster head computing node sends the received aggregated gradient fragments to the computing nodes it dominates, also in an XDP-based multicast mode.
In a second aspect, an embodiment of the present invention further provides a gradient aggregation acceleration apparatus based on clustering and XDP technologies, configured in a controller in a distributed model training platform, where the platform further includes a server and a plurality of computing nodes; the device comprises:
the cluster head computing node determining module is used for selecting cluster head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes around the cluster head computing nodes;
the aggregation program deployment module is used for deploying and configuring aggregation programs on all cluster head computing nodes and the server, and deploying gradient receiving programs on all the computing nodes;
the aggregation module is used for sending the local gradient data of the non-cluster-head computing node to a designated cluster-head computing node for gradient pre-aggregation; the non-cluster-head computing node is a computing node which is not a cluster-head computing node in all computing nodes;
and for sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and the gradient issuing module is used for sending the aggregated gradient data back to the cluster head computing node through an aggregation program of the server so that the cluster head computing node sends the gradient data to other computing nodes in the cluster.
Compared with the scheme in the prior art, the invention has the advantages that:
1. The gradient aggregation of distributed deep learning is accelerated by clustering, making full use of the workers' bandwidth and computing resources. Compared with directly adding more hardware resources such as servers, this approach is cheaper and more generally applicable.
2. The invention performs gradient pre-aggregation on the cluster heads, making full use of host-side memory and overcoming the limitation that a programmable switch cannot cache a complete gradient due to insufficient memory. In addition, compared with parameter aggregation, the simple gradient summation is more CPU-friendly, and gradient aggregation performance is not limited by memory bandwidth.
3. The pre-aggregation program of the cluster head computing nodes is implemented with the eXpress Data Path (XDP) technology, which aggregates gradient packets at the NIC driver without involving the kernel network protocol stack, effectively improving gradient aggregation performance and reducing the forwarding delay of pre-aggregation. In addition, to exploit the parallel processing of multi-queue NICs, the invention adds an atomic-lock design for parallel processing, preventing concurrent-access conflicts.
4. The invention designs a clustering algorithm based on the degree-constrained minimum dominating set, which selects as few cluster head computing nodes as possible according to the actual idle resources of all workers, including computing and bandwidth resources. Each cluster head dominates a portion of the workers, so that every worker either is a cluster head or is dominated by one. The algorithm maximizes worker resource utilization when bandwidths differ, while relieving the bandwidth pressure on the server.
5. The invention uses XDP to distribute the aggregated gradient, so that the whole gradient aggregation process bypasses the protocol stack and fully exploits XDP's fast processing and forwarding. In addition, an active-reporting and reliable-aggregation scheme is designed, giving the best-effort, kernel-space-only XDP technology reliability and the ability to interact with user-space programs.
The invention discloses a gradient aggregation acceleration method based on clustering and XDP technology. First, a clustering algorithm selects the cluster head computing nodes that perform pre-aggregation; an aggregation program based on the eXpress Data Path (XDP) is deployed on the cluster head computing nodes and the server, and the cluster head computing nodes additionally deploy a TC-based aggregation program. The workers in each cluster send their trained gradient data to the cluster head computing node, which pre-aggregates it and sends the pre-aggregated gradient to the server for final aggregation. Finally, the server sends the globally aggregated gradient data back to all workers along the reverse path. The method effectively relieves the bandwidth bottleneck at the server, avoids the delay caused by excessive network hops, and ultimately improves the speed and performance of gradient aggregation. Moreover, the invention requires no extra hardware, giving it high generality, low cost, and broad application prospects.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of the gradient aggregation acceleration method based on clustering and XDP technology in the present invention;
FIG. 2 is a schematic diagram of a gradient aggregation and distribution path after worker clustering according to the present invention;
FIG. 3 is a diagram of a gradient fragment packet structure according to the present invention;
FIG. 4 is a schematic diagram of the storage structure and aggregation flow of the XDP program of the present invention;
FIG. 5 is a schematic diagram of multicast based on XDP-Multiredirect in the present invention.
Detailed Description
The above scheme is further illustrated below with specific examples. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. Conditions used in the examples may be adjusted further in practice, and unspecified conditions are the usual conditions of routine experiments.
Examples
This embodiment provides a gradient aggregation acceleration method based on clustering and XDP technology. The method runs in a cloud-hosted distributed model training platform and is executed by a controller in the platform; the platform consists of a server and several local computing nodes (workers). The flow of this embodiment is shown in Fig. 1 and comprises the following steps:
S1, determining cluster head computing nodes for pre-aggregation from the plurality of computing nodes, and clustering all the computing nodes around the cluster head computing nodes.
In this embodiment, a cluster head computing node for performing pre-aggregation is selected from the plurality of computing nodes according to network resource conditions of the computing nodes and the server; and assigning a unique cluster head computing node for a non-cluster head computing node according to the forwarding flow demand of the local gradient and the actual idle resource of each cluster head computing node, so as to send the gradient data of the non-cluster head computing node to the corresponding cluster head computing node for pre-aggregation.
Specifically, corresponding constraint conditions are generated according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and all workers are clustered under these constraints to form a dominating set composed of the cluster head computing nodes. Each worker has exactly one role: it is either a cluster head computing node or a node governed by a selected cluster head computing node.
Further, generating the dominating set from the constraints includes: forming an undirected graph from the actual link conditions of all workers, and selecting from it a degree-constrained minimum dominating set. "Degree-constrained" means the number of workers each cluster head can dominate is bounded, while the dominating set formed by the cluster heads is kept as small as possible.
Referring to fig. 2 by way of example, fig. 2 is an environment platform of this embodiment, where the environment platform includes 1 server and 4 local computing nodes (workers), where respective bandwidth resources are given.
First, the clustering algorithm selects workers 3 and 4 as cluster head computing nodes according to the bandwidth resources of each worker and the server, and assigns workers 1 and 2 to be dominated by worker 3. On the actual links, the path from worker 4 to worker 3 is long, so worker 3 does not dominate worker 4.
And S2, deploying and configuring an aggregation program on all cluster head computing nodes and servers, and configuring and deploying a gradient receiving program on all computing nodes.
Specifically, in this embodiment, an XDP-based aggregation program is deployed on the selected cluster head computing node No. 3 and cluster head computing node No. 4 and the server, and waits for the arrival of corresponding gradient data.
Workers 1-4 also deploy an XDP-based fast gradient receiving program. The reception status of the receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification program is triggered to notify the user-mode program to fetch the received complete gradient data. The notification program is designed as follows: the user-mode program creates a server listening on a UDP port; the receiving program rewrites the content of the last received gradient-fragment IP packet into a UDP packet, sets the corresponding destination address and destination port, and passes the packet up to the kernel protocol stack, thereby informing the user program that gradient reception is complete.
In addition, each cluster head computing node deploys a TC-based aggregation program, dedicated to aggregating the gradient data that the cluster head itself sends (TC hooks into the egress path, which XDP does not cover).
In this embodiment, the XDP technology aggregates gradient packets at the NIC driver without involving the operating system's kernel network protocol stack, which effectively improves gradient aggregation performance and reduces the forwarding delay of pre-aggregation.
And S3, sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation.
The non-cluster-head computing nodes are other computing nodes which are not cluster-head computing nodes in all the computing nodes.
Specifically, after each round of model training ends, each worker in a cluster divides its gradient data into several fragments, encapsulates each fragment in an IP packet, and sends the packets to its cluster head computing node for gradient pre-aggregation; the cluster head computing node pre-aggregates its own gradient data locally.
Illustratively, the packet structure is shown in Fig. 3: each IP packet carries an 8-byte custom header and a 1024-byte gradient fragment, the custom header consisting of a 4-byte device number field and a 4-byte gradient fragment number.
The storage structure and aggregation flow inside a cluster head are shown in Fig. 4. As cluster head computing node 3 receives, in turn, the fragment from worker 1, its own local fragment, and the fragment from worker 2, it accumulates them into the slot for fragment number 0. Once 3 copies have been aggregated, the gradient fragment is forwarded to the server. Because the arrival times of different gradient packets are unpredictable, concurrent access to the storage is possible, so a spin lock enforces mutually exclusive access to the stored data.
And S4, sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation.
Specifically, in this embodiment the server aggregates the gradient data from workers 3 and 4. The aggregation procedure is similar to that of S3.
And S5, sending the aggregated gradient data back to the cluster head computing node through the aggregation program of the server, so that the cluster head computing node sends the gradient data to other computing nodes in the cluster.
In this embodiment, when the server has aggregated the same fragment from all cluster heads, it uses XDP-Multiredirect-based multicast, which bypasses the protocol stack, to send the aggregated gradient fragment to all cluster heads; each cluster head then forwards it to the workers it dominates in the same manner.
Specifically, the server multicasts as shown in Fig. 5: it sends the globally aggregated gradient data to workers 3 and 4, and worker 3 then forwards the gradient data to workers 1 and 2. In the multicast design of Fig. 5, XDP-Multiredirect replicates the packet to as many virtual port pairs as there are cluster heads; the XDP program on each peer virtual port rewrites the packet's destination address to the address of the corresponding cluster head, and the packets are finally sent out through the physical port.
S6, all computing nodes monitor the reception status of the receiving program; when packet loss occurs, they detect the loss and start the packet-loss retransmission program.
This embodiment applies the idea of timeout retransmission, triggering packet-loss detection with a time threshold. The check starts from the worker's own reception status and traces back to its parent node, so that the cause of the packet loss is determined globally and retransmission is performed at minimum cost.
Specifically, a server listening on a UDP port is created in the user mode of every worker; the receiving program rewrites the content of the last received gradient-fragment IP packet into a UDP packet, sets the corresponding destination address and destination port, and passes the packet up to the kernel protocol stack, informing the user program that gradient reception is complete. When the configured UDP wait-time threshold is exceeded, the packet-loss status is checked from the worker up toward the server, and retransmission begins.
The embodiment of the invention also provides a gradient aggregation accelerating device based on clustering and XDP technology, which is characterized in that the gradient aggregation accelerating device is configured in a controller in a distributed model training platform, and the platform also comprises a server and a plurality of computing nodes; the device comprises:
the cluster head computing node determining module is used for selecting cluster head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes around the cluster head computing nodes;
the aggregation program deployment module is used for deploying and configuring aggregation programs on all cluster head computing nodes and the server, and deploying gradient receiving programs on all the computing nodes;
the aggregation module is used for sending the local gradient data of the non-cluster-head computing node to a designated cluster-head computing node for gradient pre-aggregation; the non-cluster-head computing node is a computing node which is not a cluster-head computing node in all computing nodes;
and for sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation.
And the gradient issuing module is used for sending the aggregated gradient data back to the cluster head computing node through an aggregation program of the server so that the cluster head computing node sends the gradient data to other computing nodes in the cluster.
And the reception monitoring module is used for monitoring the reception status of the receiving program on all computing nodes, detecting the packet-loss status when packet loss occurs, and starting the packet-loss retransmission program.
Wherein the cluster head computing node determination module is configured to:
s11, selecting a cluster head computing node for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
s12, according to the forwarding flow requirement of the local gradient and the actual idle resources of each cluster head computing node, assigning a unique cluster head computing node for a non-cluster head computing node, and sending the gradient data of the non-cluster head computing node to the corresponding cluster head computing node for pre-aggregation.
The S12 is specifically configured to perform:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and clustering all the computing nodes under these constraints to form a dominating set; the dominating set is composed of the cluster head computing nodes.
Specifically, clustering all the computing nodes according to the constraint conditions and forming a dominating set includes:
forming an undirected graph from the actual link conditions of all the computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
The aggregation program deployment module is specifically configured to:
s21, deploying and configuring an aggregation program based on the XDP in all cluster head computing nodes and servers;
S22, deploying XDP-based gradient receiving programs on all the computing nodes, wherein the reception status of each gradient receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification program is triggered to notify the user-mode program to fetch the received complete gradient data.
The implementation steps of the notification program include:
the user-mode program creates a server listening on a UDP port; the receiving program rewrites the content of the last received gradient-fragment IP packet into a UDP packet, sets the corresponding destination address and destination port, and passes the packet up to the kernel protocol stack, thereby informing the user program that gradient reception is complete.
The aggregation module is specifically configured to:
after each round of model training ends, divide the non-cluster-head computing node's gradient data into several fragments, encapsulate each fragment in an IP packet, and send the packets to the corresponding cluster head computing node for gradient pre-aggregation.
The gradient issuing module is specifically configured to:
when the server finishes aggregating the same fragment from all cluster heads, send the aggregated gradient fragment to all cluster head computing nodes in an XDP-based multicast mode;
and have each cluster head computing node send the received aggregated gradient fragments to the computing nodes it dominates, also in an XDP-based multicast mode.
The gradient aggregation accelerating device based on the clustering and XDP technologies, provided by the embodiment of the invention, can execute the gradient aggregation accelerating method based on the clustering and XDP technologies, provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
The above examples merely illustrate the technical idea and features of the present invention; their purpose is to enable those skilled in the art to understand and implement the invention, not to limit its scope of protection. All equivalent changes and modifications made according to the spirit of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A gradient aggregation acceleration method based on clustering and XDP technology, characterized in that the method is executed by a controller in a distributed model training platform, the platform further comprising a server and a plurality of computing nodes; the method comprises the following steps:
S1, determining cluster-head computing nodes for pre-aggregation from the plurality of computing nodes, and clustering all the computing nodes according to the cluster-head computing nodes;
S2, deploying and configuring aggregation programs on all cluster-head computing nodes and on the server, and deploying and configuring gradient receiving programs on all computing nodes;
S3, sending the local gradient data of each non-cluster-head computing node to its designated cluster-head computing node for gradient pre-aggregation, a non-cluster-head computing node being any computing node that is not a cluster-head computing node;
S4, sending the pre-aggregated gradient data to the server through the cluster-head computing nodes, so that final gradient aggregation is carried out by the server;
and S5, sending the aggregated gradient data back to the cluster-head computing nodes through the aggregation program of the server, so that each cluster-head computing node sends the gradient data to the other computing nodes in its cluster.
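Steps S1 to S5 amount to a two-level (hierarchical) aggregation; a minimal numeric sketch of that data flow, with an illustrative cluster layout, is:

```python
def hierarchical_aggregate(clusters):
    """clusters: {cluster_head: [gradient vectors of its member nodes]}.
    Stage 1 (S3): each cluster head pre-aggregates its members' gradients.
    Stage 2 (S4): the server sums the pre-aggregated results.
    Returns the global sum that S5 distributes back through the heads."""
    pre = {}
    for head, grads in clusters.items():              # pre-aggregation at heads
        pre[head] = [sum(vals) for vals in zip(*grads)]
    return [sum(vals) for vals in zip(*pre.values())]  # final aggregation at server
```

With h cluster heads and n nodes, the server receives h flows instead of n, which is the bandwidth saving the clustering is designed to achieve.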
2. The method according to claim 1, further comprising, after the S5:
S6, all computing nodes monitor the reception status of the receiving program, and when packet loss occurs, detect the loss and start the packet-loss retransmission program.
3. The method according to claim 1, wherein the S1 includes:
s11, selecting a cluster head computing node for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
and S12, assigning a unique cluster-head computing node to each non-cluster-head computing node according to the forwarding traffic demand of the local gradient and the actual idle resources of each cluster-head computing node, so that the gradient data of the non-cluster-head computing node is sent to the corresponding cluster-head computing node for pre-aggregation.
4. The method according to claim 3, wherein the S12 specifically comprises:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradient and the actual idle resources of each cluster-head computing node, and clustering all the computing nodes according to the constraint conditions to form a dominating set, the dominating set being composed of the cluster-head computing nodes.
5. The method of claim 4, wherein clustering all compute nodes according to the constraints and forming a dominating set comprises:
forming an undirected graph according to the actual link conditions of all computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
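Finding a degree-constrained minimum dominating set is NP-hard in general; a simple greedy heuristic over the link graph (an illustrative stand-in, not the patent's actual selection algorithm) might look like:

```python
def greedy_dominating_set(adj, max_degree):
    """adj: {node: set(neighbors)}, an undirected link graph.
    Greedily pick cluster heads, each managing at most max_degree
    newly covered neighbors, until every node is dominated."""
    uncovered = set(adj)
    heads = {}
    while uncovered:
        # pick the node that would newly dominate the most nodes
        best = max(adj, key=lambda n: len((adj[n] | {n}) & uncovered))
        members = list((adj[best] & uncovered) - {best})[:max_degree]
        heads[best] = heads.get(best, []) + members
        uncovered -= set(members) | {best}
    return heads  # cluster head -> the compute nodes it manages
```

The `max_degree` cap plays the role of the idle-resource constraint: a head with little spare bandwidth is prevented from absorbing too many members.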
6. The method according to claim 1, wherein the S2 specifically includes:
s21, deploying and configuring an aggregation program based on the XDP in all cluster head computing nodes and servers;
S22, deploying an XDP-based gradient receiving program on every computing node, wherein the reception state of the gradient receiving program is maintained by a packet-reception counter, and when the counter reaches a set value, a notification program is triggered to notify the user-mode program to retrieve the complete received gradient data.
7. The method of claim 6, wherein the step of implementing the notification procedure comprises:
the user mode establishes a server that listens on a UDP port; the receiving program rewrites the content of the last received gradient-fragment IP packet into a UDP packet, sets the corresponding destination address and destination port, and passes the packet up to the kernel protocol stack, thereby informing the user program that gradient reception is complete.
8. The method according to claim 1, wherein the S3 specifically includes:
after each round of model training finishes, the non-cluster-head computing node divides its gradient data into several fragments, encapsulates each fragment in an IP packet, and sends the packets to the corresponding cluster-head computing node for gradient pre-aggregation.
9. The method according to claim 1, wherein the S5 specifically includes:
when the server has finished aggregating the same fragment from all cluster heads, sending the aggregated gradient fragment to all cluster-head computing nodes via XDP-based multicast;
and each cluster-head computing node sending the aggregated gradient fragments it receives to the computing nodes it manages, likewise via XDP-based multicast.
10. A gradient aggregation accelerating device based on clustering and XDP technology is characterized by being configured in a controller in a distributed model training platform, wherein the platform further comprises a server and a plurality of computing nodes; the device comprises:
the cluster-head computing node determining module is used for selecting cluster-head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes according to the cluster-head computing nodes;
the aggregation program deployment module is used for deploying and configuring aggregation programs on all cluster head computing nodes and the server, and deploying gradient receiving programs on all the computing nodes;
the aggregation module is used for sending the local gradient data of the non-cluster-head computing node to a designated cluster-head computing node for gradient pre-aggregation; the non-cluster-head computing node is a computing node which is not a cluster-head computing node in all computing nodes;
sending the pre-aggregated gradient data to the server through the cluster-head computing node, so that final gradient aggregation is carried out by the server;
and the gradient issuing module is used for sending the aggregated gradient data back to the cluster head computing node through an aggregation program of the server so that the cluster head computing node sends the gradient data to other computing nodes in the cluster.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210676787.8A CN115086437B (en) | 2022-06-15 | 2022-06-15 | Gradient polymerization acceleration method and device based on clustering and XDP technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115086437A true CN115086437A (en) | 2022-09-20 |
CN115086437B CN115086437B (en) | 2023-08-22 |
Family
ID=83254481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210676787.8A Active CN115086437B (en) | 2022-06-15 | 2022-06-15 | Gradient polymerization acceleration method and device based on clustering and XDP technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115086437B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033078A (en) * | 2018-01-12 | 2019-07-19 | 华为技术有限公司 | A kind of computing system and method based on tree topology |
CN110889509A (en) * | 2019-11-11 | 2020-03-17 | 安徽超清科技股份有限公司 | Joint learning method and device based on gradient momentum acceleration |
CN110992432A (en) * | 2019-10-28 | 2020-04-10 | 北京大学 | Depth neural network-based minimum variance gradient quantization compression and image processing method |
CN112733932A (en) * | 2021-01-08 | 2021-04-30 | 北京匠数科技有限公司 | Model accelerated training method and device based on training data similarity aggregation |
CN112862111A (en) * | 2021-04-26 | 2021-05-28 | 之江实验室 | Method and device for accelerating gradient convergence of distributed machine learning |
CN113315604A (en) * | 2021-05-25 | 2021-08-27 | 电子科技大学 | Adaptive gradient quantization method for federated learning |
CN113642736A (en) * | 2021-07-29 | 2021-11-12 | 中国科学院计算技术研究所 | Gradient polymerization method and system based on cold-heat separation |
Non-Patent Citations (2)
Title |
---|
NGUYEN VAN TU ET AL.: "Accelerating Virtual Network Functions With Fast-Slow Path Architecture Using eXpress Data Path", 《IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT》 * |
LI JIANBO ET AL.: "A Clustering Algorithm for Densely Deployed Sensor Networks", 《Journal of Computer Research and Development》 * |
Also Published As
Publication number | Publication date |
---|---|
CN115086437B (en) | 2023-08-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||