CN115086437A - Gradient aggregation acceleration method and device based on clustering and XDP technology - Google Patents

Gradient aggregation acceleration method and device based on clustering and XDP technology

Info

Publication number
CN115086437A
Authority
CN
China
Prior art keywords
gradient
computing node
cluster
cluster head
aggregation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210676787.8A
Other languages
Chinese (zh)
Other versions
CN115086437B (en)
Inventor
徐宏力
杨鹏
赵功名
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Original Assignee
Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Institute Of Higher Studies University Of Science And Technology Of China
Priority to CN202210676787.8A
Publication of CN115086437A
Application granted
Publication of CN115086437B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/164 Adaptation or special uses of UDP protocol
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a gradient aggregation acceleration method and device based on clustering and XDP technology. The method comprises the following steps: determining cluster head computing nodes for pre-aggregation from a plurality of computing nodes, and clustering all the computing nodes according to the cluster head computing nodes; deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes; sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation; and sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation. The technical scheme of the invention effectively relieves the bandwidth bottleneck of the server, avoids the delay caused by excessive network hops, and finally improves the speed and performance of gradient aggregation.

Description

Gradient aggregation acceleration method and device based on clustering and XDP technology
Technical Field
The invention belongs to the field of quality of service (QoS), and particularly relates to a gradient aggregation acceleration method and device based on clustering and the eXpress Data Path (XDP) technology.
Background
In recent years, the revival of deep neural networks has brought breakthroughs to many application fields, from computer vision and natural language processing to recommendation systems. To achieve better performance, ever larger data sets are used to train deeper and more complex deep neural networks, causing the required computation to double every 3-4 months. Therefore, deploying a model training task in parallel on multiple machines, known as distributed DNN training, is a practical way to increase computational resources and shorten training time.
To support distributed model training, there are two classic system architectures based on data parallelism: Parameter Server (PS) and All-Reduce (AR). In the PS architecture, the machines in the cluster are mainly divided into two types: workers and servers. During each training iteration, the workers send their local gradients to the server for aggregation and then obtain the aggregated gradient data from the server. In the existing PS architecture, the gradient traffic between the server and the workers quickly exhausts the server's bandwidth resources, which in some real training runs makes the actual training time up to 8.7 times the ideal time.
To avoid the communication bottleneck of PS, studies have proposed AR architectures, in which gradient aggregation is performed in a decentralized manner. Ring all-reduce is the most popular AR algorithm; its operation can be decomposed into a scatter-reduce phase and an all-gather phase. After the scatter-reduce phase, each worker has aggregated a different portion of the gradient. After the all-gather phase, each worker has multicast its aggregated portion, so that global aggregation is finally achieved. Although AR performs gradient aggregation in a ring fashion to solve the scalability problem, it requires more network hops to complete aggregation as more workers join distributed model training (with N workers, the ring requires 2(N-1) sequential communication steps per aggregation round). In some practical training scenarios, AR takes more than 1.5 times the optimal training time.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a gradient aggregation acceleration method and device based on clustering and XDP technologies, which accelerate the gradient aggregation of distributed model training.
The technical scheme of the invention is as follows:
In a first aspect, the invention provides a gradient aggregation acceleration method based on clustering and XDP technologies, which is executed by a controller in a distributed model training platform, wherein the platform further comprises a server and a plurality of computing nodes; the method comprises the following steps:
S1, determining cluster head computing nodes for pre-aggregation from the plurality of computing nodes, and clustering all the computing nodes according to the cluster head computing nodes;
S2, deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes;
S3, sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation; a non-cluster-head computing node is any computing node that is not a cluster head computing node;
S4, sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and S5, sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster.
Optionally, after S5, the method further includes:
S6, all computing nodes monitor the receiving status of the receiving program; when packet loss occurs, they detect the loss and start the packet-loss retransmission procedure.
Optionally, S1 includes:
S11, selecting cluster head computing nodes for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
S12, assigning a unique cluster head computing node to each non-cluster-head computing node according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, so that the gradient data of the non-cluster-head computing node is sent to the corresponding cluster head computing node for pre-aggregation.
Optionally, S12 specifically includes:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and clustering all the computing nodes according to the constraint conditions to form a dominating set;
the dominating set is composed of the cluster head computing nodes.
Optionally, clustering all the computing nodes according to the constraint conditions and forming a dominating set includes:
forming an undirected graph according to the actual link conditions of all the computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
Optionally, S2 specifically includes:
S21, deploying and configuring an XDP-based aggregation program on all cluster head computing nodes and on the server;
S22, deploying an XDP-based gradient receiving program on all computing nodes, wherein the receiving status of the gradient receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification routine is triggered to inform the user-space program to fetch the complete gradient data that has been received.
Optionally, the implementation of the notification routine includes:
the user space establishes a server listening on a UDP port; the receiving program rewrites the content of the last received gradient fragment IP packet, turns it into a UDP packet, sets the corresponding destination address and destination port, and passes it up to the kernel protocol stack, thereby informing the user-space program that gradient reception is complete.
Optionally, S3 specifically includes:
after each round of model training is finished, the non-cluster-head computing node divides its gradient data into a plurality of fragments, encapsulates each fragment in an IP data packet, and sends it to the corresponding cluster head computing node for gradient pre-aggregation.
Optionally, S5 specifically includes:
when the server finishes aggregating the same fragment from all cluster heads, sending the aggregated gradient fragment to all cluster head computing nodes in an XDP-based multicast mode;
and each cluster head computing node sends the received aggregated gradient fragment to the computing nodes it dominates in an XDP-based multicast mode.
In a second aspect, an embodiment of the invention further provides a gradient aggregation acceleration apparatus based on clustering and XDP technologies, configured in a controller in a distributed model training platform, wherein the platform further comprises a server and a plurality of computing nodes; the apparatus comprises:
a cluster head computing node determining module, used for selecting cluster head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes according to the cluster head computing nodes;
an aggregation program deployment module, used for deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes;
an aggregation module, used for sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation, wherein a non-cluster-head computing node is any computing node that is not a cluster head computing node,
and for sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and a gradient distribution module, used for sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster.
Compared with the prior art, the invention has the following advantages:
1. The gradient aggregation of distributed deep learning is accelerated by means of clustering, which makes full use of the bandwidth and computing resources of the workers; compared with directly adding more hardware resources such as servers, this approach has lower cost and higher universality.
2. The invention performs gradient pre-aggregation on the cluster heads, making full use of host-side memory resources and overcoming the drawback that a programmable switch cannot cache a complete gradient due to insufficient memory. In addition, compared with parameter aggregation, the simple gradient aggregation operation is more CPU-friendly, and the gradient aggregation performance is not limited by memory bandwidth.
3. The pre-aggregation program of the cluster head computing node is implemented with the eXpress Data Path (XDP) technology, which performs the aggregation operation on gradient data packets at the network card driver without involving the kernel network protocol stack of the operating system, thereby effectively improving gradient aggregation performance and reducing the forwarding delay of pre-aggregation. In addition, in order to exploit parallel processing across multiple NIC queues, the invention adds an atomic (spin) lock design for parallel processing to prevent concurrency conflicts.
4. The invention designs a clustering algorithm based on a degree-constrained minimum dominating set, which selects as few cluster head computing nodes as possible according to the actual idle resources of all workers, including computing and bandwidth resources. Each cluster head dominates a subset of the workers, so that every worker either is a cluster head or is dominated by one. When worker bandwidths differ, the algorithm maximizes worker resource utilization while relieving the bandwidth pressure on the server.
5. The invention also uses XDP to distribute the aggregated gradients, so that the whole gradient aggregation process bypasses the protocol stack and fully exploits XDP's fast processing and forwarding. In addition, a set of active-reporting and reliable-aggregation mechanisms is designed, so that the XDP technology, which natively runs only in kernel space on a best-effort basis, gains reliability and can interact with user-space programs.
The invention discloses a gradient aggregation acceleration method based on clustering and XDP technology. First, a clustering algorithm selects the cluster head computing nodes that perform pre-aggregation, and an aggregation program based on the eXpress Data Path (XDP) is deployed on the cluster head computing nodes and the server; the cluster head computing nodes additionally deploy a TC-based aggregation program. The workers in each cluster send their trained gradient data to their cluster head computing node, which pre-aggregates it and sends the pre-aggregated gradients to the server for final aggregation. Finally, the server sends the globally aggregated gradient data back to all workers along the original paths. The method effectively relieves the bandwidth bottleneck of the server, avoids the delay caused by excessive network hops, and finally improves the speed and performance of gradient aggregation. In addition, the invention requires no extra hardware equipment, and therefore has high universality, low cost and broad application prospects.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is a flow chart of the gradient aggregation acceleration method based on clustering and XDP technology in the present invention;
FIG. 2 is a schematic diagram of a gradient aggregation and distribution path after worker clustering according to the present invention;
FIG. 3 is a diagram of a gradient fragment packet structure according to the present invention;
FIG. 4 is a schematic diagram of the storage structure and aggregation flow of the XDP program of the present invention;
FIG. 5 is a schematic diagram of multicast based on XDP-Multiredirect in the present invention.
Detailed Description
The above-described scheme is further illustrated below with reference to a specific embodiment. It should be understood that this embodiment is for illustrative purposes only and is not intended to limit the scope of the present invention. The conditions used in the embodiment may be further adjusted to the specific setting, and unspecified conditions are the usual ones of routine experiments.
Embodiment
This embodiment provides a gradient aggregation acceleration method based on clustering and the XDP technology. The method runs in a distributed model training platform in the cloud and is executed by a controller in the platform; the platform consists of a server and a plurality of local computing nodes (workers). The flow of this embodiment is shown in FIG. 1, and the method includes the following steps:
s1, determining a cluster head computing node for pre-polymerization from the plurality of computing nodes, and clustering all the computing nodes according to the cluster head computing node.
In this embodiment, a cluster head computing node for performing pre-aggregation is selected from the plurality of computing nodes according to network resource conditions of the computing nodes and the server; and assigning a unique cluster head computing node for a non-cluster head computing node according to the forwarding flow demand of the local gradient and the actual idle resource of each cluster head computing node, so as to send the gradient data of the non-cluster head computing node to the corresponding cluster head computing node for pre-aggregation.
Specifically, generating corresponding constraint conditions according to the forwarding flow demand of the local gradient and actual idle resources of each cluster head computing node, clustering all the wokers according to the constraint conditions and forming a dominating set, wherein the dominating set is formed by the cluster head computing nodes. Each worker only has a unique identity, and either the cluster head computing node or the computing node governed by the selected cluster head computing node.
Further, generating a dominating set from the constraints, including: forming an undirected graph according to the actual link conditions of all the workers, thereby selecting a minimum control set with limited degree, wherein the limited degree means that the number of the workers which can be controlled by each cluster head is limited, and simultaneously determining the control set consisting of the cluster heads as few as possible.
Referring to FIG. 2 by way of example, FIG. 2 shows the environment platform of this embodiment, which includes 1 server and 4 local computing nodes (workers), with their respective bandwidth resources given.
First, according to the bandwidth resources of each worker and the server, the clustering algorithm selects workers No. 3 and No. 4 as cluster head computing nodes, and worker No. 3 dominates workers No. 1 and No. 2. On the concrete links, the path from worker No. 4 to worker No. 3 is long, so worker No. 3 does not dominate worker No. 4.
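A minimal C sketch of one possible greedy realization of such a degree-constrained dominating-set clustering is given below, using the four-worker example of FIG. 2. The patent does not provide pseudo-code; the adjacency matrix, per-node capacities and greedy selection rule are illustrative assumptions only.

#include <stdio.h>
#include <string.h>

/* Greedy sketch of a degree-constrained minimum dominating set.
 * Workers are indexed 0..3 (corresponding to No. 1..No. 4 in FIG. 2). */
#define N 4

int adj[N][N] = {                /* 1 = usable link between two workers   */
    {0, 1, 1, 0},
    {1, 0, 1, 0},
    {1, 1, 0, 0},
    {0, 0, 0, 0},                /* worker No. 4 is far from the others   */
};
int cap[N]  = {0, 0, 2, 0};      /* max workers each node may dominate,   */
                                 /* derived from its idle resources       */
int head_of[N];                  /* result: cluster head of each worker   */

int main(void)
{
    int covered[N] = {0};
    memset(head_of, -1, sizeof(head_of));

    for (;;) {
        int best = -1, best_gain = 0;
        for (int v = 0; v < N; v++) {            /* candidate cluster head */
            if (head_of[v] >= 0 && head_of[v] != v)
                continue;                        /* already dominated      */
            int gain = !covered[v];              /* a head covers itself   */
            for (int u = 0; u < N && gain - !covered[v] < cap[v]; u++)
                if (u != v && adj[v][u] && !covered[u])
                    gain++;
            if (gain > best_gain) { best_gain = gain; best = v; }
        }
        if (best < 0)
            break;                               /* every worker covered   */
        covered[best] = 1;
        head_of[best] = best;                    /* best becomes a head    */
        int taken = 0;
        for (int u = 0; u < N && taken < cap[best]; u++)
            if (u != best && adj[best][u] && !covered[u]) {
                covered[u] = 1;
                head_of[u] = best;               /* u dominated by best    */
                taken++;
            }
    }
    for (int v = 0; v < N; v++)
        printf("worker No. %d -> cluster head No. %d\n", v + 1, head_of[v] + 1);
    return 0;
}

On this input the sketch selects worker No. 3 as the head of workers No. 1 and No. 2 and leaves worker No. 4 as its own cluster head, matching FIG. 2.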
S2, deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes.
Specifically, in this embodiment, an XDP-based aggregation program is deployed on the selected cluster head computing nodes No. 3 and No. 4 and on the server, and waits for the corresponding gradient data to arrive.
Workers No. 1 to No. 4 also deploy an XDP-based fast gradient receiving program. The receiving status of the receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification routine is triggered to inform the user-space program to fetch the complete gradient data that has been received. The notification routine is designed as follows: the user space establishes a server listening on a UDP port; the receiving program rewrites the content of the last received gradient fragment IP packet, turns it into a UDP packet, sets the corresponding destination address and destination port, and passes it up to the kernel protocol stack, thereby informing the user-space program that gradient reception is complete.
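The following XDP (eBPF C) sketch illustrates one plausible realization of the packet-reception counter and the notification routine described above: since the custom header and a UDP header are both 8 bytes, the last fragment can be turned into a UDP datagram in place and handed to the kernel stack. The port number, fragment total and the assumption of an option-less IPv4 header are illustrative; copying the fragment payload into the gradient buffer is omitted, and building it requires a recent clang and kernel (BPF atomics with a fetched return value).

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TOTAL_FRAGS  4096          /* expected fragments per gradient (illustrative) */
#define NOTIFY_PORT  5555          /* UDP port of the user-space listener            */

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);          /* packet-reception counter                       */
} rx_count SEC(".maps");

static __always_inline __u16 ip_csum(struct iphdr *ip)
{
    __u32 sum = 0;
    __u16 *p = (__u16 *)ip;

    ip->check = 0;
#pragma unroll
    for (int i = 0; i < (int)(sizeof(*ip) / 2); i++)
        sum += p[i];
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return ~sum;
}

SEC("xdp")
int xdp_gradient_recv(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->ihl != 5)
        return XDP_PASS;

    /* ... the 1024-byte fragment would be copied into the gradient buffer
     * indexed by its sequence number here (omitted for brevity) ...       */

    __u32 zero = 0;
    __u64 *cnt = bpf_map_lookup_elem(&rx_count, &zero);
    if (!cnt)
        return XDP_PASS;
    __u64 n = __sync_fetch_and_add(cnt, 1) + 1;   /* atomic across RX queues */
    if (n < TOTAL_FRAGS)
        return XDP_DROP;           /* absorbed; user space is not woken up   */

    /* Last fragment: overlay a UDP header on the 8-byte custom header and
     * let the kernel stack deliver the packet to the listening socket.     */
    struct udphdr *udp = (void *)(ip + 1);
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;
    udp->source  = bpf_htons(NOTIFY_PORT);
    udp->dest    = bpf_htons(NOTIFY_PORT);
    udp->len     = bpf_htons(bpf_ntohs(ip->tot_len) - sizeof(*ip));
    udp->check   = 0;              /* UDP checksum is optional over IPv4     */
    ip->protocol = IPPROTO_UDP;
    ip->check    = ip_csum(ip);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";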
In addition, each cluster head computing node additionally deploys a TC-based aggregation program, which is dedicated to aggregating the gradient data sent by the cluster head computing node itself.
In this embodiment, the XDP technology aggregates the gradient data packets at the network card driver without involving the operating system's kernel network protocol stack, which effectively improves the gradient aggregation performance and reduces the forwarding delay of pre-aggregation.
S3, sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation.
The non-cluster-head computing nodes are those computing nodes that are not cluster head computing nodes.
Specifically, after each round of model training is finished, each non-cluster-head worker in a cluster divides its gradient data into a plurality of fragments, encapsulates each fragment in an IP data packet and sends it to its cluster head computing node for gradient pre-aggregation; the cluster head computing node pre-aggregates its own gradient data locally.
Illustratively, the structure of the data packet is shown in FIG. 3: the IP data packet carries an 8-byte custom header and a 1024-byte gradient fragment, and the custom header comprises a 4-byte device number field and a 4-byte gradient fragment number field.
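For illustration, the fragment layout of FIG. 3 can be written as the following C structure; the field names are assumptions, since the text only fixes the field sizes.

#include <stdint.h>

/* Layout of one gradient fragment packet (after the IP header).
 * Field names are illustrative; only the sizes are given in the text. */
struct grad_frag_hdr {
    uint32_t worker_id;   /* 4-byte device (worker) number              */
    uint32_t frag_seq;    /* 4-byte gradient fragment sequence number   */
} __attribute__((packed));

#define GRAD_FRAG_PAYLOAD 1024

struct grad_frag_pkt {
    struct grad_frag_hdr hdr;                  /* 8-byte custom header        */
    uint8_t payload[GRAD_FRAG_PAYLOAD];        /* 1024-byte gradient fragment */
} __attribute__((packed));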
The storage structure and aggregation flow of gradient fragments in a cluster head are shown in FIG. 4. When cluster head computing node No. 3 receives, in turn, the fragment with sequence number 0 from worker No. 1, its own local fragment and the fragment from worker No. 2, it accumulates them into the storage slot indexed by that sequence number. After the third accumulation, the aggregated gradient fragment is forwarded to the server. Because the arrival times of the different gradient data cannot be predicted, concurrent access to the storage is possible, so a spin lock is added to make data access and storage mutually exclusive.
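A hedged XDP (eBPF C) sketch of this per-fragment accumulation is shown below. Two assumptions are not stated in the patent: gradient values are carried as 32-bit fixed-point integers (eBPF cannot perform floating-point arithmetic), and a cluster has three contributors as in FIG. 4. Byte-order handling is omitted, and depending on the kernel version the verifier may require the loop inside the spin-lock section to be replaced by per-element atomic adds.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>

#define FRAG_WORDS   256      /* 1024-byte fragment = 256 x 32-bit values      */
#define CLUSTER_SIZE 3        /* local gradient + two dominated workers        */
#define MAX_FRAGS    4096     /* fragments per model, illustrative             */

struct grad_hdr {             /* 8-byte custom header after the IP header      */
    __u32 worker_id;
    __u32 frag_seq;
};

struct agg_slot {
    struct bpf_spin_lock lock;   /* mutual exclusion across NIC RX queues      */
    __u32 count;                 /* how many contributors accumulated so far   */
    __s32 sum[FRAG_WORDS];       /* running fixed-point sum                    */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_FRAGS);
    __type(key, __u32);
    __type(value, struct agg_slot);
} agg_map SEC(".maps");

SEC("xdp")
int xdp_gradient_agg(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->ihl != 5)
        return XDP_PASS;

    struct grad_hdr *gh = (void *)(ip + 1);
    __s32 *grad = (__s32 *)(gh + 1);
    if ((void *)(grad + FRAG_WORDS) > data_end)
        return XDP_PASS;                 /* not a complete gradient fragment   */

    __u32 key = gh->frag_seq;
    struct agg_slot *slot = bpf_map_lookup_elem(&agg_map, &key);
    if (!slot)
        return XDP_PASS;

    __u32 done;
    bpf_spin_lock(&slot->lock);
    for (int i = 0; i < FRAG_WORDS; i++)
        slot->sum[i] += grad[i];         /* accumulate this contributor        */
    done = ++slot->count;
    bpf_spin_unlock(&slot->lock);

    if (done == CLUSTER_SIZE) {
        /* All contributors accumulated: a real implementation would rewrite
         * the addresses here and XDP_TX / redirect the aggregated fragment
         * towards the server.                                                 */
        return XDP_PASS;
    }
    return XDP_DROP;                     /* fragment absorbed into the sum     */
}

char _license[] SEC("license") = "GPL";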
S4, sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation.
Specifically, in this embodiment, the server aggregates the gradient data from workers No. 3 and No. 4; the aggregation procedure is similar to that of S3.
S5, sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster.
In this embodiment, when the server has finished aggregating the same fragment from all cluster heads, it sends the aggregated gradient fragment to all cluster heads using an XDP-Multiredirect-based multicast mode that does not pass through the protocol stack; each cluster head then forwards the fragment to the workers it dominates in the same way.
Specifically, the server performs multicast in the manner shown in FIG. 5: it sends the globally aggregated gradient data to workers No. 3 and No. 4, and worker No. 3 then forwards it to workers No. 1 and No. 2. In the multicast design of FIG. 5, the XDP-Multiredirect mechanism copies and forwards the packet to as many virtual port pairs as there are cluster heads; the XDP program on each peer virtual port rewrites the destination address of its copy to the address of the corresponding cluster head, and finally the physical port sends the packets out.
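A minimal sketch of such an XDP multicast step is shown below, using the devmap broadcast feature available in recent Linux kernels (roughly 5.15 and later); the patent's XDP-Multiredirect design with virtual port pairs is analogous. The map contents (one virtual port per destination) would be installed by the user-space controller and are assumptions here.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 16);        /* one slot per destination (veth) port    */
    __type(key, __u32);
    __type(value, __u32);           /* ifindex of each virtual port            */
} mcast_ports SEC(".maps");

SEC("xdp")
int xdp_gradient_mcast(struct xdp_md *ctx)
{
    /* Clone the aggregated gradient fragment to every port in the map,
     * excluding the interface it arrived on.  The XDP program on each peer
     * virtual port then rewrites the destination MAC/IP for its cluster head
     * (or dominated worker) before the packet leaves the physical port.      */
    return bpf_redirect_map(&mcast_ports, 0,
                            BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
}

char _license[] SEC("license") = "GPL";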
S6, all computing nodes monitor the receiving status of the receiving program; when packet loss occurs, they detect the loss and start the packet-loss retransmission procedure.
This embodiment uses the idea of timeout retransmission and triggers packet-loss detection by setting a time threshold. The check starts from the worker's own receiving status and traces back to its parent node, so that the cause of the packet loss is judged globally and retransmission is performed at the minimum cost.
Specifically, a server listening on a UDP port is established in the user space of every worker; the receiving program rewrites the content of the last received gradient fragment IP packet, turns it into a UDP packet, sets the corresponding destination address and destination port, and passes it up to the kernel protocol stack, thereby informing the user-space program that gradient reception is complete. When the set UDP waiting-time threshold is exceeded, the packet-loss status is checked from the worker back towards the server, and retransmission is started.
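The user-space side of this design can be sketched as follows: a UDP listener waits for the completion notification with a receive timeout, and on timeout it inspects a per-fragment receive bitmap assumed to be pinned by the XDP receiver, in order to decide which fragments to request again. The pinned map path, port number, timeout value and fragment total are all illustrative assumptions.

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <bpf/bpf.h>            /* libbpf: bpf_obj_get(), bpf_map_lookup_elem()  */

#define NOTIFY_PORT 5555        /* UDP port the XDP receiver notifies            */
#define TOTAL_FRAGS 4096        /* expected gradient fragments (illustrative)    */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {
        .sin_family      = AF_INET,
        .sin_port        = htons(NOTIFY_PORT),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    /* Wait at most 200 ms for the "all fragments received" notification. */
    struct timeval tmo = { .tv_sec = 0, .tv_usec = 200 * 1000 };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));

    char buf[2048];
    if (recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL) > 0) {
        printf("gradient fully received\n");
        return 0;
    }

    /* Timeout: read the receive bitmap maintained by the XDP receiver
     * (assumed to be pinned at the path below) to find missing fragments,
     * then ask the parent node (cluster head or server) to retransmit.    */
    int map_fd = bpf_obj_get("/sys/fs/bpf/grad_recv_bitmap");
    for (unsigned int seq = 0; map_fd >= 0 && seq < TOTAL_FRAGS; seq++) {
        unsigned char got = 0;
        if (bpf_map_lookup_elem(map_fd, &seq, &got) == 0 && !got) {
            printf("fragment %u missing, requesting retransmission\n", seq);
            /* a real implementation would send a retransmit request here */
        }
    }
    return 1;
}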
An embodiment of the invention further provides a gradient aggregation acceleration apparatus based on clustering and XDP technologies, configured in a controller in a distributed model training platform, wherein the platform further comprises a server and a plurality of computing nodes; the apparatus comprises:
a cluster head computing node determining module, used for selecting cluster head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes according to the cluster head computing nodes;
an aggregation program deployment module, used for deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes;
an aggregation module, used for sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation, wherein a non-cluster-head computing node is any computing node that is not a cluster head computing node,
and for sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
a gradient distribution module, used for sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster;
and a receiving-status monitoring module, used for monitoring the receiving status of the receiving program on all computing nodes, detecting packet loss when it occurs and starting the packet-loss retransmission procedure.
The cluster head computing node determining module is configured to perform:
S11, selecting cluster head computing nodes for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
S12, assigning a unique cluster head computing node to each non-cluster-head computing node according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, so that the gradient data of the non-cluster-head computing node is sent to the corresponding cluster head computing node for pre-aggregation.
S12 is specifically configured to perform:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and clustering all the computing nodes according to the constraint conditions to form a dominating set; the dominating set is composed of the cluster head computing nodes.
Specifically, clustering all the computing nodes according to the constraint conditions and forming a dominating set includes:
forming an undirected graph according to the actual link conditions of all the computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
The aggregation program deployment module is specifically configured to perform:
S21, deploying and configuring an XDP-based aggregation program on all cluster head computing nodes and on the server;
S22, deploying an XDP-based gradient receiving program on all computing nodes, wherein the receiving status of the gradient receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification routine is triggered to inform the user-space program to fetch the complete gradient data that has been received.
The implementation of the notification routine includes:
the user space establishes a server listening on a UDP port; the receiving program rewrites the content of the last received gradient fragment IP packet, turns it into a UDP packet, sets the corresponding destination address and destination port, and passes it up to the kernel protocol stack, thereby informing the user-space program that gradient reception is complete.
The aggregation module is specifically configured to perform:
after each round of model training is finished, the non-cluster-head computing node divides its gradient data into a plurality of fragments, encapsulates each fragment in an IP data packet, and sends it to the corresponding cluster head computing node for gradient pre-aggregation.
The gradient distribution module is specifically configured to perform:
when the server finishes aggregating the same fragment from all cluster heads, sending the aggregated gradient fragment to all cluster head computing nodes in an XDP-based multicast mode;
and each cluster head computing node sends the received aggregated gradient fragment to the computing nodes it dominates in an XDP-based multicast mode.
The gradient aggregation acceleration apparatus based on clustering and XDP technologies provided by the embodiment of the invention can execute the gradient aggregation acceleration method based on clustering and XDP technologies provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed method.
The above embodiment is only intended to illustrate the technical idea and features of the present invention, so that those skilled in the art can understand and implement the invention; it does not limit the protection scope of the invention. All equivalent changes and modifications made according to the spirit of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A gradient aggregation acceleration method based on clustering and XDP technology, characterized in that the method is executed by a controller in a distributed model training platform, the platform further comprising a server and a plurality of computing nodes; the method comprises the following steps:
S1, determining cluster head computing nodes for pre-aggregation from the plurality of computing nodes, and clustering all the computing nodes according to the cluster head computing nodes;
S2, deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes;
S3, sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation; a non-cluster-head computing node is any computing node that is not a cluster head computing node;
S4, sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and S5, sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster.
2. The method according to claim 1, characterized in that, after S5, the method further comprises:
S6, all computing nodes monitor the receiving status of the receiving program; when packet loss occurs, they detect the loss and start the packet-loss retransmission procedure.
3. The method according to claim 1, characterized in that S1 comprises:
S11, selecting cluster head computing nodes for pre-aggregation from the computing nodes according to the network resource conditions of the computing nodes and the server;
and S12, assigning a unique cluster head computing node to each non-cluster-head computing node according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, so that the gradient data of the non-cluster-head computing node is sent to the corresponding cluster head computing node for pre-aggregation.
4. The method according to claim 3, characterized in that S12 specifically comprises:
generating corresponding constraint conditions according to the forwarding traffic demand of the local gradients and the actual idle resources of each cluster head computing node, and clustering all the computing nodes according to the constraint conditions to form a dominating set; the dominating set is composed of the cluster head computing nodes.
5. The method according to claim 4, characterized in that clustering all the computing nodes according to the constraint conditions and forming a dominating set comprises:
forming an undirected graph according to the actual link conditions of all the computing nodes, and selecting a degree-constrained minimum dominating set from the undirected graph as the dominating set.
6. The method according to claim 1, characterized in that S2 specifically comprises:
S21, deploying and configuring an XDP-based aggregation program on all cluster head computing nodes and on the server;
S22, deploying an XDP-based gradient receiving program on all computing nodes, wherein the receiving status of the gradient receiving program is maintained by a packet-reception counter; when the counter reaches a set value, a notification routine is triggered to inform the user-space program to fetch the complete gradient data that has been received.
7. The method according to claim 6, characterized in that the implementation of the notification routine comprises:
the user space establishes a server listening on a UDP port; the receiving program rewrites the content of the last received gradient fragment IP packet, turns it into a UDP packet, sets the corresponding destination address and destination port, and passes it up to the kernel protocol stack, thereby informing the user-space program that gradient reception is complete.
8. The method according to claim 1, characterized in that S3 specifically comprises:
after each round of model training is finished, the non-cluster-head computing node divides its gradient data into a plurality of fragments, encapsulates each fragment in an IP data packet, and sends it to the corresponding cluster head computing node for gradient pre-aggregation.
9. The method according to claim 1, characterized in that S5 specifically comprises:
when the server finishes aggregating the same fragment from all cluster heads, sending the aggregated gradient fragment to all cluster head computing nodes in an XDP-based multicast mode;
and each cluster head computing node sends the received aggregated gradient fragment to the computing nodes it dominates in an XDP-based multicast mode.
10. A gradient aggregation acceleration apparatus based on clustering and XDP technology, characterized in that it is configured in a controller in a distributed model training platform, the platform further comprising a server and a plurality of computing nodes; the apparatus comprises:
a cluster head computing node determining module, used for selecting cluster head computing nodes for pre-aggregation from the plurality of computing nodes and clustering all the computing nodes according to the cluster head computing nodes;
an aggregation program deployment module, used for deploying and configuring aggregation programs on all cluster head computing nodes and on the server, and deploying gradient receiving programs on all computing nodes;
an aggregation module, used for sending the local gradient data of the non-cluster-head computing nodes to their designated cluster head computing nodes for gradient pre-aggregation, wherein a non-cluster-head computing node is any computing node that is not a cluster head computing node,
and for sending the pre-aggregated gradient data to the server through the cluster head computing nodes, so that the server performs the final gradient aggregation;
and a gradient distribution module, used for sending the aggregated gradient data back to the cluster head computing nodes through the aggregation program of the server, so that each cluster head computing node sends the gradient data to the other computing nodes in its cluster.
CN202210676787.8A 2022-06-15 2022-06-15 Gradient aggregation acceleration method and device based on clustering and XDP technology Active CN115086437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210676787.8A CN115086437B (en) 2022-06-15 2022-06-15 Gradient aggregation acceleration method and device based on clustering and XDP technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210676787.8A CN115086437B (en) 2022-06-15 2022-06-15 Gradient aggregation acceleration method and device based on clustering and XDP technology

Publications (2)

Publication Number Publication Date
CN115086437A true CN115086437A (en) 2022-09-20
CN115086437B CN115086437B (en) 2023-08-22

Family

ID=83254481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210676787.8A Active CN115086437B (en) 2022-06-15 2022-06-15 Gradient aggregation acceleration method and device based on clustering and XDP technology

Country Status (1)

Country Link
CN (1) CN115086437B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033078A (en) * 2018-01-12 2019-07-19 华为技术有限公司 A kind of computing system and method based on tree topology
CN110992432A (en) * 2019-10-28 2020-04-10 北京大学 Depth neural network-based minimum variance gradient quantization compression and image processing method
CN110889509A (en) * 2019-11-11 2020-03-17 安徽超清科技股份有限公司 Joint learning method and device based on gradient momentum acceleration
CN112733932A (en) * 2021-01-08 2021-04-30 北京匠数科技有限公司 Model accelerated training method and device based on training data similarity aggregation
CN112862111A (en) * 2021-04-26 2021-05-28 之江实验室 Method and device for accelerating gradient convergence of distributed machine learning
CN113315604A (en) * 2021-05-25 2021-08-27 电子科技大学 Adaptive gradient quantization method for federated learning
CN113642736A (en) * 2021-07-29 2021-11-12 中国科学院计算技术研究所 Gradient polymerization method and system based on cold-heat separation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NGUYEN VAN TU ET AL.: "Accelerating Virtual Network Functions With Fast-Slow Path Architecture Using eXpress Data Path", 《IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT》 *
LI JIANBO ET AL.: "A Clustering Algorithm for Densely Deployed Sensor Networks", 《JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT》 *

Also Published As

Publication number Publication date
CN115086437B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
Tan et al. A new framework for DDoS attack detection and defense in SDN environment
US10454830B2 (en) System and method for load balancing in a data network
US9306840B2 (en) Securing software defined networks via flow deflection
Bhowmik et al. High performance publish/subscribe middleware in software-defined networks
US20200403904A1 (en) Data Processing Method, Apparatus, and System
Liu et al. F10: A fault-tolerant engineered network
Tahaei et al. A multi-objective software defined network traffic measurement
US10404611B2 (en) Discovering path maximum transmission unit
CN110798517A (en) Decentralized cluster load balancing method and system, mobile terminal and storage medium
TWI707560B (en) Service function chain path selection method and system
US11632288B2 (en) Determining the impact of network events on network applications
Wette et al. DCT2Gen: A traffic generator for data centers
CN112929200A (en) SDN multi-controller oriented anomaly detection method
Tang et al. Elephant Flow Detection Mechanism in SDN‐Based Data Center Networks
Basat et al. Cooperative network-wide flow selection
WO2020187295A1 (en) Monitoring of abnormal host
CN116723143B (en) Network target range resource allocation method and system based on traffic affinity
Yang et al. Machine learning based proactive flow entry deletion for openflow
CN115086437B (en) Gradient aggregation acceleration method and device based on clustering and XDP technology
CN109308210B (en) Method for optimizing NFV forwarding service chain performance on multi-core server
Wette et al. DCT2Gen: A Versatile TCP Traffic Generator for Data Centers
Cui et al. Closer: Scalable load balancing mechanism for cloud datacenters
WO2017105431A1 (en) Dataflow consistency verification
US11881997B1 (en) Determining reorder commands for remote reordering of policy rules
CN109361658A (en) Abnormal flow information storage means, device and electronic equipment based on industry control industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant