CN111131080B - Distributed deep learning flow scheduling method, system and equipment

Distributed deep learning flow scheduling method, system and equipment

Info

Publication number: CN111131080B
Application number: CN201911363582.9A
Authority: CN (China)
Prior art keywords: priority, node, ddl, task, central
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111131080A (en)
Inventors: 虞红芳, 孙罡, 周攀, 和新树
Assignee: University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN201911363582.9A
Publication of CN111131080A
Application granted
Publication of CN111131080B

Classifications

    • H04L 49/90: Packet switching elements; buffering arrangements
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5072: Partitioning or combining of resources; grid computing
    • H04L 41/147: Network analysis or design for predicting network behaviour
    • H04L 47/6275: Queue scheduling based on priority
    • H04L 49/3018: Input queuing
    • H04L 49/3027: Output queuing

Abstract

The invention discloses a distributed deep learning (DDL) stream scheduling method, system and device, and relates to the technical field of computers. The device can deploy the stream scheduling system and realize stream scheduling by the method. Starting from the flow characteristics of DDL training, the method schedules the data streams of DDL training tasks in a mode that gives priority to high precision improvement. The invention divides DDL training tasks into priorities and periodically updates those priorities: the precision improvement of each task over the coming scheduling period is predicted from the task's historical training data, and the tasks are ranked by predicted improvement to determine their priorities. Meanwhile, the invention accounts for the limited number of network priority levels and simulates an effectively unlimited number of priorities with only a few levels by mapping global priorities to local priorities.

Description

Distributed deep learning flow scheduling method, system and equipment
Technical Field
The invention relates to the technical field of computers, in particular to a task-phase-aware distributed deep learning flow scheduling method, system and device.
Background
Deep Learning (DL), an important branch of Machine Learning (ML), has triggered a wave of research and achieved major breakthroughs in fields such as computer vision, speech recognition and natural language processing. Deep learning analyzes sample data in depth by designing a neural network model and, through a long iterative training process, finds the optimal configuration of the network's structural parameters, thereby extracting higher-level and more abstract characteristics of the data and applying the learned abstract characteristics to the classification and other processing of new samples. To find the optimal configuration of structural parameters, deep learning usually requires designing several different neural network structures, each of which must be iteratively trained many times according to some algorithm. These algorithms often contain several manually set "hyper-parameters", and different "hyper-parameters" also affect the performance of the neural network model. Therefore, several different "hyper-parameter" configurations need to be tried, and the model is trained under each configuration to obtain the corresponding optimal model parameters. Consequently, even a single deep learning task may comprise many training tasks, each using a different neural network structure and "hyper-parameter" configuration for model training, so that the best-performing neural network model can be selected. Under a fixed network structure and "hyper-parameter" configuration, the model precision rises and finally converges as the number of training iterations grows; when the precision curve converges, the corresponding model parameters are the optimal neural network parameters for the current structure and "hyper-parameter" configuration.
As the application range and task difficulty of deep learning grow, its data sets and models become increasingly large, and the storage and computing capacity of a single computing device can no longer carry the whole training process of a deep learning task. Meanwhile, the sample data of a deep learning task may originate from data centers distributed across multiple regions and, due to privacy- and security-related regulations and considerations, cannot simply be copied to one data center for training. To cope with the limited computing power of a single device and the unavoidable distribution of sample data, DDL (Distributed Deep Learning) emerged.
DDL partitions the complete training task across a distributed computer cluster. Each GPU (node) device in the cluster carries part of the learning task; multiple GPU devices perform the computation of each iteration independently and in parallel, and after each iteration the devices synchronize by communicating with one another and update the global model before starting the next iteration, until the whole model converges. In practice, DDL tasks usually adopt staged training: the learning process is divided into several stages, the precision of the model is evaluated at the end of each stage, and training tasks with poor precision are terminated so that other model structures or hyper-parameter configurations can be tried as early as possible. Taking the Parameter Server (PS) framework and the data-parallel mode commonly used in current DDL as an example, the whole model's structural parameters are stored distributively on several parameter servers, and the training samples are distributed to different working nodes. Each working node trains independently on its local sample data and sends its local update values to the parameter servers for synchronization, a process called "Push"; after receiving the computation results from the different working nodes, the parameter servers update the stored global model parameters and then send the updated parameters back to the working nodes for the next round of iterative computation, a process called "Pull".
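For illustration only (this sketch is not part of the patent text), the "Push"/"Pull" interaction of one PS-framework iteration can be outlined as follows; the class and function names are hypothetical:

```python
import numpy as np

class ParameterServer:
    """Minimal sketch of one parameter server shard (hypothetical API)."""
    def __init__(self, dim):
        self.params = np.zeros(dim)        # globally shared model parameters
        self.pending = []                  # worker updates received this round

    def push(self, local_update):
        self.pending.append(local_update)  # "Push": a worker sends its local update

    def pull(self):
        return self.params.copy()          # "Pull": a worker fetches fresh parameters

    def synchronize(self):
        # Aggregate all worker updates into the global model for the next round.
        self.params += np.mean(self.pending, axis=0)
        self.pending.clear()

def worker_iteration(ps, local_gradient):
    ps.push(local_gradient)                # end of local computation: Push
    return ps.pull()                       # start of the next iteration: Pull
```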
This distributed parallel computing approach can effectively exploit the computing power of many devices at once, greatly reducing computation time and complexity. However, as mentioned above, each deep learning task needs to deploy many training tasks with different network structures and "hyper-parameter" settings. To fully utilize cluster device resources, each DDL cluster deploys a large number of different deep learning tasks, so the cluster network always carries a large number of data streams from different training processes of different tasks. Scheduling these data streams with a reasonable flow scheduling scheme can shorten the communication time of training tasks during iteration, accelerate iteration, and ultimately produce a better-performing neural network model in less time.
Current common flow scheduling schemes include schemes oriented to the data-flow level and schemes oriented to the coflow level.
The main idea of the data-flow-level PIAS (Practical Information-Agnostic flow Scheduling) algorithm is to shorten flow completion time by sending small flows first, so as to minimize the average flow completion time. Data-flow-level scheduling algorithms can minimize the average flow completion time, but they cannot recognize that different data streams belong to the same learning task. They are therefore likely to preferentially schedule data streams from different tasks, which prolongs the average communication time of a task's update process and cannot effectively reduce the per-update communication time of DDL training.
The main idea of coflow-level scheduling algorithms, such as the Varys algorithm, is to minimize the average coflow completion time by reducing the coflow completion time (CCT) as much as possible. However, coflow-level scheduling algorithms tend to keep transmitting the data streams of the same task across successive iterations, which is unfavorable to the training of other tasks.
In summary, the network communication problem in DDL training has become a major obstacle limiting the development of distributed machine learning.
Disclosure of Invention
The invention provides a task-phase-aware distributed deep learning flow scheduling method, system and device, which can alleviate the above problems.
In order to alleviate the above problems, the technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a distributed deep learning stream scheduling method, including the following steps:
s1, initializing central coordination node
Figure BDA0002337820420000031
S2, the central coordination node randomly selects a computing node for each DDL task and adds it into the WorkerList;
S3, for each computing node in the WorkerList, the central coordination node sends a precision improvement prediction request to the agent node where the computing node is located;
S4, for each computing node in the WorkerList, the agent node where the computing node is located predicts the DDL task precision improvement value and sends the prediction to the central coordination node;
S5, the central coordination node sorts all the received precision improvement prediction values according to the principle that higher precision improvement gets higher priority, obtaining a global sequence of the computing nodes;
S6, for each agent node, the central coordination node acquires the local sequences of its computing nodes from the global sequence through a local priority sequence generation algorithm;
S7, for each computing node in the WorkerList, the central coordination node obtains the ingress port priority rule through an ingress port priority sequence generation algorithm according to the local sequence;
S8, for each computing node in the WorkerList, the central coordination node obtains the egress port priority rule through an egress port priority sequence generation algorithm according to the local sequence;
S9, for each computing node in the WorkerList, the central coordination node sends the ingress port priority rule and the egress port priority rule to the agent node where the computing node is located, and the agent node sets the priority of the data streams according to the rules and completes the DDL stream scheduling;
S10, wait for the duration t, then jump to step S1.
The technical effect of this technical scheme is as follows: the method periodically updates the priorities of DDL tasks. The precision improvement of each task over the coming scheduling period is predicted from the task's historical data, and the DDL tasks are ranked by predicted improvement to determine their priorities. The method takes the limited number of network priority levels into account and, by mapping global priorities to local priorities, makes a small number of priority levels behave like an unlimited number, which benefits the training of all tasks. While guaranteeing training quality, it shortens the communication time of training tasks during iteration, accelerates hyper-parameter search, and reduces the average task completion time.
Further, the duration t refers to the length of one T-Scheduler scheduling period.
Further, data are transmitted between the central coordination node and the agent nodes through sockets.
Further, in step S4, the agent node predicts the DDL task precision improvement value for the computing node according to the training information in a locally read log file, where the log file records the training information of all the computing nodes on that agent node.
The technical effect of this technical scheme is as follows: through the log file, the precision improvement value of a DDL task can be accurately predicted from its training information, thereby guiding the stream scheduling system to complete accurate data stream scheduling.
Further, in step S6, for each agent node, the method for acquiring the local sequences of its computing nodes includes the following steps:
a1, the central coordination node initializes its DDL task set JobSet ← ∅ and its computing node set NodeSet ← ∅;
a2, the central coordination node adds the computing nodes with the same IP address into the NodeSet; the computing nodes in the NodeSet are arranged according to the global sequence, so that the larger the precision improvement prediction value of the DDL task a computing node belongs to, the earlier that node is ranked in the NodeSet;
a3, the central coordination node adds the DDL task of each computing node in the NodeSet into the JobSet; the DDL tasks in the JobSet are arranged according to the global sequence, so that the larger a DDL task's precision improvement prediction value, the earlier it is ranked in the JobSet;
a4, the local sequence counter order of the computing nodes is initialized to 0;
a5, if the DDL task of a computing node in the NodeSet is the first in the JobSet, the central coordination node sets the local sequence of the computing node corresponding to that DDL task to order+1 and removes the first DDL task from the JobSet, so that the original second DDL task becomes the first;
a6, step a5 is repeated until the JobSet is empty, at which point the acquisition of the local sequences of the computing nodes is complete.
The technical effect of this technical scheme is as follows: a node's transmissions can be effectively controlled with a limited number of transmission queues.
Further, in step S7, for each computing node in the WorkerList, the method for acquiring the ingress port priority rule includes the following steps:
b1, the central coordination node initializes PeerNodes ← ∅;
b2, the central coordination node adds the other computing nodes that belong to the same DDL task as the current computing node into PeerNodes;
b3, the priority of the data flow from each computing node in PeerNodes to the current node is set: if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the ingress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch. At this point the acquisition of the ingress port priority rule is complete.
The technical effect of this technical scheme is as follows: ingress port priority rules can be set even for DDL tasks whose rank exceeds the maximum number of priority queues, under the limited priority queues supported by the switch.
Further, in step S8, for each computing node, the method for acquiring the egress port priority rule is as follows:
if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the egress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch. At this point the acquisition of the egress port priority rule is complete.
The technical effect of this technical scheme is as follows: egress port priority rules can be set even for DDL tasks whose rank exceeds the maximum number of priority queues, under the limited priority queues supported by the switch.
In a second aspect, the present invention provides a distributed deep learning stream scheduling system, which includes a central coordination node and a plurality of agent nodes, where the agent nodes can run a plurality of computing nodes respectively subordinate to different DDL tasks.
The technical effect of this technical scheme is as follows: through the central coordination node and the agent nodes, distributed deep learning flow scheduling can be realized without large-scale modification of the underlying distributed deep learning system.
Further, the agent node comprises a training information collection module, a precision improvement prediction module and a flow priority policy execution module;
the training information collection module is used for collecting the DDL task information data streams of the computing nodes;
the precision improvement prediction module is used for predicting the precision improvement value of a computing node's DDL task by curve fitting;
the flow priority policy execution module is used for receiving the ingress port priority rule and the egress port priority rule and setting the priority of the data streams according to them;
the central coordination node comprises a global information receiving module and a priority rule generating module;
the global information receiving module is used for receiving the precision improvement prediction values, sorting them, and ranking all the computing nodes accordingly;
the priority rule generating module is used for generating an ingress port priority rule and an egress port priority rule for each computing node.
The technical effect of this technical scheme is as follows: the training information collection module and the precision improvement prediction module jointly enable accurate prediction of a DDL task's training stage, thereby guiding the central coordination node in making the flow scheduling strategy.
In a third aspect, the present invention provides an apparatus comprising a switch and a plurality of servers connected through the switch, where one server is provided with the central coordination node and each of the other servers is provided with an agent node.
The technical effect of the technical scheme is as follows: provided is a device capable of deploying a distributed deep learning stream scheduling system and realizing distributed deep learning stream scheduling.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of a distributed deep learning flow scheduling method described in embodiment 1;
fig. 2 is a flowchart of the method for acquiring the local sequences of an agent node's computing nodes in embodiment 1;
fig. 3 is a flowchart of the method for acquiring the ingress port priority rule of a computing node in embodiment 1;
FIG. 4 is a diagram illustrating mapping of global task priority and local queue priority in embodiment 1;
fig. 5 is a schematic diagram of incoming flow scheduling using a switch priority queue in embodiment 1;
fig. 6 is a schematic diagram of outgoing stream scheduling using a Linux flow controller in embodiment 1;
fig. 7 is an architecture diagram of the distributed deep learning flow scheduling system described in embodiment 2;
FIG. 8 is a schematic view of the apparatus described in embodiment 3.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Embodiment 1
Referring to fig. 1, the present embodiment provides a method for scheduling a distributed deep learning stream, including the following steps:
s1, initializing central coordination node
Figure BDA0002337820420000071
S2, the central coordination node randomly selects a computing node for each DDL task and adds it into the WorkerList;
S3, for each computing node in the WorkerList, the central coordination node sends a precision improvement prediction request to the agent node where the computing node is located;
S4, for each computing node in the WorkerList, the agent node where the computing node is located predicts the DDL task precision improvement value and sends the prediction to the central coordination node;
S5, the central coordination node sorts all the received precision improvement prediction values according to the principle that higher precision improvement gets higher priority, obtaining a global sequence of the computing nodes;
S6, for each agent node, the central coordination node acquires the local sequences of its computing nodes from the global sequence through a local priority sequence generation algorithm;
S7, for each computing node in the WorkerList, the central coordination node obtains the ingress port priority rule through an ingress port priority sequence generation algorithm according to the local sequence;
S8, for each computing node in the WorkerList, the central coordination node obtains the egress port priority rule through an egress port priority sequence generation algorithm according to the local sequence;
S9, for each computing node in the WorkerList, the central coordination node sends the ingress port priority rule and the egress port priority rule to the agent node where the computing node is located, and the agent node sets the priority of the data streams according to the rules and completes the DDL stream scheduling;
S10, wait for the duration t, then jump to step S1.
In this embodiment, the duration t refers to the length of one T-Scheduler scheduling period; that is, after waiting for one scheduling period, the method returns to step S1, realizing periodic updating of DDL task priorities. The scheduling period is the time reference for DDL task training, and users can adjust it according to their own training conditions.
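Putting steps S1-S10 together, a minimal sketch of the periodic scheduling loop might look as follows; every helper function and the task/node attributes are assumptions, not the patent's exact interfaces:

```python
import random
import time

def scheduling_loop(ddl_tasks, period_t, max_queues):
    """Hypothetical T-Scheduler main loop for steps S1-S10."""
    while True:
        # S1-S2: randomly pick one computing node per DDL task
        worker_list = [random.choice(task.nodes) for task in ddl_tasks]
        # S3-S4: ask each node's agent for a precision improvement prediction
        gains = {n: request_precision_gain(n.agent_ip, n.node_id) for n in worker_list}
        # S5: global sequence, higher predicted improvement first
        global_order = sorted(worker_list, key=lambda n: gains[n], reverse=True)
        for node in worker_list:
            local = local_priority_of(node, global_order)       # S6
            ingress = ingress_priority_rules(node, local)       # S7
            egress = min(local, max_queues)                     # S8
            send_rules_to_agent(node, ingress, egress)          # S9
        time.sleep(period_t)                                    # S10
```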
In this embodiment, cross-node information interaction occurs only between the central coordination node and the agent nodes; in short, requests are initiated and answered only between the central coordination node and the agent nodes. Data is transmitted between them through sockets, a socket being the endpoint through which an application program can send or receive data.
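As a sketch of that socket interaction, the central coordination node might request a prediction from an agent node as below; the port number and the JSON message format are assumptions:

```python
import json
import socket

AGENT_PORT = 5005  # assumed port on which each agent node listens

def request_precision_gain(agent_ip, node_id):
    """Coordinator-side request for one computing node's predicted precision gain."""
    with socket.create_connection((agent_ip, AGENT_PORT)) as sock:
        sock.sendall(json.dumps({"type": "predict", "node": node_id}).encode())
        reply = sock.recv(4096)
    return json.loads(reply)["precision_gain"]
```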
In step S4 of this embodiment, the agent node predicts the precision improvement value of the DDL task for the computing node according to the training information in a locally read log file; the log file records the training information of all the computing nodes on that agent node. Each computing node generates its own log file for recording training information and, after training starts, appends each newly generated piece of training information to its log file.
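The patent does not fix the form of the fitted curve, so the sketch below assumes a saturating exponential fitted to the logged precision values, predicting the gain over the next scheduling period:

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating(x, a, b, c):
    # Precision typically rises and converges; a is the asymptote.
    return a - b * np.exp(-c * x)

def predict_precision_gain(iterations, precisions, horizon):
    """Fit the logged precision curve and return the predicted gain
    `horizon` iterations ahead (curve form and initial guess are assumptions)."""
    params, _ = curve_fit(saturating, np.asarray(iterations),
                          np.asarray(precisions), p0=(1.0, 1.0, 0.01),
                          maxfev=10000)
    last = iterations[-1]
    return saturating(last + horizon, *params) - saturating(last, *params)
```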
Referring to fig. 2, in step S6 of this embodiment, for each agent node, the method for acquiring the local sequences of its computing nodes includes the following steps (an illustrative sketch follows the steps):
a1, the central coordination node initializes its DDL task set JobSet ← ∅ and its computing node set NodeSet ← ∅;
a2, the central coordination node adds the computing nodes with the same IP address into the NodeSet; the computing nodes in the NodeSet are arranged according to the global sequence, so that the larger the precision improvement prediction value of the DDL task a computing node belongs to, the earlier that node is ranked in the NodeSet;
a3, the central coordination node adds the DDL task of each computing node in the NodeSet into the JobSet; the DDL tasks in the JobSet are arranged according to the global sequence, so that the larger a DDL task's precision improvement prediction value, the earlier it is ranked in the JobSet;
a4, the local sequence counter order of the computing nodes is initialized to 0;
a5, if the DDL task of a computing node in the NodeSet is the first in the JobSet, the central coordination node sets the local sequence of the computing node corresponding to that DDL task to order+1 and removes the first DDL task from the JobSet, so that the original second DDL task becomes the first;
a6, step a5 is repeated until the JobSet is empty, at which point the acquisition of the local sequences of the computing nodes is complete.
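A minimal sketch of steps a1-a6, assuming each node object carries a `task` attribute and that `node_set` and `job_set` arrive already sorted in the global order:

```python
def local_sequence(node_set, job_set):
    """Assign a dense local sequence 1..k to the computing nodes on one
    agent node (steps a4-a6; the data model is an assumption)."""
    local = {}
    order = 0                                   # a4: initialize the counter
    while job_set:                              # a6: repeat until JobSet is empty
        head = job_set.pop(0)                   # a5: remove the first DDL task
        for node in node_set:
            if node.task is head and node not in local:
                order += 1                      # a5: local sequence = order + 1
                local[node] = order
                break
    return local
```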
In this embodiment, data is transmitted between the central coordination node and the agent nodes through the switch. Referring to fig. 3, in step S7, the method for acquiring the ingress port priority rule of each computing node in the WorkerList includes the following steps:
b1, the central coordination node initializes PeerNodes ← ∅;
b2, the central coordination node adds the other computing nodes that belong to the same DDL task as the current computing node into PeerNodes;
b3, the priority of the data flow from each computing node in PeerNodes to the current node is set: if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the ingress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch. At this point the acquisition of the ingress port priority rule is complete.
In this embodiment, PeerNodes stores the computing nodes that execute the same DDL task as the current computing node. Setting the ingress port priority of the current computing node means setting the priority of the data sent to it by the computing nodes in PeerNodes; the direction is from the other computing nodes to the current node, i.e., into its ingress port. Computing nodes subordinate to other tasks do not communicate with the current computing node, so data flow priorities are set only for the nodes in PeerNodes. That is, for each computing node, once the priority of the data flow from each node in its PeerNodes to itself has been set, the acquisition of its ingress port priority rule is complete.
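Under the same assumed data model (the `task` and `ip` attributes are hypothetical), steps b1-b3 can be sketched as:

```python
def ingress_priority_rules(node, all_nodes, local_seq, max_queues):
    """Priority rules for data flows arriving at `node` from its peers."""
    peer_nodes = [n for n in all_nodes                 # b1-b2: same-task peers
                  if n.task is node.task and n is not node]
    prio = min(local_seq[node], max_queues)            # b3: cap at the switch limit
    return {(peer.ip, node.ip): prio for peer in peer_nodes}
```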
In general, the central coordination node and the agent nodes are deployed on different servers connected through a switch. Since the server's TCP/IP stack and NIC do not support multi-priority queues, when this device performs the operations of the foregoing embodiments, data flow scheduling must rely on the priority queues of the edge switch directly connected to the servers. To let the edge switch identify the priority of a data flow, the Linux kernel tool IPTables is used to modify the DSCP field in the packet's IP header; the switch then maps packets onto different priority queues according to their DSCP values. Current commercial switches support only 4-8 priority queues per port, while the number of tasks to transmit is usually far larger, so a task's priority cannot simply be taken as its queue priority. Instead, a greedy strategy approximates the ideal behavior with a limited number of priority queues: the central coordination node sorts the tasks on the same server locally according to their global task priorities. As shown in fig. 4, four different nodes are connected to the same server and deploy different DDL tasks. The central coordination node assigns them global task priorities (4,9,20,28) and maps these to local priorities (1,2,3,4) according to the order of the global priorities. The server then forwards its different data streams according to the local priorities. If the number of tasks on a server exceeds the number M of priority queues, data streams whose local ordering exceeds M are transmitted in the M-th priority queue. The priority scheduling of incoming interfaces is shown in fig. 5.
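The greedy mapping described above can be sketched as follows; with the example global priorities (4,9,20,28) and M = 8 it reproduces the local priorities (1,2,3,4) of fig. 4:

```python
def map_to_local(global_priorities, max_queues):
    """Map one server's global task priorities to local queue priorities,
    capping every local ordering greater than M at M."""
    ranked = sorted(global_priorities)  # smaller global value = higher priority
    return {g: min(ranked.index(g) + 1, max_queues) for g in global_priorities}

# map_to_local((4, 9, 20, 28), 8) -> {4: 1, 9: 2, 20: 3, 28: 4}
```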
In step S8 of this embodiment, for each computing node, the method for acquiring the egress port priority rule is as follows: if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the egress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch. At this point the acquisition of the egress port priority rule is complete.
A switch has multiple physical ports, and each port has multiple transmission queues for forwarding packets. A switch that supports priority scheduling can adopt different forwarding strategies according to the priorities of the transmission queues, such as forwarding higher-priority queues first. However, the number of transmission queues per port is limited, so this embodiment uses the maximum number of priority queues to describe the upper limit that the switch can support.
The agent node manages outgoing data streams with the Linux traffic controller. The traffic controller on each server can generate multiple priority queues for outgoing data streams and maps each data flow to a priority queue according to the information carried in its TCP/IP header fields. As with the incoming interface, the outgoing interface also has a limited number of priority queues: for all data streams leaving the same server, the task priority of each stream is mapped to a local priority, and if the number of local priorities exceeds the number M of priority queues, all priority values greater than M are mapped to M. The priority scheduling of the transmission ports is shown in fig. 6. By setting priorities on the incoming and outgoing interfaces at the same time, the data streams of any training task can be scheduled preferentially according to the task's progress improvement value. For any training task, its data stream transmission is temporarily blocked only when both the incoming and outgoing interfaces are occupied by data streams of other tasks with higher precision improvement values. In fact, the application scope of the invention is not limited to the traditional TCP multi-priority transmission queue framework; it can be deployed on any transport-layer control mechanism supporting priority scheduling.
The distributed deep learning flow scheduling method has the following characteristics:
1) Better suited to DDL training
Based on the flow characteristics of DDL training, a scheduling mode that gives priority to high precision improvement is proposed, performing task-level-aware data stream scheduling for DDL training tasks.
2) Shortens the average completion time of DDL tasks and accelerates DDL training
Following the staged training mode of DDL tasks, this embodiment mathematically predicts the performance improvement of a model from its precision at the end of a stage, preferentially schedules the tasks with the highest precision improvement, and prevents tasks with small precision improvement from occupying cluster network resources for a long time, shortening the average stage completion time of training tasks and thereby accelerating DDL training.
3) Low deployment cost
This embodiment needs no modification of the deep learning framework at the top layer or the network at the bottom layer, does not depend on a specific DDL training platform, and can be directly deployed on a general DDL training platform.
4) Adapts to multiple underlying networks
The application scope of this embodiment is not limited to the traditional TCP multi-priority transmission queue framework; it can be deployed on any transport-layer control mechanism supporting priority scheduling, and is even easier to deploy on priority scheduling mechanisms similar to pFabric, because pFabric supports an unlimited number of priority levels, so no mapping between task priority and local priority is required.
Embodiment 2
Referring to fig. 7, the present embodiment provides a distributed deep learning stream scheduling system, which includes a central coordination node and a plurality of agent nodes, where the agent nodes can run a plurality of computing nodes respectively subordinate to different DDL tasks.
In this embodiment, the agent node includes a training information collection module, a precision improvement prediction module, and a flow priority policy execution module.
The training information collection module is used for collecting the DDL task information data streams of the computing nodes, including the IP address and port number each node uses for communication, and the iteration round, start time, end time, and training and validation precision of each training iteration.
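For concreteness, one logged record per training iteration might look like the sketch below; the field names are assumptions:

```python
# Hypothetical log record appended by a computing node after one iteration.
record = {
    "ip": "10.0.0.12",           # communication IP of the node
    "port": 50051,               # communication port of the node
    "iteration": 1200,           # iteration round
    "start_time": 1577350000.0,  # iteration start (epoch seconds)
    "end_time": 1577350003.2,    # iteration end (epoch seconds)
    "train_precision": 0.913,    # training precision after this iteration
    "val_precision": 0.887,      # validation precision after this iteration
}
```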
The precision improvement prediction module is used for predicting the precision improvement value of a computing node's DDL task by curve fitting, i.e., predicting how much precision can be gained in the next several rounds of training.
The flow priority policy execution module is used for receiving the ingress port priority rule and the egress port priority rule and setting the priority of the data streams according to them.
In this embodiment, the central coordination node includes a global information receiving module and a priority rule generating module.
The global information receiving module is used for receiving the precision improvement prediction values, sorting them, and ranking all the computing nodes accordingly.
The priority rule generating module is used for generating an ingress port priority rule and an egress port priority rule for each computing node, converting the global sequence priority into local priorities on each agent node.
In this embodiment, the flow priority policy execution module sets stream priorities for the data streams through the TC and Iptables tools.
TC is the QoS support module of the Linux kernel network protocol stack, a tool that adds QoS functions to the upper-layer protocols. TC controls sending rather than receiving, so it can only control the packet sending rate at the network card where the bottleneck arises. TC's traffic control covers shaping, scheduling, policing, dropping, and queueing disciplines. Using TC mainly involves: configuring a queueing discipline for the network card; establishing classes on the queue; establishing sub-queues and sub-classes as needed; and establishing filters for each class. In this embodiment, TC rules are set for the outgoing traffic of DDL training so as to schedule outgoing flows.
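As a hedged sketch of such TC usage (the device name, number of bands, and fwmark-based filters are assumptions, not the patent's exact rules), an agent node might configure its egress queues like this:

```python
import subprocess

def run(cmd):
    subprocess.run(cmd.split(), check=True)

def setup_egress_queues(dev="eth0", bands=8):
    """Attach a PRIO qdisc to the NIC and steer packets carrying firewall
    mark k into band k; flows would be marked elsewhere (e.g. by iptables)."""
    run(f"tc qdisc add dev {dev} root handle 1: prio bands {bands} "
        f"priomap {' '.join(['0'] * 16)}")
    for band in range(1, bands + 1):
        run(f"tc filter add dev {dev} parent 1: protocol ip "
            f"handle {band} fw classid 1:{band}")
```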
Iptables is application software running in user space; it manages the processing and forwarding of network packets by controlling the Linux kernel's packet management module. Iptables has three levels: tables, chains and rules. Tables correspond to different packet-processing flows; for example, the filter table performs packet filtering, while the nat table performs address translation on connections. Each table may contain multiple chains, and the system passes packets through certain built-in chains according to predetermined rules, e.g., locally generated packets pass through the OUTPUT chain. A chain may contain several rules, which are matched one by one; when a rule matches, the corresponding action is executed, such as modifying the packet. In this embodiment, Iptables is used to modify the DSCP field of IP packets so that the switch in the cluster can schedule the incoming flows of the agent nodes according to the DSCP field.
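A corresponding sketch of DSCP marking with Iptables; the TCP source-port match and the DSCP value are examples, and a real deployment would derive them from the flow's priority rule:

```python
import subprocess

def mark_flow_dscp(src_port, dscp):
    """Set the DSCP field on a flow's outgoing packets so that the edge
    switch can map them onto its priority queues."""
    subprocess.run(
        ["iptables", "-t", "mangle", "-A", "OUTPUT",
         "-p", "tcp", "--sport", str(src_port),
         "-j", "DSCP", "--set-dscp", str(dscp)],
        check=True)
```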
Embodiment 3
Referring to fig. 8, this embodiment provides an apparatus comprising a switch and a plurality of servers connected through the switch; one server is provided with the central coordination node, and each of the other servers is provided with an agent node. The invention places no special requirements on the servers, but it requires that the switch support priority scheduling of its transmission queues and perform differentiated forwarding according to queue priority.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A distributed deep learning stream scheduling method is characterized by comprising the following steps:
S1, the central coordination node initializes WorkerList ← ∅;
S2, the central coordination node randomly selects a computing node for each DDL task and adds it into the WorkerList;
S3, for each computing node in the WorkerList, the central coordination node sends a precision improvement prediction request to the agent node where the computing node is located;
S4, for each computing node in the WorkerList, the agent node where the computing node is located predicts the DDL task precision improvement value and sends the prediction to the central coordination node;
S5, the central coordination node sorts all the received precision improvement prediction values according to the principle that higher precision improvement gets higher priority, obtaining a global sequence of the computing nodes;
S6, for each agent node, the central coordination node acquires the local sequences of its computing nodes from the global sequence through a local priority sequence generation algorithm;
S7, for each computing node in the WorkerList, the central coordination node obtains the ingress port priority rule through an ingress port priority sequence generation algorithm according to the local sequence;
S8, for each computing node in the WorkerList, the central coordination node obtains the egress port priority rule through an egress port priority sequence generation algorithm according to the local sequence;
S9, for each computing node in the WorkerList, the central coordination node sends the ingress port priority rule and the egress port priority rule to the agent node where the computing node is located, and the agent node sets the priority of the data streams according to the rules and completes the DDL stream scheduling;
S10, after waiting for the duration t, jump to step S1;
in step S7, for each computing node in the WorkerList, the method for acquiring the ingress port priority rule includes the following steps:
b1, the central coordination node initializes PeerNodes ← ∅;
b2, the central coordination node adds the other computing nodes that belong to the same DDL task as the current computing node into PeerNodes;
b3, the priority of the data flow from each computing node in PeerNodes to the current node is set: if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the ingress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch, at which point the acquisition of the ingress port priority rule is complete;
in step S8, for each computing node, the method for acquiring the egress port priority rule includes:
if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the egress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch, at which point the acquisition of the egress port priority rule is complete.
2. The method of claim 1, wherein the duration t refers to the length of one T-Scheduler scheduling period.
3. The method of claim 1, wherein data is transmitted between the central coordination node and the agent nodes via sockets.
4. The method according to claim 1, wherein in step S4, the agent node predicts the DDL task precision improvement value for the computing node according to the training information in a locally read log file, where the log file records the training information of all the computing nodes on that agent node.
5. The method according to claim 1, wherein in step S6, for each agent node, the method for acquiring the local sequences of its computing nodes comprises the following steps:
a1, the central coordination node initializes its DDL task set JobSet ← ∅ and its computing node set NodeSet ← ∅;
a2, the central coordination node adds the computing nodes with the same IP address into the NodeSet; the computing nodes in the NodeSet are arranged according to the global sequence, so that the larger the precision improvement prediction value of the DDL task a computing node belongs to, the earlier that node is ranked in the NodeSet;
a3, the central coordination node adds the DDL task of each computing node in the NodeSet into the JobSet; the DDL tasks in the JobSet are arranged according to the global sequence, so that the larger a DDL task's precision improvement prediction value, the earlier it is ranked in the JobSet;
a4, the local sequence counter order of the computing nodes is initialized to 0;
a5, if the DDL task of a computing node in the NodeSet is the first in the JobSet, the central coordination node sets the local sequence of the computing node corresponding to that DDL task to order+1 and removes the first DDL task from the JobSet, so that the original second DDL task becomes the first;
a6, step a5 is repeated until the JobSet is empty, at which point the acquisition of the local sequences of the computing nodes is complete.
6. A distributed deep learning stream scheduling system is characterized by comprising a central coordination node and a plurality of agent nodes, wherein the agent nodes can run a plurality of computing nodes respectively subordinate to different DDL tasks;
the agent node comprises a training information collection module, a precision improvement prediction module and a flow priority policy execution module;
the training information collection module is used for collecting the DDL task information data streams of the computing nodes;
the precision improvement prediction module is used for predicting the precision improvement value of a computing node's DDL task by curve fitting;
the flow priority policy execution module is used for receiving the ingress port priority rule and the egress port priority rule and setting the priority of the data streams according to them;
the central coordination node comprises a global information receiving module and a priority rule generating module;
the global information receiving module is used for receiving the precision improvement prediction values, sorting them, and ranking all the computing nodes accordingly;
the priority rule generating module is used for generating an ingress port priority rule and an egress port priority rule for each computing node;
the method for acquiring the ingress port priority rule comprises the following steps:
b1, the central coordination node initializes PeerNodes ← ∅;
b2, the central coordination node adds the other computing nodes that belong to the same DDL task as the current computing node into PeerNodes;
b3, the priority of the data flow from each computing node in PeerNodes to the current node is set: if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the ingress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch, at which point the acquisition of the ingress port priority rule is complete;
the method for acquiring the egress port priority rule comprises:
if the local sequence is less than or equal to the maximum number of priority queues supported by the switch, the central coordination node sets the priority of the egress port priority rule to the local sequence; otherwise, it sets the priority to the maximum number of priority queues supported by the switch, at which point the acquisition of the egress port priority rule is complete.

Priority application: CN201911363582.9A, filed 2019-12-26, titled "Distributed deep learning flow scheduling method, system and equipment"

Publications: CN111131080A (published 2020-05-08), CN111131080B (granted 2021-09-07)

Family ID: 70502832




Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant