CN111966513A - Prior-knowledge-free Coflow multi-stage queue scheduling method and device, and scheduling equipment thereof
- Publication number: CN111966513A (application CN202010895515.8A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/548—Queue
Abstract
The invention discloses a prior-knowledge-free Coflow multi-stage queue scheduling method, device, and scheduling equipment. The method comprises the following steps: when executing a job that generates a Coflow, selecting a host node that already holds the data required by the job and/or the host node with the minimum network load as the computing node; and, when the flows in each Coflow are scheduled according to the priority order of the multi-level queues, if the sending port of a host node produces idle space, preferentially using that idle space for flow scheduling. Based on the state information of the nodes, the invention reasonably places the flows of each Coflow, exploits the idle space produced at host-node sending ports during multi-stage queue scheduling, reduces the overall Coflow completion time, and improves data-center performance.
Description
Technical Field
The invention relates to queue scheduling technology, in particular to a prior-knowledge-free Coflow multi-stage queue scheduling method and device, and scheduling equipment thereof.
Background
In a cloud data center, distributed parallel computing frameworks such as MapReduce and Spark are generally adopted to process large-scale data. Because a distributed computing framework is used, a job is often divided into a plurality of subtasks handed to multiple computers in the data center, and a large number of intermediate communication data streams are generated when the subtasks are distributed and their results are merged. If a data stream cannot complete in time, the subsequent subtasks that depend on its result cannot proceed, and the completion time of the whole job is ultimately prolonged.
In current research, a set of semantically correlated communication data flows is called a Coflow; a Coflow is thus a set of data flows. Taking the MapReduce parallel computing framework as an example, the Map stage needs to divide and distribute the tasks of a job (shuffle), generating intermediate communication data streams, and the Reduce stage needs to read the intermediate results produced by the Map stage, also generating intermediate communication data streams. To improve the performance of a cloud data center and the completion time of the jobs in it, the Coflow Completion Time (CCT) must be optimized rather than the Flow Completion Time (FCT) of a single communication data flow.
Existing methods for optimizing Coflows are mainly scheduling methods that rely on prior knowledge such as the Coflow size. Among them, Varys achieves the best scheduling effect: it adopts a Smallest-Effective-Bottleneck-First (SEBF) mechanism to preferentially schedule the Coflow with the smaller bottleneck and controls the sending rates of the internal flows so that they all complete at the same time, saving port bandwidth and freeing space for other Coflows. Its disadvantage is that information such as the Coflow size must be obtained in advance, which in general can be known only after the Coflow has completed, so its practicality is limited.
Disclosure of Invention
The invention provides a prior-knowledge-free Coflow multi-stage queue scheduling method and device which, without any prior knowledge, reduce the average Coflow completion time and ensure availability through a flow placement strategy computed from node state and through the exploitation of idle space during multi-stage queue scheduling.

To this end, the invention provides a prior-knowledge-free Coflow multi-stage queue scheduling method comprising the following steps: when executing a job that generates a Coflow, selecting a host node that already holds the data required to execute the job and/or the host node with the minimum network load as the computing node to execute the job; and, when the flows in each Coflow are scheduled according to the priority order of the multi-level queues, if the sending port of a host node produces idle space, preferentially using that idle space for flow scheduling.
Preferably, the method for selecting the host node with the minimum network load as the computing node comprises: evaluating the accumulated to-be-processed data traffic at the receive port of each host node, and selecting the host node whose receive port has the smallest accumulated to-be-processed traffic as the computing node. The accumulated to-be-processed traffic at the receive port of host node j after the t-th time interval is computed as:

Arr_j^t = max(0, Arr_j^(t-1) + d_j^t − b_j^t)

where d_j^t is the amount of data received by the receive port of host node j during the t-th time interval, and b_j^t is the available bandwidth of host node j during the t-th time interval.
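As an illustration, the following Python sketch updates each host's receive-port backlog per interval and selects the least-loaded host. All names are hypothetical, and the backlog recursion is reconstructed from the definitions above rather than taken verbatim from the patent:

```python
# Hypothetical sketch: selecting the least-loaded host as the compute node.
# backlogs[j] tracks the accumulated unprocessed traffic at host j's receive port.

def update_backlog(arr_prev: float, received: float, bandwidth: float) -> float:
    """Backlog after interval t: previous backlog plus newly received data,
    minus what the available bandwidth could drain (never negative)."""
    return max(0.0, arr_prev + received - bandwidth)

def pick_min_load_node(backlogs: dict[str, float]) -> str:
    """Choose the host whose receive port has the smallest pending backlog."""
    return min(backlogs, key=backlogs.get)

# Usage: after each interval, update every host's backlog, then select.
backlogs = {"host-1": 0.0, "host-2": 0.0, "host-3": 0.0}
received = {"host-1": 5.0, "host-2": 1.0, "host-3": 3.0}   # data arriving in interval t
bandwidth = {"host-1": 2.0, "host-2": 2.0, "host-3": 2.0}  # drainable amount in interval t
for j in backlogs:
    backlogs[j] = update_backlog(backlogs[j], received[j], bandwidth[j])
print(pick_min_load_node(backlogs))  # -> "host-2"
```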
Preferably, when the host node schedules the traffic of each Coflow according to the priority order of the multi-level queues, if the sending port of the host node produces idle space, the flow scheduling of the low-priority queues is started ahead of time in that idle space, so that the sending port of the host node is fully utilized for traffic scheduling.

Preferably, when the host node schedules the traffic of each Coflow according to the priority order of the multi-level queues, a FIFO scheduling mode is adopted within each level of the multi-level queues, and a weighted fair queue scheduling mode is adopted between different levels.

Preferably, the priority order of the multi-stage queues is determined by the order in which the Coflows in each queue arrive at the sending port of the host node.

Preferably, when the host node schedules the traffic of the Coflows in each queue according to the priority order of the multi-level queues, if the traffic already scheduled for a high-priority Coflow exceeds the threshold of its current priority queue, the Coflow is demoted, together with its queue priority, to the next lower level.
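By way of illustration, here is a minimal sketch of this threshold-based demotion rule. The concrete thresholds 1, 2, 4, 8 are taken from the embodiment described below; the function name is hypothetical:

```python
# Hypothetical sketch of the threshold-based demotion rule: once a Coflow's
# cumulatively sent data crosses its current queue's threshold, it moves down.
QUEUE_THRESHOLDS = [1, 2, 4, 8]  # example thresholds, as in the embodiment

def queue_level(sent_bytes: float) -> int:
    """Return the priority level (0 = highest) for a Coflow that has
    already sent `sent_bytes` units of data."""
    for level, threshold in enumerate(QUEUE_THRESHOLDS):
        if sent_bytes < threshold:
            return level
    return len(QUEUE_THRESHOLDS)  # beyond all thresholds: lowest priority

assert queue_level(0.5) == 0
assert queue_level(3) == 2      # crossed thresholds 1 and 2, now at level 2
assert queue_level(100) == 4    # beyond all thresholds
```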
The invention also provides a scheduling device for the prior-knowledge-free Coflow multi-stage queue scheduling method, comprising:

a global coordinator, arranged on a central host node, for selecting, via the Coflow traffic placement strategy, suitable computing nodes for the flows in the Coflow of a job so as to execute the job, for determining the priority of each Coflow, and for exploiting the idle space produced by the sending ports of the host nodes during Coflow scheduling;

and a scheduling module, arranged on each host node, for sending the traffic information of each Coflow scheduled in that host node to the global coordinator, and for scheduling the Coflows in the host node according to the priorities determined by the global coordinator.

Preferably, the Coflow traffic placement strategy is: selecting, as the computing nodes of the Coflow, the host nodes that already hold the data required to execute the job and/or the host nodes whose receive ports carry the minimum network load.

Preferably, the global coordinator keeps statistics of the scheduled traffic of each Coflow, and if the scheduled traffic of a Coflow exceeds the threshold of its current priority queue, the Coflow's priority is lowered.

Preferably, when the sending port of a host node produces idle space while scheduling traffic, the global coordinator directly starts the Coflow scheduling of the low-priority queues in that idle space.
The invention further provides scheduling equipment comprising a plurality of host nodes communicatively connected to each other, each host node comprising a processor, a memory, and a computer program stored in the memory and executable on the processor; the processor executes the computer program to implement the aforementioned prior-knowledge-free Coflow multi-stage queue scheduling method.
The invention has the following advantages:
according to the invention, based on the state information of the nodes, a proper calculation node is selected for the flow in each flow, and the flow in the flow is reasonably placed, so that the data transmission is reduced, and the flow completion time is reduced. The invention also improves and utilizes the sending port which generates the idle space in the multi-stage queue scheduling, directly starts the flow scheduling in the low-priority queue in the sending port, reduces the overall flow completion time, improves the performance of the data center and ensures the availability.
Drawings
Fig. 1 is a flowchart of the prior-knowledge-free Coflow multi-stage queue scheduling method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of data center traffic provided in an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the prior-knowledge-free Coflow multi-stage queue scheduling apparatus according to an embodiment of the present invention;

Fig. 4 compares the prior-knowledge-free Coflow multi-stage queue scheduling method of an embodiment of the present invention with existing scheduling methods.
Detailed Description
The prior-knowledge-free Coflow multi-stage queue scheduling method, apparatus, and scheduling equipment are described in detail below with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will become apparent from the following description and from the claims. It should be noted that the drawings are in a very simplified form and use imprecise proportions, serving only to conveniently and clearly aid the description of the embodiments of the invention.

It should be noted that the data center of the present invention comprises a plurality of host computers (host nodes) for parallel computing over large-scale data. Each host node has a sending port and a receiving port. When a host node acts as a computing node, its receiving port receives the Coflow traffic sent by other host nodes; the node executes the subtask corresponding to that traffic and sends the computation result to the corresponding host node through its sending port. When a host node acts as a sending node, one or more Coflows may be present at its sending port; while sending the flows to the computing nodes, the sending node schedules the multi-level Coflow queues at the sending port so as to shorten the job completion time and improve data-center performance.
As shown in fig. 1, the invention discloses a prior-knowledge-free Coflow multi-stage queue scheduling method comprising the following steps:

S1, the global coordinator selects a suitable computing node for the flows in each Coflow using the traffic placement strategy, and notifies each sending node to send the flows of the Coflow to the corresponding computing node.

Specifically, the global coordinator monitors whether each job generates a Coflow. Different placements of the flows in a Coflow assign the subtasks of the job (i.e., the data carried by the data streams) to different host nodes (i.e., computing nodes); a reasonable assignment reduces the amount of data transmitted, reduces time overhead, and improves the performance of the cloud data center. The traffic placement strategy adopted by the invention selects suitable computing nodes for the flows in each Coflow by judging the state of the computing nodes so as to reduce time overhead. The state considered is: whether the computing node already holds the data needed to execute the job, and the network load of the computing node.
Since the HDFS file system of a cloud data center generally keeps redundant backups, i.e., the data blocks of a file may exist on several different hosts, selecting a node that already holds the data required to execute the job as the computing node avoids extra data-transmission overhead. Suppose a job, when scheduled, generates one Coflow C_n containing k data streams (corresponding to k subtasks); the whole Coflow completes only when all k data streams have completed. The placement of the Coflow is expressed by the indicator

P_{i,j}^n ∈ {0, 1},

where P_{i,j}^n denotes whether the i-th data stream of C_n can select host node j as a potential computing node: a value of 1 means host node j is suitable, otherwise it is 0. All nodes with P_{i,j}^n = 1 form the set of candidate computing nodes for subtask i.

P_{i,j}^n = 1 only indicates that computing node j can execute subtask i directly and thereby avoid redundant data-transmission traffic; after receiving the execution command for subtask i and completing it, host node j must still transmit back the computed result, which still generates a large amount of data traffic. At that point the Coflow Completion Time (CCT) is affected not only by the computing performance of the node but also by its network bandwidth, its network load, and the flow-scheduling algorithm. Therefore, selecting the node whose receive port carries the least network load as the Coflow's computing node can reduce the CCT. Since the receive port of computing node j may receive traffic from multiple Coflows over a period of time, it is abstracted as a queue Arr_j. With d_j^t denoting the amount of data received by computing node j during the t-th time interval and b_j^t the available bandwidth of node j in that interval, the accumulated to-be-processed data volume at the receive port of node j after the t-th interval is:

Arr_j^t = max(0, Arr_j^(t-1) + d_j^t − b_j^t)
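As an illustration only, the following sketch combines the two criteria above: first filter the hosts that already hold the needed data (the indicator P above), then pick the one with the smallest receive-port backlog. All names and the fallback behavior are hypothetical:

```python
# Hypothetical sketch of the two-part placement strategy: first filter hosts
# that already hold the data a subtask needs (P[i][j] == 1), then break ties
# by the smallest receive-port backlog Arr[j].

def place_subtask(i: int,
                  data_location: dict[int, set[str]],
                  backlogs: dict[str, float]) -> str:
    """Pick a compute node for subtask i of a Coflow."""
    candidates = data_location.get(i, set())  # hosts with P[i][j] == 1
    if not candidates:
        candidates = set(backlogs)            # no local copy: consider all hosts
    return min(candidates, key=lambda j: backlogs[j])

# HDFS-style redundant backup: the block for subtask 0 lives on two hosts.
location = {0: {"host-1", "host-3"}}
backlogs = {"host-1": 3.0, "host-2": 0.0, "host-3": 1.0}
print(place_subtask(0, location, backlogs))  # -> "host-3"
```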
After the global coordinator has selected a suitable computing node for each Coflow, it notifies each sending node to send the flows of the Coflow to the corresponding computing node according to the placement scheme, and at the same time places the flows of the Coflow in the highest-priority queue.

S2, the global coordinator determines the priority of each Coflow according to the amount of data that Coflow has already sent, and distributes the priorities to the sending nodes.

Specifically, each sending node sends the flows of the Coflows in the high-priority queue to the corresponding computing nodes and reports the amount of data each Coflow has sent to the global coordinator, so that the coordinator can adjust Coflow priorities in time. If the amount of data a Coflow has sent exceeds the threshold of its current priority queue, the global coordinator demotes the Coflow to the next-lower priority queue, and sends the priority information of each Coflow to every sending node.
S3, each sending node schedules the Coflows in its local multi-level queues according to the priority information of each Coflow.

Specifically, each sending node receives the Coflow priority information sent by the global coordinator and schedules the Coflows in its local multi-level queues accordingly: a FIFO (First In First Out) scheduling mode is used within each level, and a weighted fair queue scheduling mode is used between levels.
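A minimal sketch of this two-part discipline follows. The weight values and names are hypothetical; the patent does not specify concrete weights for the levels:

```python
# Hypothetical sketch: each priority level is a FIFO of Coflows; bandwidth is
# split across non-empty levels by weighted fair queuing (higher priority,
# larger weight), and within a level the head-of-line Coflow is served first.
from collections import deque

def allocate_bandwidth(queues: list, weights: list, capacity: float) -> dict:
    """Share `capacity` among non-empty levels in proportion to their weight;
    each level's share goes to its FIFO head."""
    active = [lvl for lvl, q in enumerate(queues) if q]
    total = sum(weights[lvl] for lvl in active)
    return {lvl: capacity * weights[lvl] / total for lvl in active}

queues = [deque(["C1"]), deque(["C2", "C3"]), deque()]
weights = [4.0, 2.0, 1.0]  # weight decreases with decreasing priority
print(allocate_bandwidth(queues, weights, capacity=1.0))
# -> {0: 0.666..., 1: 0.333...}: C1 gets 2/3, C2 (FIFO head of level 1) gets 1/3
```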
S4, when a sending node produces idle space during multi-level queue scheduling, the global coordinator starts the flow scheduling of the low-priority queues ahead of time in that idle space.
Specifically, as shown in fig. 2, the data center of this embodiment, comprising 3 host nodes, is abstracted as one big non-blocking switch; in this switch model, each sending port and each receiving port is an abstraction of a host's traffic output and input ports. In the present example there are 3 Coflows, denoted C1 (grey), C2 (white), and C3 (black). C2 contains two data streams with a total size of 7; C1 and C3 each contain a single data stream, of sizes 2 and 3 respectively. C2 and C3 arrive at the same time, and C1 arrives after C2. The virtual queues to the left of the sending ports in fig. 2 express the source and destination endpoints of the data streams: at sending port 1, the data stream of C3 needs to send 3 units of data to receiving port 1, and one of C2's data streams needs to send 3 units of data to receiving port 3; at sending port 3, C2 and C1 need to send 4 units and 2 units of data to receiving port 2, respectively. Since the Coflows in this embodiment are relatively small, the thresholds of the multi-level queues are assumed here to be 1, 2, 4, 8, and so on. The Coflows of this embodiment are scheduled through steps S2-S3: high-priority queues are scheduled first and a FIFO policy is used inside each queue, so when the amount of data a Coflow in a high-priority queue has sent reaches the threshold, the Coflow is demoted into a lower-priority queue, and the Coflows in the low-priority queues are scheduled only after those in the high-priority queues have finished. As shown in fig. 2, C2 and C3 arrive at the same time on different ports, so they are effectively parallel within a queue; after the first time unit, sending port 1 and sending port 3 have each sent 1 unit of traffic, and C2 and C3 are placed into the next-level priority queue; when sending port 2 schedules C1, because both C3 and C2 at sending port 1 have already been demoted into the next-level priority queue, some idle space exists at sending port 1. The rest of the scheduling proceeds by analogy, as shown in Table 1.

Table 1: Coflow scheduling before the sending-port idle space is optimized

As can be seen from Table 1, while C1 is being scheduled there is a period during which sending port 1 sits idle, and in this case the CCT becomes large: (8+4+4)/3 = 5.33.

The global coordinator therefore directly starts scheduling the Coflows in the low-priority queue of sending port 1, exploiting the port's idle space so that the host node's sending port is fully utilized for scheduling the flows of the Coflows and the Coflow completion time is reduced. After the flow scheduling of the low-priority queue is started directly, the scheduling of this embodiment is as shown in Table 2, and the final CCT is (6+4+4)/3 = 4.66.

Table 2: Coflow scheduling after the sending-port idle space is optimized

When the Coflow scheduling of a low-priority queue at a sending port with idle space is started directly, the traffic within a same-level queue is still scheduled with the FIFO policy. In this embodiment, directly starting the flow scheduling of the low-priority queue reduces the Coflow completion time by 0.66.
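A minimal sketch of this work-conserving rule follows. The data structures, port numbers, and flow names are illustrative only, and the sketch assumes one plausible cause of idle space — every higher-priority flow being blocked at its destination port — rather than modeling the full embodiment above:

```python
# Hypothetical sketch of the idle-space optimization: strict priority would
# leave a send port idle when no higher-priority queue has a flow destined to
# a free receive port; instead, start the head flow of a lower-priority queue.

def next_flow(queues, port_busy):
    """Scan queues from high to low priority (FIFO within each level) and
    return the first flow whose destination receive port is free."""
    for level in queues:                     # highest priority first
        for flow in level:                   # FIFO order within the level
            if not port_busy[flow["dst"]]:
                return flow                  # fills what would be idle space
    return None                              # truly nothing can be sent

queues = [[{"id": "C2->r3", "dst": 3}], [{"id": "C3->r1", "dst": 1}]]
busy = {1: False, 3: True}                   # receive port 3 is occupied
print(next_flow(queues, busy))               # -> C3->r1, from the lower queue
```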
As shown in fig. 3, the present invention further provides a prior-knowledge-free Coflow multi-stage queue scheduling apparatus, which comprises a global coordinator arranged on a central host node and a plurality of scheduling modules communicatively connected to the global coordinator.

The global coordinator is used for generating a traffic placement scheme for the Coflows of jobs and notifying each sending node to send the flows of each Coflow to the corresponding computing node according to that scheme; it determines the priority of each Coflow according to the amount of data the Coflow has sent, analyzes whether the sending port of each sending node produces idle space during multi-stage queue scheduling, and directly starts the Coflow scheduling of the low-priority queues at sending ports with idle space.

Specifically, the global coordinator monitors whether each job generates a Coflow and selects suitable computing nodes for the flows in each Coflow using the traffic placement strategy, thereby generating the placement scheme. The placement strategy is: selecting, as the computing nodes of the Coflow, the nodes that already hold the data required to execute the job and/or the nodes whose receive ports carry the minimum network load. After selecting suitable computing nodes for the flows of each Coflow, the global coordinator determines the Coflows' priorities according to the order in which their flows arrive at the sending ports: the flows of the Coflow that reaches a sending port first are placed in the high-priority queue, and each sending node is notified to send the flows of the Coflow to the corresponding computing node. The global coordinator also collects, in real time, the amount of data each Coflow has sent; if that amount exceeds the threshold of the Coflow's current priority queue, the Coflow's priority is lowered. Meanwhile, the global coordinator analyzes whether the sending port of each sending node produces idle space during multi-stage queue scheduling; if so, the traffic of the low-priority queue is started directly in the idle space of that port, so that the idle space is fully utilized and the Coflow completion time is reduced.

The scheduling module, arranged on each host node, is used for sending the traffic information of each Coflow scheduled in that host node to the global coordinator and for scheduling the Coflows in the host node according to the priorities determined by the global coordinator.

Specifically, when the scheduling module schedules the Coflows in the host node, a FIFO scheduling mode is adopted within same-level queues and a weighted fair queue scheduling mode between queues of different levels.

The prior-knowledge-free Coflow multi-stage queue scheduling method provided by the invention thus adopts a Coflow traffic placement strategy to select a suitable computing node for the flows in each Coflow, placing them reasonably so as to reduce data transmission.
The prior-knowledge-free Coflow multi-stage queue scheduling method provided by the invention is named E-Aalo. Using the public Facebook data set, E-Aalo is compared with the existing prior-knowledge-based Coflow scheduling method Varys, the existing prior-knowledge-free method Aalo, and the traditional fair sharing (FS) method, to verify the effectiveness of E-Aalo. The data set consists of 526 Coflows arriving at different times, an actual workload provided by Facebook and synthesized from real-world data-intensive applications. FIG. 4 compares the average completion times obtained by the different scheduling methods on the same data set. The Varys method performs smallest-effective-bottleneck-first scheduling with known prior knowledge and therefore achieves the best result; the FS method divides the network bandwidth evenly across the flows; the Aalo method schedules Coflows without prior knowledge, judging each Coflow by the number of bytes it has currently sent, placing Coflows in queues of different priorities, and scheduling by priority — however, for Coflows with larger data volumes, Aalo is inferior to Varys for lack of prior knowledge. Thus Varys is the best of all the compared methods. The average completion time (CCT) of the 526 Coflows scheduled by FS is 70592.82 ms, the worst among the 4 methods; the CCT under Varys is 28528.46 ms; the CCT under Aalo is 46097.70 ms, 61.58% higher than that of Varys; and the CCT under E-Aalo is 40437.61 ms, 12.28% lower than that of Aalo. The E-Aalo method provided by the invention is essentially an improvement of the existing Aalo method: by adopting the traffic placement strategy and by starting the Coflow traffic of the low-priority queues early at idle sending ports, the Coflow completion time is reduced significantly.
In addition, the present invention also comprises scheduling equipment including a processor, a memory, and a computer program stored on the memory and executable on the processor; the processor executes the computer program to implement the prior-knowledge-free Coflow multi-stage queue scheduling method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (11)
1. A prior-knowledge-free Coflow multi-stage queue scheduling method, characterized in that, when executing a job that generates a Coflow, a host node that already holds the data required by the job and/or the host node with the minimum network load is selected as the computing node to execute the job; and when the flows in each Coflow are scheduled according to the priority order of the multi-level queues, if the sending port of a host node produces idle space, the idle space is preferentially used for flow scheduling.
2. The prior-knowledge-free Coflow multi-stage queue scheduling method according to claim 1, wherein the method for selecting the host node with the minimum network load as the computing node comprises: evaluating the accumulated to-be-processed data traffic at the receive port of each host node, and selecting the host node whose receive port has the smallest accumulated to-be-processed traffic as the computing node; the accumulated to-be-processed traffic at the receive port of host node j after the t-th time interval being computed as Arr_j^t = max(0, Arr_j^(t-1) + d_j^t − b_j^t), where d_j^t is the amount of data received by the receive port during the t-th time interval and b_j^t is the available bandwidth of host node j during that interval.
3. The prior-knowledge-free Coflow multi-stage queue scheduling method according to claim 1, wherein, when the host node schedules the flows of each Coflow according to the priority order of the multi-level queues, if the sending port of the host node produces idle space, the Coflow scheduling of the low-priority queues is directly started in that idle space.
4. The prior-knowledge-free Coflow multi-stage queue scheduling method according to claim 1, wherein the priority order of the multi-level queues is determined by the order in which the Coflows in each queue arrive at the sending port of the host node.
5. The prior-knowledge-free Coflow multi-stage queue scheduling method according to claim 4, wherein, when the host node schedules the flows of the Coflows in each queue according to the priority order of the multi-level queues, if the traffic already scheduled for a Coflow in a high-priority queue exceeds the threshold of the current priority queue, the priority of the Coflow and of the queue in which it resides is decreased.
6. The prior-knowledge-free Coflow multi-stage queue scheduling method according to claim 1, wherein, when the host node schedules the flows of each Coflow according to the priority order of the multi-level queues, a FIFO scheduling mode is adopted within each level of the multi-level queues and a weighted fair queue scheduling mode is adopted between different levels.
7. A scheduling apparatus for implementing the prior-knowledge-free Coflow multi-stage queue scheduling method according to any one of claims 1 to 6, comprising:
a global coordinator, arranged on a central host node, for selecting, via the Coflow traffic placement strategy, suitable computing nodes for the flows in the Coflow of a job so as to execute the job, for determining the priority of each Coflow, and for exploiting the idle space produced by the sending ports of the host nodes during Coflow scheduling;
and a scheduling module, arranged on each host node, for sending the traffic information of each Coflow scheduled in that host node to the global coordinator and for scheduling the Coflows in the host node according to the priorities determined by the global coordinator.
8. The scheduling apparatus according to claim 7, wherein the Coflow traffic placement policy is: selecting, as the computing nodes of the Coflow, the host nodes that already hold the data required to execute the job and/or the host nodes whose receive ports carry the minimum network load.
9. The scheduling apparatus according to claim 7, wherein the global coordinator keeps statistics of the scheduled traffic of each Coflow, and if the scheduled traffic of a Coflow exceeds the threshold of its current priority queue, the priority of the Coflow is decreased.
10. The scheduling apparatus according to claim 7, wherein, when the sending port of a host node produces idle space while scheduling the flows in the multi-level queues, the global coordinator directly starts the flow scheduling of the low-priority queues in that idle space.
11. Scheduling equipment comprising a plurality of communicatively interconnected host nodes, each host node comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the prior-knowledge-free Coflow multi-stage queue scheduling method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010895515.8A CN111966513B (en) | 2020-08-31 | 2020-08-31 | Multi-stage queue scheduling method and device without priori knowledge Coflow and scheduling equipment thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111966513A true CN111966513A (en) | 2020-11-20 |
CN111966513B CN111966513B (en) | 2024-08-09 |
Family
ID=73400092
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010895515.8A Active CN111966513B (en) | 2020-08-31 | 2020-08-31 | Multi-stage queue scheduling method and device without priori knowledge Coflow and scheduling equipment thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111966513B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190089645A1 (en) * | 2015-08-25 | 2019-03-21 | Shanghai Jiao Tong University | Dynamic Network Flows Scheduling Scheme in Data Center |
CN106453112A (en) * | 2016-08-10 | 2017-02-22 | 广州市香港科大霍英东研究院 | Method and server for processing coflow information in RPC communication |
CN106656858A (en) * | 2016-08-10 | 2017-05-10 | 广州市香港科大霍英东研究院 | Scheduling method based on co-current flow, and server |
CN106533981A (en) * | 2016-12-19 | 2017-03-22 | 北京邮电大学 | Multi-attribute based big data flow scheduling method and device |
CN110321223A (en) * | 2019-07-03 | 2019-10-11 | 湖南大学 | The data flow division methods and device of Coflow work compound stream scheduling perception |
CN110708259A (en) * | 2019-09-25 | 2020-01-17 | 江苏省未来网络创新研究院 | Information-agnostic Coflow scheduling system capable of automatically adjusting queue threshold and scheduling method thereof |
CN112468414A (en) * | 2020-11-06 | 2021-03-09 | 国网电力科学研究院有限公司 | Cloud computing multistage scheduling method, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111966513B (en) | 2024-08-09 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |