CN113128668B - Link scheduling method considering high throughput and fairness in data center network - Google Patents
- Publication number
- CN113128668B (Application CN202110372715.XA)
- Authority
- CN
- China
- Prior art keywords
- neural network
- connected neural
- fully
- coflow
- fairness
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a link scheduling method that takes both high throughput and fairness into account in a data center network, comprising the following steps. S1: for a given time, given the number n of coflows and the size d_kj of the normalized flow of each coflow C_k on the link, implement an internal scheduler with a multi-layer fully-connected neural network. S2: adaptively adjust the multi-layer fully-connected neural network via meta-learning, i.e., initialize the weights φ of the network with the meta-parameter θ obtained in the meta-training stage, and update them in the meta-test stage to obtain the final multi-layer fully-connected neural network. S3: use the network obtained in step S2 to perform parallel distributed job resource scheduling on the input coflows, obtaining the final resource scheduling scheme. The invention can make a smooth trade-off between fairness and efficiency.
Description
Technical Field
The invention relates to the field of high-performance computing, and in particular to a link scheduling method that achieves both high throughput and fairness in a data center network.
Background
In many parallel distributed jobs (e.g., MapReduce), there is a communication phase in which a set of flows transfers data between two sets of machines, and the amount of data each flow needs to transfer is known before the flow begins. Chowdhury et al. proposed the concept of Coflow as an abstraction for such a set of parallel flows between two sets of machines that share associated semantics and a common objective [Chowdhury, Mosharaf & Stoica, Ion. (2012). Coflow: a networking abstraction for cluster applications. 31-36. 10.1145/2390231.2390237]. Coflow captures the various communication patterns of computing applications, enabling applications to more easily convey their communication semantics to the network, and thus enabling the network to better optimize common communication patterns. These associated semantics allow the network to perform different actions on this set of flows to achieve a common final goal.
We consider a network model in which the entire data center fabric is abstracted into a non-blocking switch connecting all machines, as shown in fig. 1. We focus only on its ingress (upload channels) and egress (download channels). This network model abstraction is quite simple, but it is compatible with the development and application of full-bisection-bandwidth topologies [22,32] and the multi-tenant model, so it is practical and has attracted the attention of many researchers in this field. Assuming that each coflow has a corresponding flow in the corresponding upload and download channels and that each coflow comes from a different application, the coflow scheduler should ideally provide isolation guarantees so that each coflow is allocated bandwidth resources relatively fairly (i.e., guaranteeing fairness), in order to improve the application-level network performance of today's data centers. On the other hand, network operators strive to reduce the coflow completion time (CCT). However, optimal isolation guarantees and minimum average CCT are contradictory goals and cannot be optimized at the same time. Existing coflow schedulers therefore either optimize isolation guarantees at the expense of long CCT (e.g., HUG [Mosharaf Chowdhury, Zhenhua Liu, Ali Ghodsi, and Ion Stoica. 2016. HUG: multi-resource fairness for correlated and elastic demands. In Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation (NSDI '16). USENIX Association, USA, 407-424]), or reduce average CCT without performance isolation (e.g., Varys [M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient coflow scheduling with Varys," in ACM SIGCOMM, 2014] and Aalo [M. Chowdhury and I. Stoica, "Efficient coflow scheduling without prior knowledge," in Proc. of ACM SIGCOMM, 2015]).
Coflex [W. Wang, S. Ma, B. Li and B. Li, "Coflex: navigating the fairness-efficiency tradeoff for coflow scheduling," IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, Atlanta, GA, USA, 2017, pp. 1-9, doi: 10.1109/INFOCOM.2017.8057172] navigates a smooth trade-off between these two contradictory goals: it allows the network operator to specify the required level of isolation guarantee with a single balance parameter α while reducing the average CCT.
However, while Coflex considers both the goal of optimizing isolation guarantees and that of reducing average CCT, its effectiveness depends largely on the choice of the balance parameter α, and that work does not investigate how to choose a good α value. In addition, since the coflows received by the network in different time periods exhibit different patterns (such as the size distribution of the flows), a constant balance parameter α in the Coflex model cannot adapt to coflows whose patterns change continuously, which is a significant drawback in practical applications.
Disclosure of Invention
The invention provides a link scheduling method that takes both high throughput and fairness into account in a data center network and can smoothly trade off fairness against efficiency.
In order to achieve the above purpose of the present invention, the following technical scheme is adopted: a link scheduling method taking both high throughput and fairness into account in a data center network, comprising the following steps:
S1: for a given time, given the number n of coflows and the size d_kj of the normalized flow of each coflow C_k on the link, implement an internal scheduler with a multi-layer fully-connected neural network;
S2: adaptively adjust the multi-layer fully-connected neural network via meta-learning, i.e., initialize the weights φ of the multi-layer fully-connected neural network with the meta-parameter θ obtained in the meta-training stage, and update them in the meta-test stage to obtain the final multi-layer fully-connected neural network;
S3: use the multi-layer fully-connected neural network obtained in step S2 to perform parallel distributed job resource scheduling on the input coflows, obtaining the final resource scheduling scheme.
Preferably, the fully-connected neural network deterministically produces the output y = f_φ(x) from its parameters, where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network;
further, the input of the fully-connected neural network is e j D corresponding to each coflow kj Is equal to x=e j ||[d 1j ,d 2j ,...d kj ]Wherein, I is a spliced symbol, which means that the front vector and the rear vector are connected into one vector, e j Representing the one-hot encoding of port j.
Still further, the output of the fully-connected neural network is the bandwidth share r_kj allocated on port j to each coflow, i.e., y = [r_1j, r_2j, ..., r_nj].
Still further, the meta-training stage is specifically as follows:
S201: take the task set {T_i} as the training set, and fix the update step number K, the inner learning rate α, and the outer learning rate β;
S202: randomly initialize the meta-parameter θ; then, for each task, first initialize the parameters φ of the fully-connected neural network with θ;
S203: then use the data pairs in the task to perform K update steps, recording the gradients along the way;
S204: finally, update the meta-parameter θ according to the recorded gradients.
Still further, the meta-test stage is specifically as follows:
D1: input a task T;
D2: initialize the parameters φ with the meta-parameter θ obtained during meta-training, obtaining a preliminary fully-connected neural network;
D3: perform a K-step update according to the task T to obtain the final fully-connected neural network.
The beneficial effects of the invention are as follows:
the invention utilizes a multi-layer fully-connected neural network to realize the internal scheduler, and adopts element learning to carry out self-adaptive adjustment on the multi-layer fully-connected neural network, thereby constructing a coflow fairness element scheduler. The invention can make smooth trade-off between fairness and high efficiency. Meanwhile, compared with other models, the invention does not need to manually specify the superparameters which have larger influence on the final performance of the model, so that the problem of model performance degradation caused by the specification of poorer balance parameters is avoided. The invention adopts a meta learning framework, can adaptively adjust the parameters of the fully connected neural network, further adapt to the coflow with continuously changed modes, and keep stable performance, and cannot be changed once the balance parameters are set like Coflex.
Drawings
Fig. 1 is a prior art network abstraction diagram.
Fig. 2 is a flow chart of the steps of the method according to the present embodiment.
Fig. 3 is a flowchart of the meta training phase described in this embodiment.
Fig. 4 is a flowchart of the meta-test phase described in this embodiment.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
In order to achieve a reasonable trade-off between fairness and efficiency and to cope with coflow patterns that may change continuously in practical applications, this embodiment provides an end-to-end meta-learning-based coflow fairness meta-scheduler (MFCS), comprising an internal scheduler implemented by a multi-layer fully-connected network and a meta-learning framework built on top of it. The parallel distributed job resource scheduling is implemented as follows:
as shown in fig. 2, a link scheduling method for achieving both high throughput and fairness in a data center network includes the following steps:
S1: for a given time, given the number n of coflows and the size d_kj of the normalized flow of each coflow C_k on the link, implement an internal scheduler with a multi-layer fully-connected neural network;
S2: adaptively adjust the multi-layer fully-connected neural network via meta-learning, i.e., initialize the weights φ of the multi-layer fully-connected neural network with the meta-parameter θ obtained in the meta-training stage, and update them in the meta-test stage to obtain the final multi-layer fully-connected neural network;
S3: use the multi-layer fully-connected neural network obtained in step S2 to perform parallel distributed job resource scheduling on the input coflows, obtaining the final resource scheduling scheme.
In a specific embodiment, the fully-connected neural network deterministically produces the output y = f_φ(x) from its parameters, where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network;
in a specific embodiment, the fully-connected nerveThe input of the network is e j D corresponding to each coflow kj Is equal to x=e j ||[d 1j ,d 2j ,...d kj ]Wherein, I is a spliced symbol, which means that the front vector and the rear vector are connected into one vector, e j Indicating the one-hot code corresponding to port j.
In a specific embodiment, the output of the fully-connected neural network is the bandwidth share r_kj allocated on port j to each coflow, i.e., y = [r_1j, r_2j, ..., r_nj].
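This input/output mapping can be illustrated with a small sketch (the layer sizes, the ReLU and softmax choices, and all function names are illustrative assumptions, not the patented network):

```python
import numpy as np

def scheduler_forward(weights, port_onehot, flow_sizes):
    """One forward pass of a toy 2-layer fully-connected scheduler.

    Input x = e_j || [d_1j, ..., d_nj]; output = softmax bandwidth
    shares r_kj for the n coflows on port j (they sum to 1).
    """
    x = np.concatenate([port_onehot, flow_sizes])
    W1, b1, W2, b2 = weights
    h = np.maximum(0.0, W1 @ x + b1)         # ReLU hidden layer
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())      # softmax -> valid shares
    return exp / exp.sum()

rng = np.random.default_rng(0)
n_ports, n_coflows, hidden = 4, 3, 8
dim_in = n_ports + n_coflows
weights = (rng.normal(size=(hidden, dim_in)), np.zeros(hidden),
           rng.normal(size=(n_coflows, hidden)), np.zeros(n_coflows))

e_j = np.eye(n_ports)[1]                     # one-hot encoding of port j = 1
d_j = np.array([0.2, 0.5, 0.3])              # normalized flow sizes d_kj
r = scheduler_forward(weights, e_j, d_j)     # bandwidth shares r_kj
print(r, r.sum())
```

A softmax output layer is one simple way to guarantee that the allocated shares on a port are non-negative and sum to the full link bandwidth.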
In a specific embodiment, since the coflow pattern may change over time, the parameters φ of our multi-layer fully-connected neural network model (i.e., the internal scheduler) f_φ should also be updated adaptively. Training the multi-layer fully-connected neural network means updating the network weights φ by gradient descent so as to find better weights as the parameters of the final multi-layer fully-connected network. Meta-learning searches for a weight θ that is already close to the optimal solution for the multi-layer fully-connected neural network; when θ is used to initialize φ, the model already lies near the optimum, so it achieves good performance after only fine-tuning.
Specifically, the meta learning is divided into two stages: meta-training stage and meta-testing stage.
The meta training stage specifically comprises the following steps:
S201: take the task set {T_i} as the training set, and fix the update step number K, the inner learning rate α, and the outer learning rate β;
S202: randomly initialize the meta-parameter θ; then, for each task, first initialize the parameters φ of the fully-connected neural network with θ; in this embodiment the assignment can be direct, i.e., let φ = θ to complete the initialization;
S203: then use the data pairs in the task to perform K update steps, recording the gradients along the way; one iteration of the update of the parameters φ is:
φ ← φ − α ∇_φ L_x(φ, θ)
where ∇_φ L_x denotes the gradient of the loss function L with respect to the weights φ of the fully-connected neural network, and L_x(φ, θ) denotes the total loss function of the fully-connected neural network, which combines the average completion time τ of the input coflows with the fairness metric IG (isolation guarantee);
S204: finally, the meta-parameter θ is updated according to the recorded gradients, via the outer update θ ← θ − β Σ_T ∇_θ L_x(φ_T, θ) over the training tasks.
the program corresponding to the meta training stage is expressed as follows:
in a specific embodiment, the meta-test phase is specifically as follows:
D1: input a task T;
D2: initialize the parameters φ with the meta-parameter θ obtained during meta-training, obtaining a preliminary fully-connected neural network;
D3: perform a K-step update according to the task T to obtain the final fully-connected neural network.
The program corresponding to the meta-test stage is expressed as follows:
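As with meta-training, the meta-test stage (D1 to D3) can be sketched in Python; the toy task gradient and all names are assumptions rather than the patented procedure:

```python
import numpy as np

def meta_test(theta, task_grad_fn, K=5, alpha=0.1):
    """Meta-test stage sketch (D1-D3): start from the learned meta-parameter
    theta and take K gradient steps on the new task T."""
    phi = theta.copy()                 # D2: initialize phi with theta
    for _ in range(K):                 # D3: K-step update on task T
        phi = phi - alpha * task_grad_fn(phi)
    return phi

theta = np.array([1.0])                    # assumed result of meta-training
task_T = lambda p: 2.0 * (p - 1.2)         # D1: new task with optimum at 1.2
phi = meta_test(theta, task_T)
print(phi)   # moved from 1.0 toward the task optimum 1.2
```

Because θ already lies near the new task's optimum, a small fixed budget of K updates suffices; this matches the K = 5 used in the concrete embodiment below.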
the embodiment further provides a specific embodiment based on the parallel distributed job resource scheduling method described above: the number of tasks of the input training set is 100, the number of tasks of the test set is 1, and the number of update steps k=5. The meta training process adopted by the MFCS model is shown in fig. 3, and the updated meta parameter theta is finally output at the stage; the meta-test procedure is shown in FIG. 4, and the meta-parameters θ obtained in the meta-training phase are used in this stage to initialize the parameters of the model MFCSParameters of the MFCS model are +.>And 5, updating to obtain a final MFCS model, and finally, carrying out resource scheduling on the input coflow by using the final MFCS model to output a final resource scheduling scheme.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of its embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within the protection scope of the following claims.
Claims (5)
1. A link scheduling method for considering both high throughput and fairness in a data center network is characterized in that: the method comprises the following steps:
S1: for a given time, given the number n of coflows and the size d_kj of the normalized flow of each coflow C_k on the link, implement an internal scheduler with a multi-layer fully-connected neural network;
S2: adaptively adjust the multi-layer fully-connected neural network via meta-learning, i.e., initialize the weights φ of the multi-layer fully-connected neural network with the meta-parameter θ obtained in the meta-training stage, and update them in the meta-test stage to obtain the final multi-layer fully-connected neural network;
S3: use the multi-layer fully-connected neural network obtained in step S2 to perform parallel distributed job resource scheduling on the input coflows, obtaining the final resource scheduling scheme;
the meta training stage specifically comprises the following steps:
S201: take the task set {T_i} as the training set, and fix the update step number K, the inner learning rate α, and the outer learning rate β;
S202: randomly initialize the meta-parameter θ; then, for each task, first initialize the parameters φ of the fully-connected neural network with θ; the assignment may be direct, i.e., let φ = θ to complete the initialization;
S203: then use the data pairs in the task to perform K update steps, recording the gradients along the way; one iteration of the update of the parameters φ is:
φ ← φ − α ∇_φ L_x(φ, θ)
where ∇_φ L_x denotes the gradient of the loss function L with respect to the weights φ of the fully-connected neural network, and L_x(φ, θ) denotes the total loss function of the fully-connected neural network, which combines the average completion time τ of the input coflows with the fairness metric IG (isolation guarantee);
S204: finally, the meta-parameter θ is updated according to the recorded gradients, via the outer update θ ← θ − β Σ_T ∇_θ L_x(φ_T, θ) over the training tasks.
2. The link scheduling method taking both high throughput and fairness into account in a data center network according to claim 1, characterized in that: the fully-connected neural network deterministically produces the output y = f_φ(x) from its parameters, where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network.
3. The link scheduling method taking both high throughput and fairness into account in a data center network according to claim 2, characterized in that: the input of the fully-connected neural network is the concatenation of e_j and the d_kj of each coflow, i.e., x = e_j || [d_1j, d_2j, ..., d_nj], where || is the concatenation symbol, meaning that the two vectors are joined into one, and e_j represents the one-hot encoding of port j.
4. The link scheduling method taking both high throughput and fairness into account in a data center network according to claim 3, characterized in that: the output of the fully-connected neural network is the bandwidth share r_kj allocated on port j to each coflow, i.e., y = [r_1j, r_2j, ..., r_nj].
5. The link scheduling method taking both high throughput and fairness into account in a data center network according to claim 1, characterized in that the meta-test stage is specifically as follows:
D1: input a task T;
D2: initialize the parameters φ with the meta-parameter θ obtained during meta-training, obtaining a preliminary fully-connected neural network;
D3: perform a K-step update according to the task T to obtain the final fully-connected neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110372715.XA CN113128668B (en) | 2021-04-07 | 2021-04-07 | Link scheduling method considering high throughput and fairness in data center network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113128668A (en) | 2021-07-16
CN113128668B (en) | 2023-07-25
Family
ID=76775243
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110372715.XA Active CN113128668B (en) | 2021-04-07 | 2021-04-07 | Link scheduling method considering high throughput and fairness in data center network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113128668B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114666283B (en) * | 2022-03-07 | 2023-11-24 | 国家电网有限公司信息通信分公司 | Application-aware multi-tenant Coflow scheduling method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165724A (en) * | 2018-08-06 | 2019-01-08 | 哈工大大数据(哈尔滨)智能科技有限公司 | A kind of gradient neural network based decline the number of iterations prediction technique and device |
CN109190795A (en) * | 2018-08-01 | 2019-01-11 | 中山大学 | A kind of interregional Travel Demand Forecasting method and device |
CN110443364A (en) * | 2019-06-21 | 2019-11-12 | 深圳大学 | A kind of deep neural network multitask hyperparameter optimization method and device |
CN110636020A (en) * | 2019-08-05 | 2019-12-31 | 北京大学 | Neural network equalization method for adaptive communication system |
CN111353582A (en) * | 2020-02-19 | 2020-06-30 | 四川大学 | Particle swarm algorithm-based distributed deep learning parameter updating method |
Non-Patent Citations (1)
Title |
---|
Short-term bus passenger flow prediction based on an improved convolutional neural network; Chen Shenjin, Xue Yang; Computer Science (Issue 05), pp. 1-4 * |
Also Published As
Publication number | Publication date |
---|---|
CN113128668A (en) | 2021-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tang et al. | Joint multiuser DNN partitioning and computational resource allocation for collaborative edge intelligence | |
Towles et al. | Guaranteed scheduling for switches with configuration overhead | |
US9584430B2 (en) | Traffic scheduling device | |
Serpanos et al. | Architecture of network systems | |
CN111030835B (en) | Task scheduling model of TTFC network and message scheduling table generation method | |
CN110990140B (en) | Method for scheduling distributed machine learning flow in photoelectric switching network | |
Li et al. | Leveraging endpoint flexibility when scheduling coflows across geo-distributed datacenters | |
CN113128668B (en) | Link scheduling method considering high throughput and fairness in data center network | |
CN111131080B (en) | Distributed deep learning flow scheduling method, system and equipment | |
CN111865668A (en) | Network slicing method based on SDN and NFV | |
Bhowmik et al. | Distributed control plane for software-defined networks: A case study using event-based middleware | |
Li et al. | Endpoint-flexible coflow scheduling across geo-distributed datacenters | |
CN109936505B (en) | Method and apparatus in data-centric software-defined networks | |
CN115033359A (en) | Internet of things agent multi-task scheduling method and system based on time delay control | |
CN116680070A (en) | Edge-end cooperative deep neural network training method based on hybrid pipeline parallelism | |
Zhang et al. | Minimizing coflow completion time in optical circuit switched networks | |
Zhao et al. | Joint reducer placement and coflow bandwidth scheduling for computing clusters | |
Duan et al. | Mercury: A simple transport layer scheduler to accelerate distributed DNN training | |
He et al. | Beamer: stage-aware coflow scheduling to accelerate hyper-parameter tuning in deep learning clusters | |
CN113946455A (en) | Multi-stage feedback queue flow scheduling method based on bottleneck perception | |
Zhang et al. | Reco: Efficient regularization-based coflow scheduling in optical circuit switches | |
Kogan et al. | Towards software-defined buffer management | |
Takahashi et al. | Unisonflow: A software-defined coordination mechanism for message-passing communication and computation | |
CN112422651A (en) | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning | |
WO2020245864A1 (en) | Distributed processing system and distributed processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||