CN113128668B - Link scheduling method balancing high throughput and fairness in a data center network - Google Patents


Publication number
CN113128668B
CN113128668B
Authority
CN
China
Prior art keywords
neural network
fully-connected neural network
coflow
fairness
Legal status
Active
Application number
CN202110372715.XA
Other languages
Chinese (zh)
Other versions
CN113128668A (en)
Inventor
沈鸿
黄宏斌
Current Assignee
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date: 2021-04-07
Filing date: 2021-04-07
Application filed by Sun Yat-sen University
Priority to CN202110372715.XA
Publication of CN113128668A: 2021-07-16
Application granted
Publication of CN113128668B: 2023-07-25

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods


Abstract

The invention discloses a link scheduling method balancing high throughput and fairness in a data center network, which comprises the following steps. S1: for a given moment, given the number n of coflows and the size d_kj of the normalized flow on the link corresponding to each coflow C_k, implement an internal scheduler using a multi-layer fully-connected neural network. S2: adaptively adjust the multi-layer fully-connected neural network by meta-learning, i.e., initialize the weights φ of the network with the meta-parameter θ obtained in the meta-training stage, then update them in the meta-test stage to obtain the final multi-layer fully-connected neural network. S3: use the network obtained in step S2 to perform parallel distributed job resource scheduling on the input coflows, obtaining the final resource scheduling scheme. The invention can make a smooth trade-off between fairness and efficiency.

Description

Link scheduling method balancing high throughput and fairness in a data center network
Technical Field
The invention relates to the field of high-performance computing, and in particular to a link scheduling method that balances high throughput and fairness in a data center network.
Background
In many parallel distributed jobs (e.g., MapReduce), there is a communication phase in which a set of flows transfers data between two sets of machines, and the amount of data each flow needs to transfer is known before the flow begins. Chowdhury et al. proposed the concept of Coflow as an abstraction for such a set of parallel flows between two groups of machines that share associated semantics and a common goal [Chowdhury, Mosharaf & Stoica, Ion. (2012). Coflow: a networking abstraction for cluster applications. 31-36. 10.1145/2390231.2390237]. Coflow captures the various communication patterns of computing applications, allowing applications to more easily convey their communication semantics to the network, and thereby allowing the network to better optimize common communication patterns. These associated semantics allow the network to act differently on this set of flows to achieve a common final goal.
We consider a network model in which the entire data center fabric is abstracted as one non-blocking switch connecting all machines, as shown in Fig. 1. We focus only on its ingress (upload channels) and egress (download channels). This abstraction is quite simple, but it is consistent with the development and adoption of full-bisection-bandwidth topologies [22,32] and the multi-tenant model, so it is practical and has attracted the attention of many researchers in this field. Assuming that each coflow has a corresponding flow in the corresponding upload and download channels and that each coflow comes from a different application, an ideal coflow scheduler should provide an isolation guarantee at least as good as fair coflow scheduling, so that each coflow is allocated bandwidth resources relatively fairly (i.e., fairness is guaranteed), in order to improve the application-level network performance of today's data centers. On the other hand, network operators strive to reduce the coflow completion time (CCT). However, an optimal isolation guarantee and a minimum average CCT are contradictory goals and cannot be optimized simultaneously. Existing coflow schedulers therefore either optimize the isolation guarantee at the expense of long CCTs (e.g., HUG [Mosharaf Chowdhury, Zhenhua Liu, Ali Ghodsi, and Ion Stoica. 2016. HUG: multi-resource fairness for correlated and elastic demands. In Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation (NSDI '16). USENIX Association, USA, 407-424]) or reduce the average CCT without performance isolation (e.g., Varys [M. Chowdhury, Y. Zhong, and I. Stoica, "Efficient coflow scheduling with Varys," in ACM SIGCOMM, 2014] and Aalo [M. Chowdhury and I. Stoica, "Efficient coflow scheduling without prior knowledge," in Proc. of ACM SIGCOMM, 2015]). Only Coflex [W. Wang, S. Ma, B. Li and B. Li, "Coflex: navigating the fairness-efficiency tradeoff for coflow scheduling," IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, Atlanta, GA, USA, 2017, pp. 1-9, doi:10.1109/INFOCOM.2017.8057172] makes a trade-off between these two contradictory goals: it allows the network operator to specify the required level of isolation guarantee with a single balance parameter α while reducing the average CCT.
However, although Coflex considers both goals of optimizing the isolation guarantee and reducing the average CCT, its effectiveness largely depends on the choice of the balance parameter α, and that work does not investigate how to choose a better α value. In addition, since the coflows received by the network in different time periods have different patterns (such as the size distribution of the flows), a constant balance parameter α in the Coflex model cannot adapt to coflows whose patterns keep changing, which is a great disadvantage in practical applications.
Disclosure of Invention
The invention provides a link scheduling method that balances high throughput and fairness in a data center network and can make a smooth trade-off between fairness and efficiency.
To achieve the above purpose of the present invention, the following technical scheme is adopted: a link scheduling method balancing high throughput and fairness in a data center network, comprising the following steps:
S1: for a given moment, given the number n of coflows and the size d_kj of the normalized flow on the link corresponding to each coflow C_k, implement an internal scheduler using a multi-layer fully-connected neural network;
S2: adaptively adjust the multi-layer fully-connected neural network by meta-learning, namely initialize the weights φ of the multi-layer fully-connected neural network with the meta-parameter θ obtained in the meta-training stage, then update them in the meta-test stage to obtain the final multi-layer fully-connected neural network;
S3: perform parallel distributed job resource scheduling on the input coflows with the multi-layer fully-connected neural network obtained in step S2, to obtain the final resource scheduling scheme.
Preferably, the fully-connected neural network, based on its parameters φ, uniquely and deterministically outputs y = f_φ(x), where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network;
Further, the input of the fully-connected neural network is the concatenation of e_j and the d_kj corresponding to each coflow, i.e. x = e_j || [d_1j, d_2j, ..., d_kj], where || is the concatenation operator that joins the two vectors into one, and e_j denotes the one-hot encoding of port j.
Still further, the output of the fully-connected neural network, y, is the bandwidth share r_kj allocated to each coflow on port j, i.e. y = [r_1j, r_2j, ..., r_kj].
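For illustration, a minimal Python/PyTorch sketch of such an internal scheduler follows. The hidden-layer widths and the softmax that normalizes the bandwidth shares to sum to one on each port are illustrative assumptions; the patent itself specifies only the input encoding x = e_j || [d_1j, ..., d_kj] and the output y = [r_1j, ..., r_kj].

```python
# A sketch of the internal scheduler: a multi-layer fully-connected network
# mapping the one-hot port encoding e_j concatenated with the normalized
# flow sizes d_kj to per-coflow bandwidth shares r_kj on port j.
# Layer sizes and softmax normalization are assumptions, not from the patent.
import torch
import torch.nn as nn

class InternalScheduler(nn.Module):
    def __init__(self, num_ports: int, num_coflows: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_ports + num_coflows, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_coflows),
        )

    def forward(self, e_j: torch.Tensor, d_j: torch.Tensor) -> torch.Tensor:
        x = torch.cat([e_j, d_j], dim=-1)          # x = e_j || [d_1j, ..., d_kj]
        return torch.softmax(self.net(x), dim=-1)  # shares sum to 1 per port

# Usage: port 2 of 4, three coflows with normalized sizes on that port.
sched = InternalScheduler(num_ports=4, num_coflows=3)
e_j = torch.eye(4)[2]
d_j = torch.tensor([0.5, 0.3, 0.2])
r_j = sched(e_j, d_j)   # y = [r_1j, r_2j, r_3j]
```

A softmax output layer is one simple way to enforce that the shares r_kj on a port sum to one; the patent does not commit to a particular normalization.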
Still further, the meta-training stage is specifically as follows:
S201: input the tasks {T_i} as the training set, together with the number of update steps K, the internal learning rate α, and the external learning rate β;
S202: randomly initialize the meta-parameter θ; then, for each task, initialize the parameters φ of the fully-connected neural network with θ each time;
S203: then use the data pairs in the task to perform K update steps, recording the gradients in the process;
S204: finally, update the meta-parameter θ according to the recorded gradients.
Still further, the meta-test stage is specifically as follows:
D1: input a task T;
D2: initialize the parameters φ with the meta-parameter θ obtained during meta-training, obtaining a preliminary fully-connected neural network;
D3: perform K update steps according to task T to obtain the final fully-connected neural network.
The beneficial effects of the invention are as follows:
the invention utilizes a multi-layer fully-connected neural network to realize the internal scheduler, and adopts element learning to carry out self-adaptive adjustment on the multi-layer fully-connected neural network, thereby constructing a coflow fairness element scheduler. The invention can make smooth trade-off between fairness and high efficiency. Meanwhile, compared with other models, the invention does not need to manually specify the superparameters which have larger influence on the final performance of the model, so that the problem of model performance degradation caused by the specification of poorer balance parameters is avoided. The invention adopts a meta learning framework, can adaptively adjust the parameters of the fully connected neural network, further adapt to the coflow with continuously changed modes, and keep stable performance, and cannot be changed once the balance parameters are set like Coflex.
Drawings
Fig. 1 is a prior art network abstraction diagram.
Fig. 2 is a flow chart of the steps of the method according to the present embodiment.
Fig. 3 is a flowchart of the meta training phase described in this embodiment.
Fig. 4 is a flowchart of the meta-test phase described in this embodiment.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
To achieve a reasonable trade-off between fairness and efficiency and to cope with the coflow patterns that may keep changing in practical applications, this embodiment provides an end-to-end meta-learning-based coflow fairness meta-scheduler (MFCS), which comprises an internal scheduler implemented by a multi-layer fully-connected network and a meta-learning framework built on top of it. The parallel distributed job resource scheduling is implemented as follows:
as shown in fig. 2, a link scheduling method for achieving both high throughput and fairness in a data center network includes the following steps:
s1: for the same time, the number of coflow is n and each coflow C k Size d of the corresponding normalized stream on the link kj An internal scheduler is realized by adopting a multi-layer fully-connected neural network;
s2: self-adaptive adjustment of the multi-layer fully-connected neural network based on element learning, namely initializing the weight of the multi-layer fully-connected neural network by the element parameter theta obtained in the element training stageUpdating by using a meta-test set stage to obtain a final multi-layer fully-connected neural network;
s3: and (3) performing parallel distributed job resource scheduling on the input coflow by utilizing the multi-layer fully-connected neural network obtained in the step (S2) to obtain a final resource scheduling scheme.
In a specific embodiment, the fully-connected neural network, based on its parameters φ, uniquely and deterministically outputs y = f_φ(x), where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network.
In a specific embodiment, the input of the fully-connected neural network is the concatenation of e_j and the d_kj corresponding to each coflow, i.e. x = e_j || [d_1j, d_2j, ..., d_kj], where || is the concatenation operator that joins the two vectors into one, and e_j denotes the one-hot encoding corresponding to port j.
In a specific embodiment, the output of the fully-connected neural network, y, is the bandwidth share r_kj allocated to each coflow on port j, i.e. y = [r_1j, r_2j, ..., r_kj].
In a specific embodiment, since the pattern of coflows may change over time, the parameters φ of our multi-layer fully-connected neural network model (i.e., the internal scheduler) f_φ should also be updated adaptively. Training the multi-layer fully-connected neural network means updating the network weights φ by gradient descent, so as to find better weights as the parameters of the final multi-layer fully-connected network. Meta-learning searches for a weight vector θ that is close to the optimal solution of the multi-layer fully-connected neural network; when θ is used to initialize φ, the model already lies in the vicinity of the optimal solution, so it can achieve good performance after only a little fine-tuning.
Specifically, meta-learning is divided into two stages: the meta-training stage and the meta-test stage.
The meta-training stage specifically comprises the following steps:
S201: input the tasks {T_i} as the training set, together with the number of update steps K, the internal learning rate α, and the external learning rate β;
S202: randomly initialize the meta-parameter θ, and then, for each task, initialize the parameters φ of the fully-connected neural network with θ each time; in this embodiment the assignment can be made directly, i.e. let φ = θ to finish the initialization;
S203: then use the data pairs in the task to perform K update steps, recording the gradients in the process; the update formula for one iteration of the parameters φ is as follows:
φ ← φ - α∇_φ L_x(φ, θ)
where ∇_φ L_x(φ, θ) denotes the gradient of the loss function L with respect to the weights φ of the fully-connected neural network; ∇ denotes the derivative operator; and L_x(φ, θ) denotes the total loss function of the fully-connected neural network, computed from τ, the average completion time of the input coflows, and IG, the isolation-guarantee fairness metric;
s204: finally, the meta-parameter theta is updated according to the obtained gradient, and the updating formula is as follows:
the program corresponding to the meta training stage is expressed as follows:
in a specific embodiment, the meta-test phase is specifically as follows:
d1: inputting a task T;
d2: initializing parameters based on meta-parameters θ obtained during meta-trainingObtaining a preliminary fully-connected neural network;
d3: and carrying out K-step updating according to the task T, and further obtaining the final fully-connected neural network.
The program corresponding to the meta-test stage is expressed as follows:
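A matching sketch of the meta-test stage (D1-D3), under the same assumptions and hypothetical interfaces as the meta-training sketch above:

```python
# A minimal sketch of the meta-test stage (D1-D3).
def meta_test(model, task, theta, loss_fn, K=5, alpha=0.01):
    model.load_state_dict(theta)                  # D2: initialize phi with theta
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for batch in task.batches(K):                 # D3: K update steps on task T
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()
    return model                                  # final fully-connected network
```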
the embodiment further provides a specific embodiment based on the parallel distributed job resource scheduling method described above: the number of tasks of the input training set is 100, the number of tasks of the test set is 1, and the number of update steps k=5. The meta training process adopted by the MFCS model is shown in fig. 3, and the updated meta parameter theta is finally output at the stage; the meta-test procedure is shown in FIG. 4, and the meta-parameters θ obtained in the meta-training phase are used in this stage to initialize the parameters of the model MFCSParameters of the MFCS model are +.>And 5, updating to obtain a final MFCS model, and finally, carrying out resource scheduling on the input coflow by using the final MFCS model to output a final resource scheduling scheme.
It should be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention is intended to be protected by the following claims.

Claims (5)

1. A link scheduling method balancing high throughput and fairness in a data center network, characterized in that the method comprises the following steps:
S1: for a given moment, given the number n of coflows and the size d_kj of the normalized flow on the link corresponding to each coflow C_k, implementing an internal scheduler using a multi-layer fully-connected neural network;
S2: adaptively adjusting the multi-layer fully-connected neural network by meta-learning, namely initializing the weights φ of the multi-layer fully-connected neural network with the meta-parameter θ obtained in the meta-training stage, and then updating them in the meta-test stage to obtain the final multi-layer fully-connected neural network;
S3: performing parallel distributed job resource scheduling on the input coflows with the multi-layer fully-connected neural network obtained in step S2, to obtain the final resource scheduling scheme;
the meta training stage specifically comprises the following steps:
s201: task is carried outAs training set, updating step number K, internal learning rate alpha and external learning rate beta;
s202: randomly initializing element parameters theta, and then for each task, initializing parameters of the fully-connected neural network with theta each timeThe present embodiment may be directly assigned, i.e. let +.>Finishing initialization;
s203: and then use the data pairs in the taskPerforming K-step updating, and recording gradients in the process;
parameters (parameters)The update formula of one iteration update of (a) is as follows:
in the method, in the process of the invention,representing the weight of the loss function L with respect to the fully connected neural network>Is a gradient of (2); sign->Representing a derivative function; l (L) x And (phi, theta) represents the total loss function of the fully connected neural network, and the calculation formula is as follows:
wherein τ represents the average completion time of the input coflow; IG represents a fairness metric IG;
s204: finally, the meta-parameter theta is updated according to the obtained gradient, and the updating formula is as follows:
2. The link scheduling method balancing high throughput and fairness in a data center network according to claim 1, characterized in that: the fully-connected neural network, based on its parameters φ, uniquely and deterministically outputs y = f_φ(x), where x is the input of the fully-connected neural network and φ is the weight of the fully-connected neural network.
3. The link scheduling method balancing high throughput and fairness in a data center network according to claim 2, characterized in that: the input of the fully-connected neural network is the concatenation of e_j and the d_kj corresponding to each coflow, i.e. x = e_j || [d_1j, d_2j, ..., d_kj], where || is the concatenation operator that joins the two vectors into one, and e_j denotes the one-hot encoding of port j.
4. The link scheduling method balancing high throughput and fairness in a data center network according to claim 3, characterized in that: the output of the fully-connected neural network, y, is the bandwidth share r_kj allocated to each coflow on port j, i.e. y = [r_1j, r_2j, ..., r_kj].
5. The link scheduling method balancing high throughput and fairness in a data center network according to claim 1, characterized in that the meta-test stage is specifically as follows:
D1: inputting a task T;
D2: initializing the parameters φ with the meta-parameter θ obtained during meta-training, to obtain a preliminary fully-connected neural network;
D3: performing K update steps according to task T, to obtain the final fully-connected neural network.
CN202110372715.XA, filed 2021-04-07 (priority 2021-04-07): Link scheduling method balancing high throughput and fairness in a data center network. Active. Granted as CN113128668B (en).

Priority Applications (1)

Application Number: CN202110372715.XA | Priority Date: 2021-04-07 | Filing Date: 2021-04-07 | Title: Link scheduling method balancing high throughput and fairness in a data center network


Publications (2)

Publication Number | Publication Date
CN113128668A (en) | 2021-07-16
CN113128668B (en) | 2023-07-25

Family

ID=76775243

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202110372715.XA (Active, granted as CN113128668B) | Link scheduling method balancing high throughput and fairness in a data center network | 2021-04-07 | 2021-04-07

Country Status (1)

Country: CN; Link: CN113128668B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN114666283B * | 2022-03-07 | 2023-11-24 | 国家电网有限公司信息通信分公司 (Information & Telecommunication Branch of State Grid Corporation of China) | Application-aware multi-tenant Coflow scheduling method and system


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN109190795A * | 2018-08-01 | 2019-01-11 | 中山大学 | Inter-regional travel demand forecasting method and device
CN109165724A * | 2018-08-06 | 2019-01-08 | 哈工大大数据(哈尔滨)智能科技有限公司 | Neural-network-based method and device for predicting the number of gradient descent iterations
CN110443364A * | 2019-06-21 | 2019-11-12 | 深圳大学 | Multi-task hyperparameter optimization method and device for deep neural networks
CN110636020A * | 2019-08-05 | 2019-12-31 | 北京大学 | Neural network equalization method for adaptive communication systems
CN111353582A * | 2020-02-19 | 2020-06-30 | 四川大学 | Particle swarm algorithm-based distributed deep learning parameter updating method

Non-Patent Citations (1)

Title
Short-term bus passenger flow prediction based on an improved convolutional neural network; Chen Shenjin, Xue Yang; Computer Science, No. 5; pp. 1-4 *

Also Published As

Publication number Publication date
CN113128668A (en) 2021-07-16


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant