CN111274036A - Deep learning task scheduling method based on speed prediction - Google Patents

Deep learning task scheduling method based on speed prediction

Info

Publication number
CN111274036A
CN111274036A
Authority
CN
China
Prior art keywords
task
speed
training
tasks
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010068852.XA
Other languages
Chinese (zh)
Other versions
CN111274036B (en)
Inventor
曹春
马晓星
徐经纬
李青坪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202010068852.XA priority Critical patent/CN111274036B/en
Publication of CN111274036A publication Critical patent/CN111274036A/en
Application granted granted Critical
Publication of CN111274036B publication Critical patent/CN111274036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a deep learning task scheduling method based on speed prediction, which comprises two parts: speed model construction and task scheduling. The speed model construction part builds a neural network model to predict the image-processing speed of each task running in a cluster, and comprises a training stage and a prediction stage. In the training stage, each task is first profiled: its training speed under different distributed configurations in the cluster is collected to build a data set; the features of each task trained in the cluster are then collected, and a speed model is constructed and trained on the data set built in the previous step. The prediction stage is integrated into the task scheduling part. The task scheduling part uses the speed model to predict the training speed of tasks under different configurations and determines the cluster's resource allocation with a customized simulated annealing algorithm, thereby making effective use of cluster resources.

Description

Deep learning task scheduling method based on speed prediction
Technical Field
The invention relates to a scheduling method for deep learning tasks based on speed prediction, and in particular to a scheduling method for distributed deep learning training tasks that jointly considers resource allocation and task placement.
Background
Deep learning simulates a human neural network to make a series of complex decisions or predictions. As deep learning application scenarios grow more diverse, neural network models become more complex, and the data sets required to train them grow larger, so traditional single-GPU training can hardly support complex models. Distributed Deep Learning (DDL) aims to improve training efficiency by using multiple GPUs to accelerate the training process, so that a complex model can finish training and enter service quickly. However, a GPU cluster may host multiple DDL training tasks, and an unreasonable resource allocation prevents each task from training at its fastest speed, hurting overall training efficiency. It is therefore important to study how to schedule DDL tasks so that cluster resources are used effectively.
Parameter server architecture
In distributed deep learning, the parameter server architecture is responsible for parameter synchronization among multiple working nodes. It has two kinds of nodes: parameter server nodes (PS) and computing nodes (Worker). The PS stores the global model parameters, receives the gradients pushed by each Worker, applies the updates, and lets each Worker pull the updated parameters. Each Worker stores a local copy of the global parameters, processes its share of the data set, and pushes the computed gradients to the PS; after the PS finishes updating the parameters, the Worker pulls them back and starts the next iteration. The parameter server architecture was proposed in 2010, and in 2014 Mu Li brought it to wide public attention. The architecture implements inter-node communication at the bottom layer and provides a transparent interface: a user only needs to write PS-side and Worker-side code against the interface and supply a global configuration to start distributed training with multiple GPUs, which improves the training efficiency of complex models.
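As an illustration of the push/update/pull cycle described above, here is a minimal single-process sketch in Python. All class and function names are our own, not from the patent, and a real parameter server runs the PS and Workers on separate machines; this only simulates one synchronous data-parallel loop.

```python
class ParameterServer:
    """Holds the global model parameters and applies pushed gradients."""
    def __init__(self, dim, lr=0.1):
        self.params = [0.0] * dim
        self.lr = lr

    def push(self, gradients):
        # Average the gradients pushed by all Workers, then update parameters.
        n = len(gradients)
        for i in range(len(self.params)):
            avg = sum(g[i] for g in gradients) / n
            self.params[i] -= self.lr * avg

    def pull(self):
        # Workers fetch the updated parameters before the next iteration.
        return list(self.params)

def worker_gradient(params, shard):
    # Toy gradient of a squared-error objective sum((p - x)^2) on the shard.
    return [sum(2.0 * (p - x) for x in shard) / len(shard) for p in params]

# One hundred synchronous training iterations with two Workers.
ps = ParameterServer(dim=1)
shards = [[1.0, 2.0], [3.0, 4.0]]           # each Worker owns a data shard
for _ in range(100):
    replicas = [ps.pull() for _ in shards]  # each Worker pulls a local copy
    grads = [worker_gradient(r, s) for r, s in zip(replicas, shards)]
    ps.push(grads)                          # PS aggregates and updates

print(round(ps.params[0], 6))  # 2.5, the mean of all data points
```

The fixed point of the averaged update is the minimizer of the combined objective, so both Workers converge to the same global parameters even though each only ever sees its own shard.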
Cluster scheduler
The cluster scheduler is mainly responsible for cluster resource management, task placement, and monitoring the execution state of tasks. Traditional schedulers do not customize scheduling to the characteristics of deep learning tasks, so in a deep learning cluster it is difficult for them to maximize resource utilization. Both academia and industry have proposed many solutions for deep learning tasks. Optimus is a scheduler that aims to minimize the average completion time of tasks: it fits a mathematical model to predict the training speed of DDL tasks under the parameter server architecture, and fits another model to predict the number of epochs a task still needs to complete training. When scheduling, it uses the speed model and the remaining-epoch model to compute which task gains the most (i.e., reduces its training time the most) from one additional PS or Worker, and greedily assigns PSs and Workers to tasks until an additional PS or Worker brings no gain or the cluster's resources are exhausted. Gaia, a scheduler open-sourced by Tencent, exploits GPU point-to-point communication and, by detecting the PCIe connection type between GPUs (SYS, PIX, etc.), ensures that DDL tasks train under high bandwidth and low latency. However, none of these methods accounts for the effect of task placement on resource allocation, and in fact the former has a large effect on the latter. For example, if a task is allocated two Workers, the training effect differs depending on whether the two Workers are placed on the same node or on different nodes.
It is therefore necessary to consider resource allocation and task placement jointly in the scheduling method.
Disclosure of Invention
The purpose of the invention is as follows: existing scheduling methods do not consider the influence of task placement when allocating resources to a DDL task. The common approach is to allocate resources to the DDL task based on heuristic rules or on some model, and then place the allocated resources on the nodes based on other heuristic rules (for example, balancing node load, or packing tasks as tightly as possible). Such methods substitute a chosen heuristic for the actual influence of task placement on resource allocation, and can hardly achieve an optimal scheduling effect in real scenarios. Aiming at these shortcomings, the invention provides a deep learning task scheduling method based on speed prediction, with the advantages of accurate speed prediction, efficient resource allocation, and fast task training completion.
The technical scheme is as follows: a scheduling method for deep learning tasks based on speed prediction, comprising a speed model construction stage and a task scheduling stage:
Speed model construction stage
(1) Speed model data set construction: collect the training speed of each task from its training state in the cluster, and prepare a data set for training the speed model;
(2) implementation of the speed model: build a deep-learning-based trainer for the speed model, whose input is the data set constructed in the previous step and whose output is the training speed of a task (i.e., the speed at which it processes images);
Task scheduling stage
(1) Resource allocation and task placement: the scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations (i.e., different resource allocation amounts and placement nodes) through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively;
(2) task operation: after the scheduler has computed a configuration for each task, it schedules the tasks to run in the cluster and monitors their running state.
In the speed model data set construction of the speed model construction stage: tasks run in the cluster under different configurations and reach different training speeds; the training speed of a task under different configurations is obtained by sampling from all of its possible running configurations in the cluster, thereby constructing a data set.
Beneficial effects: compared with the prior art, the deep learning task scheduling method based on speed prediction uses the speed model to schedule resource allocation and task placement jointly, so that each task can complete training at a higher speed. Compared with traditional DDL task scheduling methods, it offers high accuracy, more reasonable resource allocation, and less interference between running tasks, and can be applied to the scheduling of deep learning clusters. The method has the following advantages:
(1) It solves the problem of resource allocation in a multi-task environment.
In a cluster, multi-task scheduling has always been a key research direction, since whether cluster resource utilization can be maximized determines the cost paid by users. In deep learning, the GPU is the usual computing resource for training, and GPUs are very expensive, so users want to maximize the utilization of GPU resources. The method provided by the invention performs multi-task GPU scheduling well, improves GPU utilization, and solves the GPU resource allocation problem.
(2) High accuracy of speed prediction.
The invention constructs a neural network model to predict the training speed of DDL tasks and includes the task's placement among the features of the speed model, so the training speed of a DDL task can be predicted under different placements with high accuracy.
(3) Less interference between running tasks.
Because one feature of the speed model is the number of tasks running on each node, the influence of other tasks on a given task is reflected through this feature, so the interference a task suffers from other tasks while running can be reduced.
(4) High portability.
The invention is built on the Kubernetes resource manager and runs each task in a Docker container. The speed model can use any of several open-source deep learning frameworks (MXNet/TensorFlow/PyTorch) and can be computed on a CPU or a GPU.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram of the deep neural network model for speed prediction.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in FIG. 1, the scheduling method of the deep learning task based on speed prediction uses a customized simulated annealing algorithm to schedule resource allocation and task placement jointly: taking the sum of the speeds of all tasks as the evaluation function, the simulated annealing search finds a better configuration, thereby using cluster resources effectively. The method comprises a speed model construction stage and a task scheduling stage:
■ Speed model construction stage
1. Speed model data set construction
Based on the different configurations under which tasks run in the cluster, the features of the speed model are obtained as the input features of the speed model trainer. The parameter synchronization architecture of distributed deep learning adopted by the invention is the parameter server architecture, and the features of the speed model comprise the number of parameter servers (PS), the number of computing nodes (Worker), the type of model used by the task, the batch_size, the placement of the Workers on the nodes, and the number of other tasks already running on each node. Based on these features, a data set is sampled from all possible cases of task training in the cluster.
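One possible encoding of the listed features into a single sample vector is sketched below. The field names, the one-hot model vocabulary, and the exact layout are our own assumptions; the patent names the features but not how they are arranged.

```python
# Assumed vocabulary of model types for a one-hot encoding (illustrative).
MODEL_TYPES = ["resnet50", "vgg16", "inception3"]

def encode_config(num_ps, num_workers, model_type, batch_size,
                  workers_per_node, other_tasks_per_node):
    """Flatten one task configuration into a feature vector.

    workers_per_node[i]     : how many of this task's Workers sit on node i
    other_tasks_per_node[i] : how many other tasks already run on node i
    """
    model_onehot = [1.0 if m == model_type else 0.0 for m in MODEL_TYPES]
    return ([float(num_ps), float(num_workers)]
            + model_onehot
            + [float(batch_size)]
            + [float(w) for w in workers_per_node]
            + [float(t) for t in other_tasks_per_node])

# A task with 2 PS, 3 Workers, ResNet-50, batch_size 32 on a 4-node cluster,
# Workers placed as [2, 1, 0, 0] and other-task counts [1, 0, 2, 0]:
x = encode_config(2, 3, "resnet50", 32, [2, 1, 0, 0], [1, 0, 2, 0])
print(len(x))  # 2 + 3 + 1 + 4 + 4 = 14 features
```

Because the per-node placement and the per-node count of co-located tasks are part of the vector, two configurations with identical resource amounts but different placements map to different inputs, which is what lets the model distinguish their speeds.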
2. Implementation of the speed model
A deep-learning-based trainer for the speed model is built: its input is the data set constructed in the previous step, and its output is the training speed of the task (i.e., the speed at which it processes images). A Sequential model with a two-layer fully-connected network is constructed using the TensorFlow Keras deep learning framework. The input of the model is a sample from the data set, with the sample features supplied as a vector, and the label is the training speed of the task (the speed is normalized during training, which helps improve accuracy). The model iterates with the back-propagation algorithm, using the ReLU function as the activation function and Adam as the optimizer, until the error between the predicted speed and the true value on the test data set is within the expected range, completing the deep learning training process.
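Such a model (Sequential, two fully-connected layers, ReLU activation, Adam optimizer) might be sketched in TensorFlow Keras as follows. The hidden-layer width of 64 and the feature count of 14 are assumptions, since the patent does not specify them.

```python
import tensorflow as tf

NUM_FEATURES = 14  # assumed: PS count, Worker count, model one-hot,
                   # batch_size, per-node Worker placement, per-node task counts

# Two-layer fully-connected network: one hidden ReLU layer and a linear
# output for the (normalized) training speed in images per second.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
# Adam optimizer; back-propagation runs inside fit() against an MSE loss.
model.compile(optimizer="adam", loss="mse")
print(model.output_shape)  # (None, 1)
```

Training would then be `model.fit(X, y_normalized, ...)` on the sampled data set, with the held-out test error checked against the expected range before the model is handed to the scheduler.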
■ Task scheduling stage
1. Resource allocation and task placement
The scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations (i.e., different resource allocation amounts and placement nodes) through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively. The whole process comprises the following steps: (1) construct a task queue, in which all tasks submitted by users wait to be scheduled; (2) at the start of each scheduling period, the scheduler takes all tasks out of the scheduling queue and generates an initial configuration for them, which includes allocating PSs and Workers to each task and generating a set of initial placement nodes for the allocated tasks; (3) the scheduler then adjusts the configuration of each task with the customized simulated annealing algorithm, so that each task can train at a higher speed; the goal is to make the average completion time of all tasks as short as possible while keeping the cluster's resource utilization at a high level.
When the initial configuration is generated, each task is initially allocated one PS and one Worker, each of which occupies a certain amount of CPU, GPU, and memory resources, until every task has one PS and one Worker or the remaining resources in the cluster are insufficient. This uniform resource allocation makes it easier for the configuration adjustment phase to converge to a better configuration. After the initial configuration is set, the customized simulated annealing algorithm is applied: the initial configuration gives the current solution, a set of new solutions is generated by exploring neighborhoods, and the evaluation functions of the current and new solutions are compared; if the new solution satisfies the criterion it is accepted, otherwise it is accepted with a certain probability. The customized simulated annealing algorithm generates neighborhoods in three ways, namely allocation of a Worker, exchange of Workers, and transfer of a Worker, and adjusts task configurations through these neighborhoods to reach a better configuration. The allocation and placement of PSs are not in the search space of the simulated annealing algorithm: by default the number of PSs is set slightly larger than the number of Workers, and the PSs are placed evenly on the nodes that contain Workers; if cluster resources are insufficient, the number of PSs is reduced, keeping the difference between the numbers of PSs and Workers of each task no larger than a given value.
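The acceptance rule above ("accept if better, otherwise accept with a certain probability") is the standard Metropolis criterion of simulated annealing. A minimal Python version for a maximizing objective might look like this; the function name is illustrative.

```python
import math
import random

def accept(current_value, new_value, temperature):
    """Metropolis acceptance for a maximizing objective
    (here: the sum of predicted task speeds).

    Better solutions are always accepted; worse ones are accepted with
    probability exp(delta / T), so at a high temperature the search can
    escape local optima, and at a low temperature it settles down.
    """
    delta = new_value - current_value
    if delta >= 0:
        return True
    return random.random() < math.exp(delta / temperature)

print(accept(10.0, 12.0, 5.0))   # True: a better configuration
print(accept(12.0, 10.0, 1e-3))  # False: worse, and T is nearly zero
```

Lowering the temperature on a fixed schedule turns the search from near-random exploration into near-greedy hill climbing, which is why the text can stop once the temperature drops below a threshold.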
Among the three neighborhood moves, allocation of a Worker means assigning one more Worker from the cluster's remaining resources to a task; exchange of Workers means that, for two different tasks, the nodes on which one Worker of each is placed are swapped, exploring the predicted training effect under different placements; transfer of a Worker means moving one Worker from one task to another. The scheduler explores each neighborhood with the sum of the predicted speeds of all tasks as the evaluation function; when the simulated annealing temperature drops below a certain value, the search ends and a set of better configurations is obtained. Meanwhile, to prevent slow-training tasks from starving for a long time, the invention sets a waiting-time threshold for each task; if a task's waiting time exceeds the threshold, it is allocated at least one PS and one Worker.
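The three neighborhood moves can be sketched as operations on a per-task list of Worker placements. The data layout and names are our own; the patent specifies the moves themselves, not the representation.

```python
import random

# Cluster state: for each task, the list of nodes hosting one Worker each.
def allocate_worker(placements, free_nodes, task):
    """Allocation: give `task` one more Worker from the cluster's free pool."""
    node = random.choice(free_nodes)
    placements[task].append(node)

def swap_workers(placements, task_a, task_b):
    """Exchange: swap the placement node of one Worker between two tasks."""
    i = random.randrange(len(placements[task_a]))
    j = random.randrange(len(placements[task_b]))
    placements[task_a][i], placements[task_b][j] = \
        placements[task_b][j], placements[task_a][i]

def transfer_worker(placements, task_from, task_to):
    """Transfer: move one Worker from one task to another."""
    i = random.randrange(len(placements[task_from]))
    node = placements[task_from].pop(i)
    placements[task_to].append(node)

random.seed(1)
placements = {"jobA": ["node1", "node1"], "jobB": ["node2"]}
transfer_worker(placements, "jobA", "jobB")
print(sorted(len(v) for v in placements.values()))  # [1, 2]
```

Note that allocation grows the total Worker count while exchange and transfer keep it fixed, so the annealing search can both redistribute the current resources and claim idle ones.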
2. Task execution
After the configuration of each task has been computed, the scheduler deploys the tasks into the cluster, running each task in a container with Kubernetes as the cluster resource manager. Tasks with an effective configuration (i.e., tasks allocated a PS and a Worker) are submitted to the Kubernetes cluster to start training, and a checkpoint is set for each task to store its model parameters. After the tasks are started, the scheduler periodically checks the health of each task and restarts any task that crashes; if a new task joins the queue or a task completes training, a new round of scheduling is started, so that every task completes training smoothly.
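The periodic health-check-and-restart loop might be structured as follows. This is a toy sketch: `check_health`, `restart`, and `reschedule` stand in for the Kubernetes-backed operations the patent describes and are purely illustrative.

```python
import time

def run_scheduling_loop(tasks, check_health, restart, reschedule,
                        interval=30, max_rounds=1):
    """Periodically check each running task; restart crashed ones and
    trigger a new scheduling round when a task finishes."""
    for _ in range(max_rounds):
        finished = []
        for task in tasks:
            status = check_health(task)
            if status == "crashed":
                restart(task)           # bring a failed task back up
            elif status == "completed":
                finished.append(task)
        if finished:                    # a finished task frees resources,
            for t in finished:          # so start a new scheduling round
                tasks.remove(t)
            reschedule(tasks)
        if max_rounds > 1:
            time.sleep(interval)

# Toy drive: one healthy, one crashed, one completed task.
events = []
run_scheduling_loop(
    tasks=["t1", "t2", "t3"],
    check_health=lambda t: {"t1": "running", "t2": "crashed",
                            "t3": "completed"}[t],
    restart=lambda t: events.append(("restart", t)),
    reschedule=lambda ts: events.append(("reschedule", tuple(ts))),
)
print(events)  # [('restart', 't2'), ('reschedule', ('t1', 't2'))]
```

A production version would query the Kubernetes API for pod status instead of the injected callbacks, but the control flow (check, restart, reschedule on completion or arrival) is the same as described.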
In summary, the invention can be applied to deep learning cluster resource scheduling, with great benefit to improving resource utilization and saving cost. The method accurately predicts the training speed of a task under different configurations, so scheduling decisions rest on reliable results. Because the technique is built on the Kubernetes cluster resource manager, users can reproduce it conveniently without dealing with the inconveniences of the underlying environment, and it is easy to deploy and maintain. Meanwhile, the speed prediction model can be implemented with various deep learning frameworks, so users can choose the framework they know best. The technique therefore has high application value.

Claims (10)

1. A scheduling method of a deep learning task based on speed prediction, characterized by comprising a speed model construction stage and a task scheduling stage:
speed model construction stage
(1) speed model data set construction: collecting the training speed of each task from its training state in the cluster, in preparation for training the speed model;
(2) implementation of the speed model: building a deep-learning-based trainer for the speed model, whose input is the data set constructed in the previous step and whose output is the training speed of a task;
task scheduling stage
(1) resource allocation and task placement: the scheduler considers resource allocation and task placement jointly, predicts the training speed of each task under different configurations through the speed model, and determines the optimal configuration for each task, so as to use cluster resources effectively; wherein different configurations means different resource allocation amounts and placement nodes;
(2) task operation: after the scheduler has computed a configuration for each task, it schedules the tasks to run in the cluster and monitors their running state.
2. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the speed model data set construction of the speed model construction stage: tasks run in the cluster under different configurations and reach different training speeds; the training speed of a task under different configurations is obtained by sampling from all of its possible running configurations in the cluster, thereby constructing a data set.
3. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein the features of the speed model are obtained based on the different configurations under which tasks run in the cluster, and serve as the input features of the speed model trainer; the parameter synchronization architecture of distributed deep learning adopted by the method is the parameter server architecture, and the features of the speed model comprise the number of parameter servers (PS), the number of computing nodes (Worker), the type of model used by the task, the batch size, the placement of the Workers on the nodes, and the number of other tasks already running on each node; the speed model is based on deep learning, and a two-layer fully-connected network is constructed to predict the training speed of the task.
4. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the implementation of the speed model in the speed model construction stage: a Sequential model with a two-layer fully-connected network is constructed using the TensorFlow Keras deep learning framework; the input of the model is a sample from the data set, with the sample features supplied as a vector, and the label of the model is the training speed of the task.
5. The method for scheduling a deep learning task based on speed prediction according to claim 4, wherein the model iterates with the back-propagation algorithm and uses Adam as the optimizer, until the error between the speed predicted by the model and the true value on the test data set is within the expected range, thereby completing the deep learning training process.
6. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the resource allocation and task placement of the task scheduling stage: a task queue is constructed, in which all tasks submitted by users wait to be scheduled; at the start of each scheduling period, the scheduler takes all tasks out of the scheduling queue and generates an initial configuration for them, which includes allocating PSs and Workers to each task and generating a set of initial placement nodes for the allocated tasks; the scheduler then adjusts the configuration of each task with a customized simulated annealing algorithm, so that each task can train at a higher speed, with the goal of making the average completion time of all tasks as short as possible.
7. The method for scheduling a deep learning task based on speed prediction according to claim 6, wherein the initial configuration is generated as follows: initially, each task is allocated one PS and one Worker, each of which occupies CPU, GPU, and memory resources, until every task has one PS and one Worker or the remaining resources in the cluster are insufficient.
8. The method for scheduling a deep learning task based on speed prediction according to claim 6, wherein the task configuration is adjusted as follows: a customized simulated annealing algorithm is used, in which the initial configuration gives the current solution, a set of new solutions is generated by exploring neighborhoods, and the evaluation functions of the current and new solutions are compared; if the new solution satisfies the criterion it is accepted, otherwise it is accepted with a preset probability; the customized simulated annealing algorithm generates neighborhoods in three ways, namely allocation of a Worker, exchange of Workers, and transfer of a Worker, and adjusts task configurations through these neighborhoods to reach a better configuration; the allocation and placement of PSs are not in the search space of the simulated annealing algorithm: by default the number of PSs is set larger than the number of Workers, and the PSs are placed evenly on the nodes that contain Workers; if cluster resources are insufficient, the number of PSs is reduced, keeping the difference between the numbers of PSs and Workers of each task no larger than a given value.
9. The scheduling method of a deep learning task based on speed prediction according to claim 8, wherein among the three neighborhood moves of the simulated annealing algorithm, allocation of a Worker means assigning one more Worker from the cluster's remaining resources to a task; exchange of Workers means that, for two different tasks, the nodes on which one Worker of each is placed are swapped, exploring the predicted training effect under different placements; transfer of a Worker means moving one Worker from one task to another; the scheduler explores each neighborhood with the sum of the predicted speeds of all tasks as the evaluation function, and when the simulated annealing temperature drops below a preset value, the search ends and a set of better configurations is obtained; meanwhile, to prevent slow-training tasks from starving for a long time, a waiting-time threshold is set for each task, and if a task's waiting time exceeds the threshold, it is allocated at least one PS and one Worker.
10. The method for scheduling a deep learning task based on speed prediction according to claim 1, wherein in the task operation of the task scheduling stage: after the configuration of each task has been computed, the scheduler deploys the tasks into the cluster, running each task in a container with Kubernetes as the cluster resource manager; tasks with an effective configuration are submitted to the Kubernetes cluster to start training, and a checkpoint is set for each task to store its model parameters; after the tasks are started, the scheduler periodically checks the health of each task and restarts any task that crashes; if a new task joins the queue or a task completes training, a new round of scheduling is started, so that every task completes training smoothly; a task with an effective configuration is a task that has been allocated a PS and a Worker.
CN202010068852.XA 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction Active CN111274036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068852.XA CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010068852.XA CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Publications (2)

Publication Number Publication Date
CN111274036A true CN111274036A (en) 2020-06-12
CN111274036B CN111274036B (en) 2023-11-07

Family

ID=70997634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068852.XA Active CN111274036B (en) 2020-01-21 2020-01-21 Scheduling method of deep learning task based on speed prediction

Country Status (1)

Country Link
CN (1) CN111274036B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101536A (en) * 2020-08-30 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Lightweight distributed multi-task collaboration framework
CN112433819A (en) * 2020-11-30 2021-03-02 中国科学院深圳先进技术研究院 Heterogeneous cluster scheduling simulation method and device, computer equipment and storage medium
CN112463340A (en) * 2020-12-10 2021-03-09 武汉工程大学 Tensorflow-based multi-task flexible scheduling method and system
CN112667591A (en) * 2021-01-12 2021-04-16 北京工业大学 Data center task interference prediction method based on mass logs
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113282411A (en) * 2021-05-19 2021-08-20 复旦大学 Distributed neural network training system based on edge equipment
WO2022048557A1 (en) * 2020-09-07 2022-03-10 华为云计算技术有限公司 Ai model training method and apparatus, and computing device and storage medium
CN116812427A (en) * 2023-08-28 2023-09-29 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN117349775A (en) * 2023-10-30 2024-01-05 浙江大学 Cluster computing-oriented abnormal subtask identification method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446816A (en) * 2015-11-11 2016-03-30 华南理工大学 Heterogeneous platform oriented energy consumption optimization scheduling method
CN106529682A (en) * 2016-10-28 2017-03-22 北京奇虎科技有限公司 Method and apparatus for processing deep learning task in big-data cluster
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 A kind of extensive resource scheduling system and method based on deep learning neutral net
CN108694090A (en) * 2018-04-16 2018-10-23 江苏润和软件股份有限公司 A kind of cloud computing resource scheduling method of Based on Distributed machine learning
CN108920259A (en) * 2018-03-30 2018-11-30 华为技术有限公司 Deep learning job scheduling method, system and relevant device
CN109710406A (en) * 2018-12-21 2019-05-03 腾讯科技(深圳)有限公司 Data distribution and its model training method, device and computing cluster
CN110413391A (en) * 2019-07-24 2019-11-05 上海交通大学 Deep learning task service method for ensuring quality and system based on container cluster
WO2019226652A1 (en) * 2018-05-22 2019-11-28 Pure Storage, Inc. Auto-scaling a software application
CN110533183A (en) * 2019-08-30 2019-12-03 东南大学 The model partition and task laying method of heterogeneous network perception in a kind of assembly line distribution deep learning


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101536A (en) * 2020-08-30 2020-12-18 西南电子技术研究所(中国电子科技集团公司第十研究所) Lightweight distributed multi-task collaboration framework
WO2022048557A1 (en) * 2020-09-07 2022-03-10 华为云计算技术有限公司 Ai model training method and apparatus, and computing device and storage medium
CN112433819A (en) * 2020-11-30 2021-03-02 中国科学院深圳先进技术研究院 Heterogeneous cluster scheduling simulation method and device, computer equipment and storage medium
CN112433819B (en) * 2020-11-30 2024-04-19 中国科学院深圳先进技术研究院 Simulation method and device for heterogeneous cluster scheduling, computer equipment and storage medium
CN112463340A (en) * 2020-12-10 2021-03-09 武汉工程大学 Tensorflow-based multi-task flexible scheduling method and system
CN112667591A (en) * 2021-01-12 2021-04-16 北京工业大学 Data center task interference prediction method based on mass logs
CN113033098B (en) * 2021-03-26 2022-05-17 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113033098A (en) * 2021-03-26 2021-06-25 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113282411A (en) * 2021-05-19 2021-08-20 复旦大学 Distributed neural network training system based on edge equipment
CN113282411B (en) * 2021-05-19 2022-03-22 复旦大学 Distributed neural network training system based on edge equipment
CN116812427A (en) * 2023-08-28 2023-09-29 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN116812427B (en) * 2023-08-28 2023-11-14 河北因朵科技有限公司 Automatic file taking and archiving control system and method for unmanned warehouse
CN117349775A (en) * 2023-10-30 2024-01-05 浙江大学 Cluster computing-oriented abnormal subtask identification method and device
CN117349775B (en) * 2023-10-30 2024-04-26 浙江大学 Cluster computing-oriented abnormal subtask identification method and device

Also Published As

Publication number Publication date
CN111274036B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN111756812B (en) Energy consumption perception edge cloud cooperation dynamic unloading scheduling method
Wang et al. Distributed machine learning with a serverless architecture
CN108958916B (en) Workflow unloading optimization method under mobile edge environment
Peng et al. Dl2: A deep learning-driven scheduler for deep learning clusters
CN104636204B (en) A kind of method for scheduling task and device
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
CN115248728A (en) Distributed training task scheduling method, system and device for intelligent computing
CN110058924A (en) A kind of container dispatching method of multiple-objection optimization
CN113794748B (en) Performance-aware service function chain intelligent deployment method and device
CN112685153A (en) Micro-service scheduling method and device and electronic equipment
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN115237581A (en) Heterogeneous computing power-oriented multi-strategy intelligent scheduling method and device
Bian et al. Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters
CN115543626A (en) Power defect image simulation method adopting heterogeneous computing resource load balancing scheduling
CN114356578A (en) Parallel computing method, device, equipment and medium for natural language processing model
CN109976873A (en) The scheduling scheme acquisition methods and dispatching method of containerization distributed computing framework
CN116996941A (en) Calculation force unloading method, device and system based on cooperation of cloud edge ends of distribution network
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN110928683B (en) Edge computing resource allocation method based on two types of intensive virtual machines
CN108053026A (en) A kind of mobile application background request adaptive scheduling algorithm
Oberthür et al. Flexible resource management for self-x systems: An evaluation
CN113656494A (en) Synchronization method and system of parameter server and readable storage medium
CN113821313A (en) Task scheduling method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant