WO2021115082A1 - Job scheduling method and job scheduling device (作业调度方法以及作业调度装置) - Google Patents

Job scheduling method and job scheduling device

Info

Publication number
WO2021115082A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
tasks
candidate
candidate node
job
Prior art date
Application number
PCT/CN2020/129971
Other languages
English (en)
French (fr)
Inventor
徐华
陈明龙
包小明
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202010407994.4A (CN113037800B)
Application filed by 华为技术有限公司
Priority to EP20899083.8A (published as EP4057142A4)
Publication of WO2021115082A1
Priority to US17/835,143 (published as US20220300323A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/502Proximity

Definitions

  • This application relates to the field of network communication technology, and more specifically, to a job scheduling method and a job scheduling device.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence: to perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that reacts in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of such intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
  • Deep learning has made breakthrough progress in image, voice and other fields, mainly due to the acquisition of massive data, the continuous optimization of algorithms, and the continuous growth of computing power.
  • Deep learning currently refers mainly to deep neural network models. As network models become more complex and data volumes grow, the amount of computation required for model training becomes extremely large.
  • Distributed training is therefore usually used to meet the timeliness requirements of jobs that have network transmission requirements, for example, AI training jobs.
  • When a distributed training method is adopted, different jobs may compete for the same hardware resources; a scheduler is therefore needed to schedule hardware resources for the jobs of multiple users, allocating suitable nodes (for example, servers) to each job to run the tasks that the job includes.
  • However, current schedulers usually allocate nodes based only on a task's hardware resource requirements, ignoring the network performance requirements of AI training jobs. For example, in AI training there are network transmission requirements between multiple tasks of the same job; because the existing technology ignores these requirements, AI training jobs run inefficiently.
  • the present application provides a job scheduling method and a job scheduling device, which can shorten the running time of a target job and improve the running efficiency of the target job.
  • In a first aspect, a job scheduling method is provided, including: receiving a target job, the target job including n tasks; performing node screening in a node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. Here n is an integer greater than or equal to 1, and m is any positive integer between 1 and n.
  • By performing node screening in the node cluster according to the n tasks of the target job, n candidate node sets are obtained; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, and that target node is used to process the m-th task. The network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. In this way, when allocating resources for the target job, not only the demand information of the target job but also the network transmission performance of the multiple tasks of the same job is taken into account, which can increase the network transmission speed of the target nodes when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • It should be understood that m is any positive integer between 1 and n: the initial value of m can be set to 1, and then to 2, 3, 4, ..., n, so that m traverses the n tasks and the n candidate node sets, and n target nodes are selected from the n candidate node sets, as the sketch below illustrates.
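  • As an illustration, the following Python sketch implements this traversal under assumed dict-based tasks and nodes; the names (select_targets, score, the "id" field) are illustrative and not taken from the patent.

```python
# Minimal sketch of the described selection loop (illustrative names):
# for the m-th task, pick the highest-scoring node in the m-th candidate set.
def select_targets(tasks, candidate_sets, score):
    """score: callable(node, task, placements) -> network performance score."""
    placements = {}                                        # task id -> node
    for task, candidates in zip(tasks, candidate_sets):    # m = 1..n in order
        target = max(candidates, key=lambda node: score(node, task, placements))
        placements[task["id"]] = target                    # node processes the task
    return placements
```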
  • The higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and selecting the candidate node with the highest performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • The network transmission performance score of a candidate node may be determined by the aggregation degree of the n tasks in the same rack; the goal of scoring along this dimension is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding data transmission across racks between tasks, which can effectively improve the network transmission efficiency of the job.
  • In this way, the multiple tasks included in the target job can be placed, as far as possible, in one or more nodes managed by the same rack, which reduces the cross-rack network transmission bandwidth occupied when running the target job, thereby shortening the running time of the target job and improving its running efficiency. A scoring sketch follows.
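  • A minimal sketch of such a rack-aggregation term, assuming each node reports its rack and free GPUs; the weights and field names are illustrative.

```python
# Hypothetical rack-aggregation term: add points when the candidate node's
# rack can hold every task of the job, subtract points otherwise.
def rack_aggregation_score(node, tasks, racks):
    """racks: rack id -> list of node dicts with a "free_gpu" field."""
    free_in_rack = sum(n["free_gpu"] for n in racks[node["rack"]])
    demand = sum(t["gpu"] for t in tasks)
    return 10 if free_in_rack >= demand else -10   # illustrative weights
```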
  • The higher the affinity between the n tasks, the higher the network transmission performance score, and selecting the candidate node with the highest network transmission performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • For example, the type of the m-th task may be a parameter node task: the n tasks include worker node tasks and parameter node tasks, where the worker node tasks are used to perform the iterative operations of the neural network and, since the neural network model involves input parameters and output parameters, the parameter node is used to manage the input parameters and output parameters of the worker nodes.
  • The network transmission performance score of a candidate node may be determined by the affinity between the different types of tasks among the n tasks. The goal of scoring by this affinity is to place the worker node tasks and parameter node tasks of the same job in one node as far as possible, so that data transmission internal to the job occurs within the same node as much as possible, while avoiding concentrating multiple parameter node tasks of the same job in the same node: if that node failed, the multiple parameter node tasks would all stop, and the input and output parameters of the job's worker node tasks could no longer be effectively managed.
  • Affinity means that if two applications, application A and application B, interact frequently, affinity should be used to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity: when an application is deployed in multiple copies, anti-affinity should be used to disperse the application instances across different nodes to improve reliability. Therefore, between worker node tasks, and between worker node tasks and parameter node tasks, affinity should be increased so that the tasks are placed as close as possible, for example on the same node, while the affinity between parameter node tasks should be reduced (that is, their anti-affinity should be increased), so that parameter node tasks are placed on as many different nodes as possible. A scoring sketch follows.
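  • A sketch of an affinity term along these lines, with assumed task types "worker" and "ps" and illustrative weights.

```python
# Illustrative affinity term: reward co-locating a task with tasks of the
# same job already on the node; penalize stacking parameter node tasks.
def affinity_score(node, task, placements, tasks_by_id):
    same_job_here = [tasks_by_id[tid] for tid, n in placements.items() if n is node]
    if task["type"] == "worker":
        # a worker gains affinity to any task of the job already placed here
        return 5 * len(same_job_here)
    workers_here = sum(t["type"] == "worker" for t in same_job_here)
    ps_here = sum(t["type"] == "ps" for t in same_job_here)
    return 5 * workers_here - 8 * ps_here   # anti-affinity between PS tasks
```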
  • Optionally, selecting the candidate node with the highest network transmission performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • The performance score of a candidate node may be determined by the cross-node degree of the n tasks; the goal of scoring by cross-node degree is to take into account how the jobs already allocated resources occupy the bandwidth between nodes.
  • In this way, the occupancy of the transmission bandwidth between nodes can be evaluated through the jobs to which resources have already been allocated, so that when allocating resources for the target job, not only the demand information of the target job but also the network transmission information is considered, which can improve the network transmission performance when running the target job, thereby shortening its running time and improving its running efficiency.
  • If the n tasks can all be placed in a single candidate node of the m-th candidate node set, then the greater the cross-node count of a candidate node, the more frequently the other jobs already running on it exchange data with other nodes. Selecting such a candidate node as the target node of the current task ensures that, after the task is assigned, the node need not increase its number of interactions with other nodes; therefore, by increasing the bonus amplitude of that node's performance score, it can be preferentially selected as the target node.
  • Conversely, when the cross-node count of a candidate node is smaller, the other jobs running on it interact little with other nodes; by reducing the bonus amplitude of that node's performance score, it will not be preferentially selected as the target node.
  • If instead the n tasks must span nodes, selecting a candidate node with a large cross-node count as the target node of the current task would further increase that node's interactions with other nodes and degrade its network performance; therefore, by reducing the bonus amplitude of that node's performance score, it will not be selected as the target node.
  • Likewise, when the cross-node count of a candidate node is small, the other jobs running on it interact little with other nodes; by increasing the bonus amplitude of that node's performance score, it is preferentially selected as the target node. After the task of the target job is assigned to it, the node's interactions with other nodes increase only moderately, thereby optimizing the overall allocation efficiency. A sketch of this term follows.
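  • The flip in bonus amplitude can be sketched as follows; the "cross_degree" field and the weights are assumptions for illustration.

```python
# Sketch of the cross-node term: the sign of the bonus flips depending on
# whether the whole job fits on a single node.
def cross_node_score(node, job_fits_on_one_node):
    """node["cross_degree"]: how many peers the node's running jobs talk to."""
    if job_fits_on_one_node:
        # single-node job: an already "chatty" node loses nothing by hosting
        # it, and quiet nodes stay free for jobs that must span nodes
        return 2 * node["cross_degree"]
    # multi-node job: adding it to a chatty node would further degrade that
    # node's network, so prefer nodes with few cross-node interactions
    return -2 * node["cross_degree"]
```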
  • Optionally, the cross-node degree of the n tasks is determined according to the number of different candidate nodes to which the n tasks are allocated; that is, the network transmission load of a node can be estimated from its cross-node count.
  • Alternatively, the cross-node degree of the n tasks is determined by monitoring the real-time bandwidth usage of the network; that is, the cross-node count of the n tasks can be obtained by monitoring a smoothed value of the network's real-time bandwidth usage.
  • For example, a data packet can be captured, and the task ID corresponding to the packet determined by inspecting its IP address; from the task ID it can be determined whether the corresponding job is running. The more jobs that are running, the greater the real-time network bandwidth in use, which indicates a larger cross-node degree of the n tasks.
  • The smoothed value of the real-time bandwidth in use may refer to the bandwidth load at a certain moment, or to the bandwidth load obtained after smoothing the bandwidth used at multiple moments within a preset time period, where smoothing can refer to data-processing methods such as taking the average value, the maximum value, or the minimum value, for example as follows.
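  • The following sketch aggregates bandwidth samples over a window with the mean, maximum, or minimum, the three smoothing methods named above; the function name is illustrative.

```python
from statistics import mean

def smoothed_bandwidth(samples, how="mean"):
    """samples: bandwidth readings (e.g. in MB/s) within the preset window."""
    return {"mean": mean, "max": max, "min": min}[how](samples)

# smoothed_bandwidth([120.0, 340.5, 80.2], how="max") -> 340.5
```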
  • The smaller the idleness of a node, the higher the network transmission performance score, and selecting the candidate node with the highest network transmission performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • The performance score of a candidate node can be determined by node idleness. The goal of scoring by node idleness is to keep nodes whose job-training hardware resources are completely idle in reserve for subsequent large-scale tasks, so that large-scale tasks can be placed within a single node as far as possible and resource fragmentation is avoided. Therefore, points are added to the performance score of candidate nodes whose job-training hardware resources are already in use, so that they are preferentially selected as target nodes, while candidate nodes whose job-training hardware resources are unused are not preferentially selected; unused nodes thus remain idle and partly used nodes are filled up, which avoids resource fragmentation.
  • Optionally, the hardware resources include a graphics processing unit (GPU) and a central processing unit (CPU).
  • Selecting the candidate node with the highest network transmission performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task further includes considering the hardware resource usage, that is, the allocation rate of hardware resources: the higher the allocation rate, the more fully the hardware resources of the candidate node are used. A sketch of such an idleness term follows.
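  • A sketch of such an idleness term, assuming each node reports total and free GPUs; the weight is illustrative.

```python
# Illustrative idleness term: the score rises with the node's GPU allocation
# rate, so partly used nodes are packed first and fully idle nodes are kept
# in reserve for large tasks, avoiding resource fragmentation.
def idleness_score(node):
    allocation_rate = (node["total_gpu"] - node["free_gpu"]) / node["total_gpu"]
    return 10 * allocation_rate   # illustrative weight
```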
  • Optionally, each task of the target job carries its hardware resource requirements, and performing node screening in the node cluster according to the n tasks of the target job to obtain the n candidate node sets includes: performing node screening in the node cluster separately for each task according to the hardware resource requirements that it carries, where the hardware resources of each of the n candidate node sets match the hardware resource requirements carried by the corresponding task, for example as sketched below.
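  • A hedged sketch of this screening step, with assumed per-task requirement fields (gpu, cpu, gpu_type):

```python
# One candidate set per task: keep only the nodes whose hardware matches the
# requirements that the task carries.
def screen_candidates(tasks, cluster):
    candidate_sets = []
    for task in tasks:
        matches = [node for node in cluster
                   if node["free_gpu"] >= task["gpu"]
                   and node["free_cpu"] >= task["cpu"]
                   and node.get("gpu_type") == task.get("gpu_type")]
        candidate_sets.append(matches)   # the m-th set serves the m-th task
    return candidate_sets
```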
  • Optionally, the target job includes a training job of an artificial intelligence model. More generally, the target job refers to any job that has network transmission load requirements at runtime; it may be the training job of an artificial intelligence model or some other job, and this application places no limitation on this.
  • In a second aspect, a job scheduling device is provided, including: a receiving unit, configured to receive a target job, the target job including n tasks, node screening being performed in the node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and a processing unit, configured to select, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. Here n is an integer greater than or equal to 1, and m is any positive integer between 1 and n.
  • By performing node screening in the node cluster according to the n tasks of the target job, n candidate node sets are obtained; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, and that target node is used to process the m-th task. The network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. In this way, when allocating resources for the target job, not only the demand information of the target job but also the network transmission performance of the multiple tasks of the same job is taken into account, which can increase the network transmission speed of the target nodes when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • It should be understood that m is any positive integer between 1 and n: the initial value of m can be set to 1, and then to 2, 3, 4, ..., n, so that m traverses the n tasks and the n candidate node sets, and n target nodes are selected from the n candidate node sets.
  • The higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and the processing unit is specifically configured as follows:
  • The network transmission performance score of a candidate node may be determined by the aggregation degree of the n tasks in the same rack; the goal of scoring along this dimension is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding data transmission across racks between tasks, which can effectively improve the network transmission efficiency of the job.
  • In this way, the multiple tasks included in the target job can be placed, as far as possible, in one or more nodes managed by the same rack, which reduces the cross-rack network transmission bandwidth occupied when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • the higher the affinity between the n tasks, the higher the network transmission performance score, and the processing unit is specifically configured to:
  • For example, the type of the m-th task may be a parameter node task: the n tasks include worker node tasks and parameter node tasks, where the worker node tasks are used to perform the iterative operations of the neural network and, since the neural network model involves input parameters and output parameters, the parameter node is used to manage the input parameters and output parameters of the worker nodes.
  • The network transmission performance score of a candidate node may be determined by the affinity between the different types of tasks among the n tasks. The goal of scoring by this affinity is to place the worker node tasks and parameter node tasks of the same job in one node as far as possible, so that data transmission internal to the job occurs within the same node as much as possible, while avoiding concentrating multiple parameter node tasks of the same job in the same node: if that node failed, the multiple parameter node tasks would all stop, and the input and output parameters of the job's worker node tasks could no longer be effectively managed.
  • Affinity means that if two applications, application A and application B, interact frequently, affinity should be used to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity: when an application is deployed in multiple copies, anti-affinity should be used to disperse the application instances across different nodes to improve reliability. Therefore, between worker node tasks, and between worker node tasks and parameter node tasks, affinity should be increased so that the tasks are placed as close as possible, for example on the same node, while the affinity between parameter node tasks should be reduced (that is, their anti-affinity should be increased), so that parameter node tasks are placed on as many different nodes as possible.
  • the processing unit is specifically configured to:
  • The performance score of a candidate node may be determined by the cross-node degree of the n tasks; the goal of scoring by cross-node degree is to take into account how the jobs already allocated resources occupy the bandwidth between nodes.
  • In this way, the occupancy of the transmission bandwidth between nodes can be evaluated through the jobs to which resources have already been allocated, so that when allocating resources for the target job, not only the demand information of the target job but also the network transmission information is considered, which can improve the network transmission performance when running the target job, thereby shortening its running time and improving its running efficiency.
  • If the n tasks can all be placed in a single candidate node of the m-th candidate node set, then the greater the cross-node count of a candidate node, the more frequently the other jobs already running on it exchange data with other nodes. Selecting such a candidate node as the target node of the current task ensures that, after the task is assigned, the node need not increase its number of interactions with other nodes; therefore, by increasing the bonus amplitude of that node's performance score, it can be preferentially selected as the target node.
  • Conversely, when the cross-node count of a candidate node is smaller, the other jobs running on it interact little with other nodes; by reducing the bonus amplitude of that node's performance score, it will not be preferentially selected as the target node.
  • If instead the n tasks must span nodes, selecting a candidate node with a large cross-node count as the target node of the current task would further increase that node's interactions with other nodes and degrade its network performance; therefore, by reducing the bonus amplitude of that node's performance score, it will not be selected as the target node.
  • Likewise, when the cross-node count of a candidate node is small, the other jobs running on it interact little with other nodes; by increasing the bonus amplitude of that node's performance score, it is preferentially selected as the target node. After the task of the target job is assigned to it, the node's interactions with other nodes increase only moderately, thereby optimizing the overall allocation efficiency.
  • Optionally, the cross-node degree of the n tasks is determined according to the number of different candidate nodes to which the n tasks are allocated; that is, the network transmission load of a node can be estimated from its cross-node count.
  • Alternatively, the cross-node degree of the n tasks is determined by monitoring the real-time bandwidth usage of the network; that is, the cross-node count of the n tasks can be obtained by monitoring a smoothed value of the network's real-time bandwidth usage.
  • For example, a data packet can be captured, and the task ID corresponding to the packet determined by inspecting its IP address; from the task ID it can be determined whether the corresponding job is running. The more jobs that are running, the greater the real-time network bandwidth in use, which indicates a larger cross-node degree of the n tasks.
  • The smoothed value of the real-time bandwidth in use may refer to the bandwidth load at a certain moment, or to the bandwidth load obtained after smoothing the bandwidth used at multiple moments within a preset time period, where smoothing can refer to data-processing methods such as taking the average value, the maximum value, or the minimum value.
  • the processing unit is specifically configured to:
  • The performance score of a candidate node can be determined by node idleness. The goal of scoring by node idleness is to keep nodes whose job-training hardware resources are completely idle in reserve for subsequent large-scale tasks, so that large-scale tasks can be placed within a single node as far as possible and resource fragmentation is avoided. Therefore, points are added to the performance score of candidate nodes whose job-training hardware resources are already in use, so that they are preferentially selected as target nodes, while candidate nodes whose job-training hardware resources are unused are not preferentially selected; unused nodes thus remain idle and partly used nodes are filled up, which avoids resource fragmentation.
  • Optionally, the hardware resources include a graphics processing unit (GPU) and a central processing unit (CPU).
  • the processing unit is further configured to:
  • The hardware resource usage is considered, that is, the allocation rate of hardware resources: the higher the allocation rate, the more fully the hardware resources of the candidate node are used.
  • Optionally, each task of the target job carries its hardware resource requirements, and the processing unit is specifically configured to perform node screening in the node cluster separately for each task according to the hardware resource requirements that it carries, to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirements carried by the corresponding task.
  • Optionally, the target job includes a training job of an artificial intelligence model. More generally, the target job refers to any job that has network transmission load requirements at runtime; it may be the training job of an artificial intelligence model or some other job, and this application places no limitation on this.
  • In a third aspect, a job scheduling device is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program is executed, the processor is configured to: receive a target job, the target job including n tasks; perform node screening in a node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and select, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. Here n is an integer greater than or equal to 1, and m is any positive integer between 1 and n.
  • The foregoing job scheduling device includes a processor that is further configured to execute the method of the first aspect and of any one of the implementation manners of the first aspect.
  • By performing node screening in the node cluster according to the n tasks of the target job, n candidate node sets are obtained; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, and that target node is used to process the m-th task. The network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. In this way, when allocating resources for the target job, not only the demand information of the target job but also the network transmission performance of the multiple tasks of the same job is taken into account, which can increase the network transmission speed of the target nodes when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • In a fourth aspect, a computer storage medium is provided, the storage medium storing program code, where the program code includes instructions for executing the steps of the job scheduling method in the first aspect and in any one of the implementation manners of the first aspect.
  • the above-mentioned storage medium may specifically be a non-volatile storage medium.
  • In a fifth aspect, a chip is provided, the chip including a processor and a data interface. The processor reads, through the data interface, instructions stored in a memory and executes the job scheduling method of the first aspect and of any one of the implementation manners of the first aspect. Optionally, the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the job scheduling method of the first aspect and of any one of the implementation manners of the first aspect.
  • FIG. 1 is a schematic diagram of a typical fully connected network model provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a training process of a neural network model provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of distributed training in a parameter node mode provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of distributed training in a decentralized parameter synchronization manner provided by an embodiment of the present application
  • FIG. 5 is a schematic diagram of the system architecture of AI training provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the physical architecture of the AI training task provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a job scheduling method provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a job scheduling method provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a job scheduling method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a job scheduling device provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a job scheduling device provided by an embodiment of the present application.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be divided according to the positions of its different layers into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in the middle are all hidden layers. The layers are fully connected, that is to say, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
  • Figure 1 shows a typical fully connected network model, which includes an input layer 110, a hidden layer 120, a hidden layer 130, and an output layer 140. Data flows in from the input layer 110, passes through the calculations of each layer, and the result finally emerges from the output layer 140. Each intermediate layer has several parameters, and the input from the previous layer is transformed to produce that layer's output; the model parameters must be fitted with a large amount of data through model training to obtain the best model effect. A numeric sketch of this flow follows.
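  • The following numeric sketch mirrors that flow with made-up layer sizes; it is only an illustration of Figure 1, not code from the application.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]          # input 110, hidden 120 and 130, output 140
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b                      # linear step of the layer
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)         # activation on hidden layers
    return x

print(forward(rng.standard_normal(4)))     # value produced by output layer 140
```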
  • FIG. 2 is a schematic diagram of a training process of a neural network model provided in an embodiment of the present application.
  • the training process includes steps S210 to S280, and steps S210 to S280 are described in detail below.
  • the network model is loaded for the first time.
  • The forward propagation algorithm uses the weight coefficient matrices W, the bias vectors b, and the input value vector x to perform a series of linear operations and activation operations; that is, computation proceeds layer by layer from the input layer until it reaches the output layer, where the output result is obtained as a value.
  • During training, the neural network can use the error back propagation (BP) algorithm to correct the parameters of the initial neural network model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forwarding the input signal to the output incurs an error loss, and the parameters of the initial model are updated by backpropagating the error loss information, so that the error loss converges. The back-propagation algorithm is a backward pass dominated by the error loss and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.
  • In this way, the network model parameters are iteratively updated; a minimal loop of this cycle follows.
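  • A minimal loop showing that cycle on a single linear layer with squared loss (an illustration, not the patent's model): forward pass, error loss, backpropagated gradient, parameter update.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = np.zeros(3)
for step in range(200):
    pred = X @ w                  # forward propagation
    err = pred - y                # error loss term
    grad = X.T @ err / len(X)     # backpropagated gradient of the squared loss
    w -= 0.1 * grad               # parameter update; the loss converges
print(w)                          # approaches [1.0, -2.0, 0.5]
```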
  • Distributed training refers to collaborative training through the central processing unit (CPU) or graphics processing unit (GPU) devices of multiple nodes. The mainstream distributed training methods include the centralized parameter-node method and the decentralized AllReduce method; the following describes distributed training on GPUs. It should be understood that the CPU case is similar, except that only CPUs serve as the computing devices of the worker nodes.
  • Fig. 3 is a schematic diagram of a parameter node mode provided by an embodiment of the present application.
  • As shown in Fig. 3, a parameter node 310 (also called a parameter server, PS), a working node 320, and a working node 330 may be included.
  • Both the parameter node and the working nodes may be implemented by servers: the server implementing the parameter node may include at least one CPU, and a server implementing a working node may include at least one CPU and at least one GPU, where the GPU is used for job training.
  • The parameter node 310 is the central synchronization node of the model during machine-learning model training: it is responsible for maintaining the parameters of the model and, as they are updated during iterative training, distributing the parameters to the different devices so that they can update their model copies and continue training.
  • Each GPU participating in the training holds a copy of the same neural network model; the GPUs can be on different nodes, and the CPU of each node (for example, the working node 320 or the working node 330) issues instructions to call the GPU to perform the model's computations.
  • Different GPUs process different batches of data; after each iteration they must synchronize parameters with the parameter node 310 to ensure that the parameters on the different GPUs remain consistent during model training, as in the toy sketch below.
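  • A toy view of this pattern (illustrative class and field names): workers compute gradients on their own batches, and the parameter node averages them and hands fresh parameters back.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)
    def update(self, worker_grads, lr=0.1):
        self.params -= lr * np.mean(worker_grads, axis=0)  # aggregate, apply
        return self.params            # redistributed to every worker

ps = ParameterServer(dim=4)
grads = [np.ones(4), 2 * np.ones(4)]          # one gradient per worker batch
new_params = ps.update(grads)                 # all workers now stay consistent
```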
  • Fig. 4 is a schematic diagram of a decentralized parameter synchronization manner provided by an embodiment of the present application.
  • As shown in Fig. 4, multiple working nodes, for example working node 401 to working node 405, can communicate with each other directly to exchange synchronization parameters or gradient values, without passing through a parameter node (also known as a parameter server); a naive sketch follows.
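  • As a contrast with the parameter-server sketch above, a naive all-reduce can be illustrated as follows; real systems use ring or tree variants rather than this direct mean.

```python
import numpy as np

def all_reduce_mean(per_worker_grads):
    avg = np.mean(per_worker_grads, axis=0)    # every worker ends with this value
    return [avg.copy() for _ in per_worker_grads]

grads = [np.array([1.0, 3.0]), np.array([3.0, 1.0])]
print(all_reduce_mean(grads))                  # both workers hold [2.0, 2.0]
```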
  • A dedicated scheduler needs to schedule the jobs of different users and select appropriate nodes on which to run the different tasks of each job: on the one hand, it must meet a job's hardware and software environment requirements; on the other hand, it must also improve resource utilization to achieve the core purpose of resource sharing, namely time-sharing and multiplexing.
  • Moreover, different jobs may compete for network resources on the same link; the scheduler must then schedule resources for the jobs of multiple users, selecting for each job the appropriate nodes and GPUs on which to place its tasks.
  • Distributed training is usually used to meet the timeliness requirements of jobs that have network transmission requirements, for example, AI training jobs.
  • When a distributed training method is adopted, different jobs may compete for the same hardware resources; a scheduler is therefore needed to schedule hardware resources for the jobs of multiple users, allocating suitable nodes (for example, servers) to each job to run the tasks that the job includes.
  • However, current schedulers usually allocate nodes based only on a task's hardware resource requirements, ignoring the network performance requirements of AI training jobs. For example, in AI training there are network transmission requirements between multiple tasks of the same job; because the existing technology ignores these requirements, AI training jobs run inefficiently.
  • In view of this, the present application proposes a job scheduling method and a job scheduling device: by performing node screening in a node cluster according to the n tasks of the target job, n candidate node sets are obtained; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task and is used to process that task. The network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. When assigning resources to the target job, not only the demand information of the target job but also the network transmission performance of the multiple tasks of the same job is taken into consideration, which can increase the network transmission speed of the target nodes when the target job is running, thereby shortening the running time of the target job and improving its running efficiency.
  • Fig. 5 is a schematic diagram of an AI training system architecture provided by an embodiment of the present application.
  • the system architecture may include a user graphical interface/client 510, an AI job management server 520, a resource management server 530, and a hardware infrastructure 540.
  • the user graphical interface/client 510 may be used to receive AI training jobs from different users.
  • the AI job management server 520 can be used to manage and submit AI training jobs received from different users;
  • the resource management server 530 can include resource management and a scheduler, where the resource management can be used to bind and release resources;
  • The scheduler can schedule resources according to the requirements of the different jobs;
  • The hardware infrastructure 540 can include CPU, memory, network, GPU, and remote direct memory access (RDMA) resources.
  • In an exemplary workflow, a user submits an AI training job through the user graphical interface/client 510; after receiving the request, the AI job management server 520 parses the job and submits a resource request to the resource management server 530; after the resource management server 530 receives the request, its scheduler selects appropriate nodes from the managed hardware infrastructure 540, that is, the underlying physical resources, to place the job. Once the scheduler completes node selection, the corresponding AI training job is started on the selected nodes; those resources are occupied by the job and are released after the job ends.
  • FIG. 6 is a schematic diagram of the physical architecture of a data center used for AI training operations provided by an embodiment of the present application.
  • The physical architecture may include a first-level switch 610, a second-level switch 620, and a second-level switch 630. The first-level switch 610 manages the second-level switch 620 and the second-level switch 630; the second-level switch 620 manages the server 621 and the server 622; the second-level switch 630 manages the server 631 and the server 632.
  • The first-level switch 610 may be a core switch; the second-level switch 620 and the second-level switch 630 may be top-of-rack switches. A top-of-rack switch may be connected to multiple servers, each of which includes CPU and GPU resources; here, a server corresponds to a node in the embodiments of the present application.
  • It should be understood that the foregoing physical architecture may also include one or more additional levels of switches; Fig. 6 uses two levels of switches, namely the first-level switch and the second-level switches, merely as an example, and the embodiment of the present application is not limited in this respect.
  • The second-level switch 620, the server 621, and the server 622 are installed in the same rack, for example rack 1, and the second-level switch 630, the server 631, and the server 632 are installed in the same rack, for example rack 2.
  • the job scheduling method shown in FIG. 7 may be executed by the scheduler shown in FIG. 5, and may be applied to the physical architecture shown in FIG. 6.
  • the method 700 shown in FIG. 7 includes S710 to S730, and these steps are respectively described in detail below.
  • a resource request for a target job can be received, and the resource request can be used to request resources for running the target job.
  • the resource request can carry demand information of the target job.
  • the target job refers to a job that has network transmission requirements at runtime.
  • For example, the hardware resource requirements carried by the job can be received; according to the hardware resource requirement carried by each task, the scheduler can perform node screening in the node cluster separately for each task to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirements carried by the corresponding task.
  • the above-mentioned target job may refer to an AI training job, or may also refer to other types of jobs that have network transmission requirements.
  • Optionally, resource requests for multiple target jobs may also be received, for example from different users or from the same user; each of the multiple target jobs can include multiple tasks.
  • each candidate node set includes multiple candidate nodes.
  • Specifically, the hardware resource requirements carried by the job can be received; according to the hardware resource requirement carried by each task, the scheduler can perform node screening in the node cluster separately for each task to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirements carried by the corresponding task.
  • Screening by hardware resource requirement may mean selecting eligible nodes through port filtering, node label matching, and the like, for example by the GPU type a node includes. Port filtering may mean that the job can run on nodes other than those with a certain port number; node label matching may mean that the nodes to run the target job are selected according to an IP address range.
  • The node screening in step S720 can use methods common to existing schedulers, and is not limited here.
  • The target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of: the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness, where n is an integer greater than or equal to 1 and m is any positive integer between 1 and n. It should be understood that the initial value of m can be set to 1, and then to 2, 3, 4, ..., n, so that m traverses the n tasks and the n candidate node sets, and n target nodes are respectively selected from the n candidate node sets.
  • Optionally, the higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and selecting the candidate node with the highest performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • The network transmission performance score of a candidate node may be determined by the aggregation degree of the n tasks in the same rack; the goal of scoring along this dimension is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding data transmission across racks between tasks, which can effectively improve the network transmission efficiency of the job.
  • Specifically, it can be judged whether the n tasks can be placed in the rack where a candidate node of the m-th candidate node set is located. For example, suppose one candidate node in the m-th candidate node set is the server 621: it can be judged whether the n tasks can be placed in the servers connected to the second-level switch 620, that is, whether the n tasks fit in the server 621, or in the server 621 together with the server 622. If the servers connected to the second-level switch 620 can hold the n tasks, points are added to that server's performance score; if they cannot, points are deducted from its performance score.
  • For example, suppose a candidate node set includes candidate node 1 to candidate node 4, where candidate node 1 and candidate node 2 belong to rack 1 and candidate node 3 and candidate node 4 belong to rack 2. If none of the tasks of a job has been allocated, placeability within a single rack takes priority: if the resources of the candidate nodes managed in rack 1 can accommodate all tasks of the job, the tasks are preferentially assigned to resources in rack 1. If at least one task of the job has already been bound to resources, for example one task has been assigned to candidate node 1, then the job's other tasks are preferentially assigned to candidate node 1, or to candidate node 2 in the same rack 1, as sketched below.
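  • The placement priority of that example can be sketched as follows; the field names and the fallback behavior are assumptions for illustration.

```python
# Prefer the rack of already-bound tasks; otherwise prefer any rack that can
# hold the whole job; fall back to all candidates if neither applies.
def preferred_nodes(job_tasks, placements, candidates, racks):
    bound_racks = {placements[t["id"]]["rack"]
                   for t in job_tasks if t["id"] in placements}
    if bound_racks:   # e.g. one task already sits on candidate node 1 in rack 1
        return [n for n in candidates if n["rack"] in bound_racks] or candidates
    fitting = [n for n in candidates
               if sum(x["free_gpu"] for x in racks[n["rack"]])
               >= sum(t["gpu"] for t in job_tasks)]
    return fitting or candidates
```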
  • In this way, the multiple tasks included in the target job can be placed, as far as possible, in one or more nodes managed by the same rack, which reduces the cross-rack network transmission bandwidth occupied when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • Optionally, the higher the affinity between the n tasks, the higher the network transmission performance score, and selecting the candidate node with the highest performance score from the m-th candidate node set corresponding to the m-th task of the n tasks as the target node of the m-th task includes:
  • For example, the type of the m-th task may be a parameter node task: the n tasks include worker node tasks and parameter node tasks, where the worker node tasks are used to perform the iterative operations of the neural network and, since the neural network model involves input parameters and output parameters, the parameter node is used to manage the input parameters and output parameters of the worker nodes.
  • The network transmission performance score of a candidate node may be determined by the affinity between the different types of tasks among the n tasks. The goal of scoring by this affinity is to place the worker node tasks and parameter node tasks of the same job in one node as far as possible, so that data transmission internal to the job occurs within the same node as much as possible, while avoiding concentrating multiple parameter node tasks of the same job in the same node: if that node failed, the multiple parameter node tasks would all stop, and the input and output parameters of the job's worker node tasks could no longer be effectively managed.
  • Specifically, the n tasks may include different types of tasks, such as worker node tasks and parameter node tasks. As shown in Figure 4, when each of the multiple tasks is a worker node task and the m-th task is a worker node task, it is judged whether a server already holds other worker node tasks or parameter node tasks of the n tasks; if other worker node tasks or parameter node tasks of the n tasks have been placed in the server, points are added to the server's performance score.
  • The n tasks may also include both worker node tasks and parameter node tasks. As shown in Fig. 3, the parameter node 310 may also be called a parameter server; when the type of the m-th task is a parameter node task, it is judged whether a candidate node in the m-th candidate node set already holds worker node tasks of the n tasks. That is, as shown in Fig. 6, if the m-th task is a parameter node task, it is judged whether a server already holds worker node tasks of the n tasks; if it does, points are added to the server's performance score. It is also judged whether the server already holds other parameter node tasks of the n tasks; if it does, points are deducted from the server's performance score.
  • In other words, worker node tasks and parameter node tasks should be co-located as much as possible, while, because the data volume of a parameter node task is relatively large, placing multiple parameter nodes in the same server should be avoided.
  • Affinity means that if two applications, application A and application B, interact frequently, affinity should be used to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity: when an application is deployed in multiple copies, anti-affinity should be used to disperse the application instances across different nodes to improve reliability. Therefore, between worker node tasks, and between worker node tasks and parameter node tasks, affinity should be increased so that the tasks are placed as close as possible, for example on the same node, while the affinity between parameter node tasks should be reduced (that is, their anti-affinity should be increased), so that parameter node tasks are placed on as many different nodes as possible.
  • the affinity between different types of tasks in n tasks is scored, and the affinity of different types of task allocation resources can be considered, so that the tasks of the work node type are placed as much as possible, and then It can shorten the running time of the target job and improve the running efficiency of the target job.
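To make the affinity rule concrete, the following is a minimal Python sketch of how such a per-node affinity evaluation could look. The function name, the representation of placed tasks as (job_id, task_type) pairs, and the +1/-1 increments are illustrative assumptions, not the literal implementation of this application.

```python
def affinity_score(node_tasks, task_type, job_id):
    """Score one candidate node for a new task of a given job.

    node_tasks: tasks already placed on the node, as (job_id, task_type)
    pairs, where task_type is "worker" or "ps" (parameter node task).
    """
    same_job_types = [t for j, t in node_tasks if j == job_id]
    score = 0
    if task_type == "worker":
        # Bonus: the node already hosts a worker or PS task of the same job.
        if same_job_types:
            score += 1
    else:  # task_type == "ps"
        # Bonus for co-locating the PS with the job's workers ...
        if "worker" in same_job_types:
            score += 1
        # ... but a penalty (anti-affinity) for stacking several PS tasks.
        if "ps" in same_job_types:
            score -= 1
    return score
```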
  • a parameter node task maintains the parameters of the model and distributes the parameters, updated after each training iteration, to the different work nodes;
  • a work node task performs the iterative computation on a given batch of data;
  • there is frequent data interaction between the parameter node and the work nodes: for example, the parameter node sends the initial parameters to a work node, and the work node sends the updated parameters back to the parameter node.
  • selecting the candidate node with the highest network transmission performance score from the m-th candidate node set corresponding to the m-th task among the n tasks as the target node of the m-th task includes:
  • the performance score of a candidate node mentioned above may be determined according to the cross-node degree of the n tasks, where the goal of scoring by the cross-node degree of the n tasks is to take into account how the already-allocated jobs occupy the bandwidth between nodes.
  • the bonus range is greater than the deduction range; jobs that do not need to be allocated across nodes are preferentially placed on candidate nodes with a large cross-node count, while jobs that must span nodes are preferentially placed on candidate nodes with a small cross-node count.
  • the demand information thus also takes network transmission into account, which can improve the network transmission performance when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
  • when the n tasks can all be placed on one candidate node in the m-th candidate node set, a larger cross-node count of a candidate node indicates that the other jobs running on it frequently exchange data with other nodes.
  • selecting that candidate node as the target node of the current task guarantees that, after the current task is assigned, the node need not increase its number of interactions with other nodes; therefore, increasing the bonus range of the candidate node's performance score ensures that it is preferentially selected as the target node.
  • conversely, a smaller cross-node count indicates that the candidate node's running jobs interact little with other nodes, so the bonus range of the candidate node's performance score is reduced to ensure that it is not preferentially selected as the target node.
  • when the n tasks cannot all be placed on one candidate node, a larger cross-node count indicates that the node's running jobs already exchange data frequently across nodes; selecting it as the target node would make it further increase its interactions with other nodes and degrade its network performance, so the bonus range of its performance score is reduced to ensure that it is not selected as the target node.
  • conversely, a smaller cross-node count indicates that the node's running jobs interact rarely with other nodes; increasing the bonus range of its performance score ensures that it is preferentially selected as the target node, and after a task of the target job is assigned to it, its number of interactions with other nodes can be moderately increased, which optimizes allocation efficiency.
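The two cases above can be captured by a small scoring rule. The sketch below is an assumed formulation (the parameter name num_cross_node_jobs and the exact bonus curve are illustrative), not the patent's literal formula.

```python
def cross_node_bonus(num_cross_node_jobs, job_fits_on_one_node):
    """Bonus added to a candidate node's score in the cross-node dimension.

    num_cross_node_jobs: the node's network connections of running
    cross-node jobs. Jobs that fit on one node prefer already-busy nodes;
    jobs that must span nodes prefer quiet nodes.
    """
    if job_fits_on_one_node:
        return num_cross_node_jobs          # busier node => bigger bonus
    return 1.0 / (1 + num_cross_node_jobs)  # quieter node => bigger bonus
```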
  • the cross-node degree of the n tasks is determined according to the number of different candidate nodes to which the n tasks are allocated.
  • the scheduler can record, for each node, the number of network connections of cross-node jobs on that node.
  • alternatively, the cross-node degree of the n tasks can be determined by monitoring the real-time bandwidth used on the network.
  • the smoothed value of the real-time bandwidth used may refer to the bandwidth load at a given moment; or it may refer to the bandwidth load obtained by smoothing the bandwidth used at multiple moments within a preset time period, where smoothing may refer to data processing methods such as taking the average, the maximum, or the minimum.
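A possible smoothing step, assuming bandwidth samples collected over the preset window; the choice among mean, max, and min mirrors the three options named above.

```python
from statistics import mean

def smoothed_bandwidth(samples, mode="mean"):
    """Smooth per-moment bandwidth readings (e.g. bytes/s) from a window.

    mode selects the data processing method mentioned in the text:
    "mean", "max", or "min".
    """
    if mode == "mean":
        return mean(samples)
    if mode == "max":
        return max(samples)
    return min(samples)
```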
  • a data packet can be obtained, and the task ID corresponding to the data packet determined by checking the packet's IP address; from the task ID it can be determined whether the corresponding job is running; the more jobs are running, the larger the real-time network bandwidth in use, and hence the larger the cross-node degree of the n tasks.
  • if the n tasks can all be placed on a certain server, the larger the server's cross-node count, the higher the server's performance score; here the cross-node count of a server may refer to the number of other servers with which the server needs to exchange data, or the server's real-time monitored bandwidth may be used to indicate the size of its cross-node degree. If the n tasks cannot all be placed on one server, then the smaller the server's cross-node count, the higher the server's performance score. In other words, jobs that do not need to be placed across servers are preferentially placed on servers with a large cross-node count, and jobs that must be placed across servers are preferentially placed on servers with a small cross-node count.
  • for the specific implementation of determining the performance score of a candidate node through the cross-node degree of the n tasks, refer to step S833 shown in FIG. 8.
  • selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
  • the performance score of a candidate node described above may be determined by node idleness, where the goal of scoring by node idleness is to keep nodes whose hardware resources for job training are completely idle in reserve, so as to cope with subsequent large-scale tasks, allowing large-scale tasks to be placed on a single node as far as possible and avoiding resource fragmentation. Therefore, by adding points to the performance score of a candidate node whose hardware resources for job training are in use, that node is preferentially selected as the target node, while candidate nodes whose hardware resources for job training are unused are not preferentially selected; candidate nodes with unused training hardware thus remain idle, and candidate nodes whose training hardware is already in use become fully used, which avoids resource fragmentation.
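A minimal sketch of the idleness rule, under the assumption that "in use" is detected through a per-node count of allocated training devices (the parameter name is hypothetical):

```python
def idleness_bonus(gpus_allocated):
    """Add a point only when the node's training hardware is already in
    use, so completely idle nodes stay reserved for large-scale tasks."""
    return 1 if gpus_allocated > 0 else 0
```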
  • the hardware resources include a graphics processing unit and a central processing unit.
  • selecting the candidate node with the highest network transmission performance score as the target node of the m-th task from the m-th candidate node set corresponding to the m-th task among the n tasks further includes:
  • the network transmission performance score of the candidate node is increased according to the allocation rate: the larger the allocation rate, the larger the bonus range of the candidate node's network transmission performance score; the smaller the allocation rate, the smaller the bonus range.
  • hardware resource usage here means the allocation rate of the hardware resources.
  • the higher the allocation rate, the more fully the candidate node's hardware resources are used.
  • if the GPU or CPU allocation rate of a server is high, points are added to the server's performance score; if the allocation rate is low, that is, the more CPUs or GPUs are idle, points are deducted from the server's performance score.
  • scoring by node idleness keeps completely idle GPU hosts available as far as possible so that large-scale tasks can be placed, thereby avoiding resource fragmentation, improving the running efficiency of large-scale tasks, and improving the utilization of cluster resources.
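Combining the two rules above, one plausible reading is a bonus proportional to the allocation rate, so that partially used nodes are packed tighter while idle hosts stay free; the linear form is an assumption:

```python
def allocation_rate_bonus(gpus_allocated, gpus_total):
    """Bonus grows with the GPU allocation rate (0.0 idle .. 1.0 full),
    filling partly used nodes first and keeping idle nodes in reserve."""
    if gpus_total == 0:
        return 0.0
    return gpus_allocated / gpus_total
```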
  • the performance score of a candidate node may be determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between different types of tasks among the n tasks, the cross-node degree of the n tasks, and node idleness.
  • based on the several dimension strategies above, the user can separately turn each strategy on or off through configuration; alternatively, the enabled strategies can be combined into scheduling policies with different weight values.
  • the weight values corresponding to the different evaluation dimensions may be thresholds preset according to user needs; the weight values may be set according to the priority of the dimensions. For example, if the rack aggregation degree has the highest priority among the evaluation dimensions, the weight corresponding to the rack aggregation degree can be configured as the largest of the weights.
  • n candidate node sets are obtained; according to the performance score of each candidate node in the m-th candidate node set corresponding to the m-th task, the highest-scoring candidate node is selected as the target node of the m-th task, and the m-th task is assigned to that target node.
  • the performance score may include one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between different types of tasks among the n tasks, the cross-node degree of the n tasks, and node idleness; in the embodiments of this application, when allocating resources to the target job, not only the demand information of the target job but also the network transmission load can be considered, which can improve the network transmission performance when running the target job, thereby shortening the running time of the target job and improving its running efficiency.
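Taken together, the per-dimension evaluations can be combined into one weighted score per candidate node. The sketch below assumes the w1..w5 weight ordering given later in this document (10000, 1000, 100, 10, 1) and uses hypothetical per-dimension values; it is an illustration of the weighted combination, not the literal scoring code.

```python
def total_score(dim_scores, weights=(10000, 1000, 100, 10, 1)):
    """dim_scores: per-node evaluation values for (rack aggregation,
    affinity, cross-node degree, idleness, allocation rate)."""
    return sum(w * s for w, s in zip(weights, dim_scores))

# Pick the target node: the candidate with the highest weighted score.
candidates = {"node-1": (1, 1, 0.5, 1, 0.75),
              "node-2": (1, 0, 1.0, 0, 0.00)}
target = max(candidates, key=lambda name: total_score(candidates[name]))
```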
  • FIG. 8 is a schematic flowchart of a job scheduling method provided by an embodiment of the present application. Wherein, the method includes steps S810 to S870, and these steps are respectively described in detail below.
  • S810: Parse all tasks included in the job, and select target nodes for these tasks in turn.
  • the above-mentioned job may refer to an AI training job, or may also refer to other jobs that have network transmission requirements at runtime.
  • the scheduler may obtain a job from the job queue for scheduling according to a certain rule, and the rule may be the dominant resource fairness (DRF) algorithm or another algorithm.
  • the scheduler parses all tasks included in the job, and schedules each task in turn, selecting an appropriate node for binding, and the bound node is used to execute the task.
  • S820: According to the hardware resource requirement carried by each task, perform node screening in the node cluster to obtain n candidate node sets.
  • screening by hardware resource requirement may mean that eligible nodes are selected through port filtering of the node, label matching of the node, and the like, for example according to the GPU type included in the node.
  • the port filtering of a node may mean that the job can be run on a node other than a certain port number; the label matching of a node may mean that the node that runs the target job is selected according to the IP address range.
  • the method for preselecting nodes in S820 above may adopt a common scheduler method in the prior art, which is not limited here.
  • all candidate nodes can be evaluated in the different dimensions, with each evaluation value multiplied by its weight; finally, the preselected candidate nodes are ranked to obtain the node used to bind a given task.
  • step S830 above may include steps S831 to S835; that is, the network transmission performance scores of all candidate nodes can be evaluated from the rack dimension, the affinity dimension, the cross-node dimension, the large-scale-task dimension, and the dimension of the quantity of hardware resources used to manage nodes.
  • S831: Evaluate the network transmission performance scores of candidate nodes in the rack dimension; the goal of evaluation in this dimension is to place the multiple tasks contained in a single job in the same rack as far as possible, avoiding cross-rack data transmission between tasks, which can effectively improve the network transmission efficiency of the job.
  • the weight value w1 of this dimension can be 10000, and the evaluation value is obtained in the following way:
  • affinity means that if two applications, application A and application B, interact frequently, it is necessary to use affinity to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication; the opposite of affinity is anti-affinity.
  • anti-affinity means that when an application is deployed with multiple replicas, it is necessary to use anti-affinity to spread the application instances across the nodes to improve reliability.
  • the candidate node set includes candidate node 1 to candidate node 4, where candidate node 1 and candidate node 2 correspond to rack 1, and candidate node 3 and candidate node 4 correspond to rack 2. If none of the tasks included in a job has been allocated, placement in the same rack is prioritized; that is, if the resources of the candidate nodes in rack 1 can accommodate all tasks of the job, the tasks are preferentially assigned to the resources in rack 1. If at least one of the tasks contained in a job has already been bound to resources, for example one task of the job has been assigned to candidate node 1, then the other tasks of the job are preferentially assigned to candidate node 1 or to candidate node 2, which belongs to the same rack 1 as candidate node 1.
  • the resources of a rack may refer to the servers included in the rack, that is, the hardware resources in the candidate nodes; for example, the hardware resources may be the CPU, GPU, or memory in a server.
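A sketch of the rack rule from this example; the 0/1 evaluation values are assumptions consistent with the behaviour described (already-bound sibling tasks pin the rack, otherwise racks that can hold the whole job win), since the literal evaluation formula is not reproduced in this text.

```python
def rack_score(rack_holds_sibling_task, rack_can_hold_all_tasks):
    """Evaluation value for the rack dimension (weight w1)."""
    if rack_holds_sibling_task:
        return 1  # stay in the rack where the job's tasks already run
    return 1 if rack_can_hold_all_tasks else 0
```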
  • S832: Evaluate the network transmission performance scores of candidate nodes in the affinity dimension between the parameter node task (PS) and the work node task (worker), that is, the PS-worker affinity dimension; the goal of evaluation in this dimension is to improve the network transmission bandwidth between work nodes by placing them centrally, while avoiding concentrating PS tasks on the same node, which would make the PS a bottleneck.
  • the parameter node PS and the work node worker mentioned above may refer to different types of tasks.
  • if a node is the parameter server 310, it maintains the parameters of the model, updates them during iterative training, and distributes the parameters to different devices to update the model; that node is a PS. If a node is node 320, or a GPU in node 330 is used to perform iteration on a certain batch of data, that node is a worker. If a node is neither a PS nor a worker, the node is a resource that can be used for task scheduling.
  • the weight value w2 of this dimension can be 1000, and the evaluation value is obtained in the following way:
  • S833: Evaluate the network transmission performance scores of candidate nodes in the cross-node dimension; the goal of evaluation in this dimension is to evaluate how the tasks that have already been allocated resources occupy the bandwidth between nodes.
  • the weight value w3 of this dimension can be 100, and the evaluation value is obtained in the following way:
  • if the job must be placed across nodes, the evaluation value prefers nodes with fewer cross-node tasks.
  • otherwise, the job is preferentially allocated to nodes with a larger number of network transmission connections (also called network transmission load), because when the number of tasks included in the job is 1, or the remaining resources in a single node can meet the resource requirements of the job, the job does not need to occupy network bandwidth for cross-node transmission and can therefore be allocated to nodes that already have a large number of network transmission connections.
  • for example, the number of network connections of cross-node distributed-training jobs recorded on the node may be used.
  • the smoothed value of the real-time bandwidth used may refer to the bandwidth load at a given moment; or it may refer to the bandwidth load obtained by smoothing the bandwidth used at multiple moments within a preset time period, where smoothing may refer to data processing methods such as taking the average, the maximum, or the minimum.
  • the AI training job mentioned above can also be another type of job that requires network transmission.
  • the network transmission needs of a job can be identified automatically, or a configuration file describing the job's network connections can be submitted manually.
  • scheduling is then performed through the network-transmission-load-aware scheduling mechanism of the above embodiments of this application.
  • S834: Evaluate the network transmission performance scores of candidate nodes in the large-scale-task dimension; the goal of evaluation in this dimension is to keep completely idle hardware resources in reserve as far as possible, so that large-scale tasks can be placed and resource fragmentation is avoided.
  • the hardware resources include GPU and CPU.
  • the weight value w4 of this dimension can be 10, and the evaluation value is obtained in the following way:
  • the foregoing GPU allocation rate may refer to the amount of GPU resources that have already been allocated to tasks; a GPU allocation rate of 0 means that all GPUs on the node are completely idle.
  • S835: Evaluate the GPU quantity dimension; the goal of evaluation in this dimension is to maximize the placement possibility of large-scale GPU tasks, giving priority to filling tasks into nodes with few remaining GPU resources.
  • that is, the hardware resource dimension evaluates the network transmission performance scores of candidate nodes; the goal of evaluation in this dimension is to reduce resource fragmentation and maximize the possibility of placing tasks that require large-scale hardware resources, so candidate nodes with fewer remaining hardware resources are filled with tasks first.
  • the weight value w5 of this dimension can be 1, and the evaluation value is obtained in the following way:
  • GPU_allocated can represent the number of GPUs already occupied in the node; GPU_total can represent the total number of GPUs in the node.
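Given the two quantities above, one natural reading of the evaluation value is the allocation ratio, so that nodes with few remaining GPUs are filled first; the ratio form is an assumption, since the literal formula is not reproduced here.

```python
def gpu_quantity_score(gpu_allocated, gpu_total):
    """Evaluation value for the GPU quantity dimension (weight w5):
    higher for nodes with fewer remaining GPUs, reducing fragmentation."""
    return gpu_allocated / gpu_total if gpu_total else 0.0
```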
  • steps S834 and S835 may refer to the same dimension.
  • in step S834 and step S835, the network transmission performance score of a candidate node is evaluated through the node's idle capacity, so that completely idle hardware resources are kept in reserve as far as possible for placing large-scale tasks, avoiding resource fragmentation.
  • the description above uses the convention that the larger the evaluation value of a node, the higher its priority and the more preferentially the node is selected for task placement; similarly, the evaluation could be defined so that the smaller the value, the higher the priority.
  • the weight values w1 to w5 may be thresholds preset according to user needs; the weight values of the different evaluation dimensions can be set according to their priority. For example, if the rack dimension has the highest priority among the evaluation dimensions, the weight w1 corresponding to the rack dimension can be configured as the maximum of w1 to w5.
  • the job is delivered to the corresponding target node.
  • the network transmission connection count of each node, node.num_cross_nodes_job, is updated, and the job is started.
  • the weights of the dimensions can be adjusted while tuning the above job scheduling method to meet the overall goal; that is, the evaluations of the different dimensions are mainly based on network transmission bandwidth and on avoiding resource fragmentation. Considering network transmission bandwidth can improve the running efficiency of AI jobs; avoiding resource fragmentation takes task placement into account, so that large blocks of resources remain available for the subsequent placement of large-scale tasks and the overall utilization of resources improves.
  • FIG. 8 above evaluates candidate nodes in parallel across multiple dimensions; similarly, the different dimensions above can also be used to evaluate candidate nodes in a serial manner.
  • the process of evaluating candidate nodes with the different dimensions in a serial manner is described in detail below.
  • FIG. 9 is a schematic flowchart of a job scheduling method provided by an embodiment of the present application. Wherein, the method includes steps S901 to S911, and these steps are respectively described in detail below.
  • S901: Parse all tasks included in the job, and select nodes for these tasks in turn.
  • the above-mentioned job may refer to an AI training job, or may also refer to other jobs that have network transmission requirements at runtime.
  • the scheduler may obtain a job from the job queue for scheduling according to a certain rule, and the rule may be the dominant resource fairness (DRF) algorithm or another algorithm.
  • the scheduler parses all tasks included in the job, and schedules each task in turn, selecting an appropriate node for binding, and the bound node is used to execute the task.
  • S902: Screen out eligible nodes, for example according to the GPU type included in the node.
  • the eligible nodes can be filtered out according to port filtering of the node, label matching of the node, and the like.
  • the port filtering of a node may mean that the job can be run on a node other than a certain port number; the label matching of a node may mean that the node that runs the target job is selected according to the IP address range.
  • the method for preselecting nodes in step S902 above may adopt a common scheduler method in the prior art, which is not limited here.
  • the second-level switch may refer to a top-of-rack switch, and the servers (also referred to as nodes) under multiple top-of-rack switches can be networked together.
  • evaluate the rack dimension; the goal of evaluation in this dimension is to place the multiple tasks included in a single job in the same rack as far as possible, thereby improving network transmission efficiency.
  • the racks can be sorted according to certain rules, and the nodes managed by each rack then traversed in order.
  • the sorting rule for racks can be: if none of the tasks included in a job has been allocated resources, consider whether the nodes managed by a rack can hold the job, that is, racks whose nodes can accommodate all tasks included in the job are sorted higher, and otherwise lower; if some of the tasks included in a job have completed resource allocation, the racks to which the nodes holding those tasks belong are sorted higher, and otherwise lower.
  • step S904 reference may be made to step S831 shown in FIG. 8, and details are not described herein again.
  • S905: Evaluate the affinity dimension between the parameter node task (PS) and the work node task (worker), that is, the PS-worker affinity dimension; the goal of evaluation in this dimension is to increase the network transmission bandwidth between work nodes by placing them centrally, while preventing PS tasks from being concentrated on the same node, which would make the PS a bottleneck.
  • the parameter node PS and the work node worker may refer to different types of tasks.
  • if a certain node is the parameter node 310, it maintains the parameters of the model, updates them during iterative training, and distributes the parameters to different devices to update the model; that node is a PS. If a node is node 320, or a GPU in node 330 is used to perform iteration on a certain batch of data, that node is a worker. If a node is neither a PS nor a worker, the node is a resource that can be used for task scheduling.
  • affinity means that if two applications, application A and application B, interact frequently, it is necessary to use affinity to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication.
  • the nodes belonging to the sorted racks can be traversed in sequence, and the nodes sorted into sets K1, K2, and K3 according to the affinity rule, as sketched below.
  • sorting the nodes according to the affinity rule can mean: if a worker-type task included in the job has been placed on a certain node, that node is put into the K1 set; if a PS-type task included in the job has been placed on a certain node, that node is put into the K2 set; all other nodes are put into the K3 set.
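A minimal sketch of this partitioning rule; the Node class and its tasks field are illustrative assumptions used only to make the example self-contained.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    tasks: list = field(default_factory=list)  # (job_id, "worker"|"ps") pairs

def partition_by_affinity(nodes, job_id):
    """Split candidate nodes into the K1/K2/K3 sets described above."""
    k1, k2, k3 = [], [], []
    for node in nodes:
        types = {t for j, t in node.tasks if j == job_id}
        if "worker" in types:
            k1.append(node)   # already hosts a worker of this job
        elif "ps" in types:
            k2.append(node)   # already hosts a PS of this job
        else:
            k3.append(node)   # hosts no task of this job yet
    return k1, k2, k3
```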
  • step S905 reference may be made to step S832 shown in FIG. 8, which will not be repeated here.
  • S906: Evaluate the cross-node network transmission load dimension; the goal of evaluation in this dimension is to evaluate how the tasks that have already been allocated resources occupy the bandwidth between nodes.
  • the nodes in Ki are traversed in sequence, and the nodes are divided into sets T1 and T2 according to whether the current node can place all tasks in the job.
  • nodes with the same load can be merged to form sets G1, G2, ..., Gn.
  • the process ends.
  • S907: Evaluate the cross-node network transmission load dimension; the goal of evaluation in this dimension is to evaluate how the tasks that have already been allocated resources occupy the bandwidth between nodes.
  • the nodes in Ti can be traversed in sequence and sorted according to the network transmission load on the current node, for example the number of cross-node jobs; in addition, nodes with the same load can be merged to form sets G1, G2, ..., Gn.
  • each node can be evaluated in each dimension separately, that is, node 1 to node 3 can be evaluated in turn in the various dimensions; alternatively, node 1 and node 2, which have the same number of network transmission connections, can be merged, nodes with the same load merged into a set, and the set then evaluated as a whole; evaluating a whole set can improve the accuracy of the evaluation. A sketch of the sort-and-merge step follows.
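The sort-and-merge step can be sketched as follows, assuming each node exposes a cross-node job count as its network transmission load (the attribute name num_cross_node_jobs is hypothetical):

```python
from itertools import groupby

def group_by_load(nodes, load=lambda n: n.num_cross_node_jobs):
    """Sort nodes by network transmission load, then merge equal-load
    nodes into the sets G1, G2, ..., Gn for joint evaluation."""
    ordered = sorted(nodes, key=load)
    return [list(group) for _, group in groupby(ordered, key=load)]
```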
  • if the number of nodes in a certain Ti is 0, return to step S906.
  • for example, T1 includes 5 nodes.
  • step S906 and step S907 reference may be made to step S833 shown in FIG. 8, and details are not described herein again.
  • S908: Evaluate the large-scale-task dimension; the goal of evaluation in this dimension is to keep completely idle GPU hosts as far as possible, so that large-scale tasks can be placed and resource fragmentation is avoided.
  • the nodes in Gi can be traversed in sequence and sorted according to the number of GPUs the current node has already allocated.
  • step S908 can refer to step S834 and step S835 shown in FIG. 8, which will not be repeated here.
  • the job is delivered to the corresponding node.
  • the number of network transmission connections for each node is updated, and the job is started.
  • the job scheduling method shown in FIG. 8 makes multi-dimensional judgments for every candidate node in the candidate node set; the job scheduling method shown in FIG. 9 selects a first part of candidate nodes according to the first dimension, then selects a subset of the first part, that is, a second part of candidate nodes, according to the second dimension, and then selects a subset of the second part, that is, a third part of nodes, according to the third selection dimension.
  • that is, the multiple selection dimensions described above are traversed in sequence.
  • the weights of the dimensions can be adjusted while tuning the above job scheduling method to meet the overall goal; that is, the evaluations of the different dimensions are mainly based on network transmission bandwidth and on avoiding resource fragmentation. Considering network transmission bandwidth can improve the running efficiency of AI jobs; avoiding resource fragmentation takes task placement into account, so that large blocks of resources remain available for the subsequent placement of large-scale tasks and the overall utilization of resources improves.
  • the job scheduling method of the embodiments of this application is described in detail above with reference to FIG. 1 to FIG. 9, and the apparatus embodiments of this application are described in detail below with reference to FIG. 10 and FIG. 11. It should be understood that the job scheduling apparatus in the embodiments of this application can execute the various job scheduling methods of the foregoing embodiments of this application; that is, for the specific working processes of the following products, reference may be made to the corresponding processes in the foregoing method embodiments.
  • FIG. 10 is a schematic block diagram of a job scheduling apparatus 1000 provided by an embodiment of the present application.
  • the job scheduling apparatus 1000 can execute each step in the job scheduling method shown in FIG. 7 to FIG. 9, and in order to avoid repetition, it will not be described in detail here.
  • the job scheduling device 1000 includes: a receiving unit 1010 and a processing unit 1020.
  • the receiving unit 1010 is configured to receive a target job, the target job including n tasks; the processing unit 1020 is configured to perform node screening in the node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes, and to select, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task.
  • the target node is used to process the m-th task, and the network transmission performance score is determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness; n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
  • optionally, the higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and the processing unit 1020 is specifically configured to: judge whether the n tasks can all be placed in the rack where a candidate node in the m-th candidate node set is located; if so, add points to the network transmission performance score of the candidate node; if not, deduct points from the network transmission performance score of the candidate node.
  • the higher the affinity between the n tasks, the higher the network transmission performance score, and the processing unit 1020 is specifically configured to: confirm the type of the m-th task;
  • when the type of the m-th task is a work node task, judge whether the candidate node needs to place other work node tasks or parameter node tasks among the n tasks; if so, add points to the network transmission performance score of the candidate node;
  • when the type of the m-th task is a parameter node task, judge whether the candidate node in the m-th candidate node set needs to place a work node task among the n tasks; if so, add points to the network transmission performance score of the candidate node; and judge whether the candidate node in the m-th candidate node set needs to place other parameter node tasks among the n tasks; if so, deduct points from the network transmission performance score of the candidate node.
  • optionally, the processing unit 1020 is specifically configured to: confirm the cross-node count of a candidate node in the m-th candidate node set when it processes other jobs in the running state; when the n tasks can all be placed on the candidate node, the larger the cross-node count, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the cross-node count, the smaller the bonus range; when the n tasks cannot all be placed on the candidate node, the larger the cross-node count, the smaller the bonus range, and the smaller the cross-node count, the larger the bonus range.
  • optionally, the smaller the node idleness, the higher the network transmission performance score, and the processing unit 1020 is specifically configured to: judge whether the hardware resources for job training of a candidate node in the m-th candidate node set are in use; if so, add points to the network transmission performance score of the candidate node.
  • optionally, the processing unit 1020 is further configured to: confirm the allocation rate of the hardware resources for job training among the candidate nodes in the m-th candidate node set;
  • add points to the network transmission performance score of the candidate node according to the allocation rate: the larger the allocation rate, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the allocation rate, the smaller the bonus range.
  • optionally, each task of the target job carries a hardware resource requirement, and the processing unit 1020 is specifically configured to:
  • perform node screening in the node cluster according to the hardware resource requirement carried by each task, to obtain the n candidate node sets, where the hardware resources of each candidate node set among the n candidate node sets match the hardware resource requirement carried by the corresponding task.
  • the target job includes a training job of an artificial intelligence model.
  • job scheduling device 1000 here is embodied in the form of a functional unit.
  • unit herein can be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit, or a combination of the two that realizes the above-mentioned functions.
  • the hardware circuit may include an application-specific integrated circuit (ASIC), an electronic circuit, a processor for executing one or more software or firmware programs (such as a shared processor, a dedicated processor, or a group processor) and memory, a merged logic circuit, and/or other suitable components that support the described functions.
  • the units of the examples described in the embodiments of this application can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
  • FIG. 11 is a schematic diagram of the hardware structure of a job scheduling device according to an embodiment of the present application.
  • the job scheduling apparatus 1100 shown in FIG. 11 may include a memory 1101, a processor 1102, a communication interface 1103, and a bus 1104. Among them, the memory 1101, the processor 1102, and the communication interface 1103 implement communication connections between each other through the bus 1104.
  • the memory 1101 may be a read-only memory (ROM), a static storage device, or a random access memory (RAM).
  • the memory 1101 may store a program.
  • the processor 1102 and the communication interface 1103 are used to execute the steps of the job scheduling method of the embodiments of this application, for example the steps of the job scheduling method shown in FIG. 7 to FIG. 9.
  • the processor 1102 may adopt a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits to execute related programs, so as to implement the functions required to be executed by the units in the job scheduling apparatus shown in FIG. 10 of the embodiments of this application, or to execute the job scheduling method of the method embodiments of this application.
  • the processor 1102 may also be an integrated circuit chip with signal processing capabilities.
  • each step of the job scheduling method in the embodiment of the present application can be completed by an integrated logic circuit of hardware in the processor 1102 or instructions in the form of software.
  • the aforementioned processor 1102 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1101; the processor 1102 reads the information in the memory 1101 and, in combination with its hardware, completes the functions required by the units included in the job scheduling apparatus of the embodiments of this application, or performs the job scheduling method of the method embodiments of this application.
  • the processor 1102 may correspond to the processing unit 1020 in the job scheduling apparatus shown in FIG. 10.
  • the communication interface 1103 uses a transceiving apparatus, such as but not limited to a transceiver, to implement communication between the job scheduling apparatus 1100 and other devices or a communication network.
  • the communication interface 1103 may correspond to the receiving unit 1010 in the job scheduling apparatus 1000 shown in FIG. 10, and the resource request of the target job may be received through the communication interface 1103.
  • the bus 1104 may include a path for transferring information between various components of the job scheduling apparatus 1100 (for example, the memory 1101, the processor 1102, and the communication interface 1103).
  • although the job scheduling apparatus 1100 shown only includes a memory, a processor, and a communication interface, a person skilled in the art should understand that, in a specific implementation process, the job scheduling apparatus 1100 may also include other devices necessary for normal operation; according to specific needs, it may also include hardware devices implementing other additional functions; and it may also include only the components necessary to implement the embodiments of this application, not necessarily all the components shown in FIG. 11.
  • the embodiment of the present application also provides a chip, which includes a transceiver unit and a processing unit.
  • the transceiver unit may be an input/output circuit or a communication interface;
  • the processing unit may be a processor, a microprocessor, or an integrated circuit integrated on the chip;
  • the chip may execute the job scheduling method in the above method embodiment.
  • the embodiments of this application also provide a computer-readable storage medium on which instructions are stored; when the instructions are executed, the job scheduling method in the foregoing method embodiments is performed.
  • the embodiments of the present application also provide a computer program product containing instructions that, when executed, execute the job scheduling method in the foregoing method embodiments.
  • the processor in the embodiments of this application may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • by way of example but not limitation, many forms of RAM are available, for example: static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive.
  • "at least one" refers to one or more, and "multiple" refers to two or more.
  • "at least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
  • for example, at least one of a, b, or c can mean: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c can each be single or multiple.
  • the size of the sequence numbers of the above processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of this application.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative, for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or It can be integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a computer-readable storage medium.
  • based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.

Abstract

This application provides a job scheduling method and a job scheduling apparatus. The method includes: receiving a target job, the target job including n tasks; performing node screening in a node cluster according to the n tasks of the target job to obtain n candidate node sets, each candidate node set including multiple candidate nodes; and selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n. The technical solution of this application can shorten the running time of the target job and improve the running efficiency of the target job.

Description

Job scheduling method and job scheduling apparatus
This application claims priority to the Chinese patent application No. 201911253271.7, entitled "Task scheduling method and system", filed with the Chinese Patent Office on December 09, 2019, and to the Chinese patent application No. 202010407994.4, entitled "Job scheduling method and job scheduling apparatus", filed with the Chinese Patent Office on May 14, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of network communication technologies, and more specifically, to a job scheduling method and a job scheduling apparatus.
Background
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
In recent years, deep learning has made breakthrough progress in fields such as images and speech, owing mainly to the availability of massive data, the continuous optimization of algorithms, and the continuous growth of computing power. Deep learning currently mainly refers to deep neural network models; as network models become more complex and the amount of data grows, the computation required for model training has become extremely large.
At present, distributed training is usually adopted to meet the timeliness requirements of jobs with network transmission requirements, for example AI training jobs. With distributed training, different jobs may compete for the same hardware resources; therefore, a scheduler is needed to schedule hardware resources for the different jobs of multiple users, so as to allocate suitable nodes (for example, servers) for running the tasks included in the jobs. Current schedulers usually allocate nodes with suitable hardware resources based on the hardware resource requirements of the tasks, while ignoring the network performance requirements of AI training jobs; for example, in AI training there are network transmission requirements among the multiple tasks of the same job, and the prior art ignores this part of the requirements, resulting in low running efficiency of AI training jobs.
Therefore, how to improve the running efficiency of jobs has become a problem that urgently needs to be solved.
Summary
This application provides a job scheduling method and a job scheduling apparatus, which can shorten the running time of a target job and improve the running efficiency of the target job.
According to a first aspect, a job scheduling method is provided, including: receiving a target job, the target job including n tasks; performing node screening in a node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
In the embodiments of this application, node screening is performed in the node cluster according to the n tasks of the target job to obtain n candidate node sets; from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness. When allocating resources to the target job, not only the demand information of the target job but also the network transmission performance among the multiple tasks of the same job can be considered, which can improve the network transmission speed of the target nodes when running the target job, thereby shortening the running time of the target job and improving the running efficiency of the target job.
Here, m is any positive integer from 1 to n. For example, the initial value of m may be set to 1 and then to 2, 3, 4, ..., n, so that the n tasks and the n candidate node sets are traversed through m, and n target nodes are respectively selected from the n candidate node sets.
With reference to the first aspect, in some implementations of the first aspect, the higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest performance score as the target node of the m-th task includes:
judging whether the n tasks can all be placed in the rack where a candidate node in the m-th candidate node set is located; if so, adding points to the performance score of the candidate node; if not, deducting points from the network transmission performance score of the candidate node.
It should be understood that the network transmission performance score of the candidate node may be determined according to the aggregation degree of the n tasks in the same rack; the goal of scoring in the dimension of aggregation in the same rack is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding cross-rack data transmission between tasks, which can effectively improve the network transmission efficiency of the job.
In the embodiments of this application, when scheduling the target job, that is, when allocating resources to the target job, the multiple tasks included in the target job can be placed as far as possible on one or more nodes managed by the same rack, thereby minimizing the network transmission bandwidth occupied across racks when running the target job, shortening the running time of the target job, and improving the running efficiency of the target job.
With reference to the first aspect, in some implementations of the first aspect, the higher the affinity between the n tasks, the higher the network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
confirming the type of the m-th task; when the type of the m-th task is a work node task, judging whether the candidate node needs to place other work node tasks or parameter node tasks among the n tasks; if so, adding points to the network transmission performance score of the candidate node;
when the type of the m-th task is a parameter node task, judging whether the candidate node in the m-th candidate node set needs to place a work node task among the n tasks; if so, adding points to the network transmission performance score of the candidate node; and judging whether the candidate node in the m-th candidate node set needs to place other parameter node tasks among the n tasks; if so, deducting points from the network transmission performance score of the candidate node.
It should be understood that the tasks include work node tasks and parameter node tasks; a work node task is used to perform the iterative operations of the neural network; the neural network model involves input parameters and output parameters, and the parameter node is used to manage the input parameters and output parameters of the work nodes.
The network transmission performance score of the candidate node may be determined according to the affinity between the different types of tasks among the n tasks. The goal of scoring by the affinity between the different types of tasks among the n tasks is to place the work node tasks and parameter node tasks of the same job on one node as far as possible, so as to ensure that the internal data transmission of the job occurs within the same node as much as possible; and, at the same time, to avoid concentrating multiple parameter node tasks of the same job on the same node, so that when that node fails, the multiple parameter node tasks are not all stopped, which would leave the input parameters and output parameters of the multiple work node tasks of the same job without effective management.
It should be noted that affinity means that if two applications, application A and application B, interact frequently, it is necessary to use affinity to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity, which means that when an application is deployed with multiple replicas, it is necessary to use anti-affinity to spread the application instances across the nodes to improve reliability. Therefore, affinity should be increased between work node tasks, and between work node tasks and parameter node tasks, so that these tasks are as close as possible, for example placed on the same node, while affinity between parameter node tasks should be reduced (that is, anti-affinity increased), so that parameter node tasks are placed on as many different nodes as possible.
With reference to the first aspect, in some implementations of the first aspect, selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
confirming the cross-node count of a candidate node in the m-th candidate node set when it processes other jobs in the running state; when the n tasks can all be placed on the candidate node, the larger the cross-node count, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the cross-node count, the smaller the bonus range;
when the n tasks cannot all be placed on the candidate node, the larger the cross-node count, the smaller the bonus range of the candidate node's network transmission performance score, and the smaller the cross-node count, the larger the bonus range.
It should be noted that, when scoring the performance of candidate nodes, the other jobs in the running state processed by the candidate nodes in the m-th candidate node set are considered; jobs that have finished running no longer occupy network transmission load and are therefore not considered.
It should be understood that the performance score of the candidate node may be determined according to the cross-node degree of the n tasks, where the goal of scoring by the cross-node degree of the n tasks is to take into account how the allocated jobs occupy the bandwidth between nodes.
In the embodiments of this application, when scheduling the target job, that is, when allocating resources to the target job, the occupation of inter-node transmission bandwidth by jobs that have already been allocated resources can be evaluated, so that when allocating resources to the target job, not only the demand information of the target job but also the network transmission information is considered, which can improve the network transmission performance when running the target job, thereby shortening the running time of the target job and improving the running efficiency of the target job.
When the n tasks can all be placed on one candidate node in the m-th candidate node set, a larger cross-node count of a candidate node indicates that the other jobs running on it frequently exchange data with other nodes; selecting this candidate node as the target node of the current task guarantees that, after the current task is assigned to the target node, the candidate node need not increase its number of interactions with other nodes. Therefore, increasing the bonus range of this candidate node's performance score ensures that it is preferentially selected as the target node. Conversely, a smaller cross-node count indicates that the candidate node's running jobs interact rarely with other nodes; reducing the bonus range of its performance score ensures that it is not preferentially selected as the target node.
When the n tasks cannot all be placed on a candidate node in the m-th candidate node set, a larger cross-node count indicates that the candidate node's running jobs frequently exchange data with other nodes; if this candidate node were selected as the target node of the current task, assigning the current task to it would make it further increase its number of interactions with other nodes, degrading the candidate node's network performance. Therefore, reducing the bonus range of its performance score ensures that it is not preferentially selected as the target node. Conversely, a smaller cross-node count indicates that the candidate node's running jobs interact rarely with other nodes; increasing the bonus range of its performance score ensures that it is preferentially selected as the target node, and after a task of the target job is assigned to it, its number of interactions with other nodes can be moderately increased, thereby optimizing allocation efficiency.
With reference to the first aspect, in some implementations of the first aspect, the cross-node degree of the n tasks is determined according to the number of different candidate nodes to which the n tasks are allocated.
For example, when sensing the network competition of cross-node jobs, the network transmission load of a node can be determined based on the number of cross-node connections.
With reference to the first aspect, in some implementations of the first aspect, the cross-node degree of the n tasks is determined by monitoring the real-time bandwidth used on the network.
In a possible implementation, the cross-node count of the n tasks may be obtained from a smoothed value of the real-time bandwidth used, obtained by monitoring the network.
Optionally, a monitoring system may be used to monitor the smoothed value of the real-time bandwidth used on the network links by the allocated jobs, denoted B; the current node is scored on this basis as score = 1 + 1/(B + 1). The larger the cross-node count, the more bandwidth is already occupied and the lower the score, so placing new jobs on that node should be avoided.
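The scoring formula above translates directly into code; only the sampling of B from the monitoring system is assumed here.

```python
def bandwidth_score(b):
    """score = 1 + 1/(B + 1), where B is the smoothed real-time bandwidth
    already used by allocated jobs on this node's links: the more
    bandwidth occupied, the lower the score."""
    return 1 + 1 / (b + 1)
```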
For example, a data packet can be obtained, and the task ID corresponding to the data packet determined by checking the IP address of the packet; from the task ID it can be determined whether the corresponding job is running. The more jobs are running, the larger the real-time network bandwidth in use, indicating a larger cross-node degree of the n tasks.
Exemplarily, the smoothed value of the real-time bandwidth used may refer to the bandwidth load at a given moment, or it may refer to the bandwidth load obtained by smoothing the bandwidth used at multiple moments within a preset time period, where smoothing may refer to data processing methods such as taking the average, the maximum, or the minimum.
With reference to the first aspect, in some implementations of the first aspect, the smaller the node idleness, the higher the network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
judging whether the hardware resources for job training of a candidate node in the m-th candidate node set are in use; if so, adding points to the network transmission performance score of the candidate node.
It should be understood that the performance score of the candidate node may be determined by node idleness, where the goal of scoring by node idleness is to keep nodes whose hardware resources for job training are completely idle in reserve, so as to cope with subsequent large-scale tasks, allowing large-scale tasks to be placed on the same node as far as possible and avoiding resource fragmentation. Therefore, by adding points to the performance score of a candidate node whose hardware resources for job training are in use, that candidate node is preferentially selected as the target node, while candidate nodes whose hardware resources for job training are unused are not preferentially selected as target nodes; candidate nodes with unused training hardware thus remain idle, and candidate nodes whose training hardware is already in use become fully used, which avoids resource fragmentation.
Optionally, the hardware resources include a graphics processing unit and a central processing unit.
With reference to the first aspect, in some implementations of the first aspect, selecting, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task further includes:
confirming the allocation rate of the hardware resources for job training among the candidate nodes in the m-th candidate node set; and adding points to the network transmission performance score of the candidate node according to the allocation rate, where the larger the allocation rate, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the allocation rate, the smaller the bonus range.
When it is confirmed that the hardware resources for job training of a candidate node are already in use, the degree of hardware resource usage, that is, the allocation rate of the hardware resources, is further judged. The higher the allocation rate, the more fully the candidate node's hardware resources are used; in this case it is desirable to assign the task to this candidate node so that it can make full use of its own hardware resources, so the bonus range of its performance score is increased; conversely, the bonus range of its performance score is reduced.
With reference to the first aspect, in some implementations of the first aspect, each task of the target job carries a hardware resource requirement, and performing node screening in the node cluster according to the n tasks of the target job to obtain the n candidate node sets includes:
performing node screening in the node cluster according to the hardware resource requirement carried by each task, to obtain the n candidate node sets, where the hardware resources of each candidate node set among the n candidate node sets match the hardware resource requirement carried by the corresponding task.
With reference to the first aspect, in some implementations of the first aspect, the target job includes a training job of an artificial intelligence model.
It should be understood that the target job refers to a job that has network transmission load requirements at runtime; the target job may be a training job of an artificial intelligence model, or it may be another job; this application does not limit this in any way.
According to a second aspect, a job scheduling apparatus is provided, including: a receiving unit, configured to receive a target job, the target job including n tasks, and to perform node screening in a node cluster according to the n tasks of the target job to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and a processing unit, configured to select, from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one of, or any combination of, the aggregation degree of the n tasks in the same rack, the affinity between the n tasks, the cross-node degree of the n tasks, and node idleness, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
In the embodiments of this application, node screening is performed in the node cluster according to the n tasks of the target job to obtain n candidate node sets; from the m-th candidate node set corresponding to the m-th task among the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task. When allocating resources to the target job, not only the demand information of the target job but also the network transmission performance among the multiple tasks of the same job can be considered, which can improve the network transmission speed of the target nodes when running the target job, thereby shortening the running time of the target job and improving the running efficiency of the target job.
Here, m is any positive integer from 1 to n. For example, the initial value of m may be set to 1 and then to 2, 3, 4, ..., n, so that the n tasks and the n candidate node sets are traversed through m, and n target nodes are respectively selected from the n candidate node sets.
With reference to the second aspect, in some implementations of the second aspect, the higher the aggregation degree of the n tasks in the same rack, the higher the network transmission performance score, and the processing unit is specifically configured to:
judge whether the n tasks can all be placed in the rack where a candidate node in the m-th candidate node set is located; if so, add points to the network transmission performance score of the candidate node; if not, deduct points from the network transmission performance score of the candidate node.
It should be understood that the network transmission performance score of the candidate node may be determined according to the aggregation degree of the n tasks in the same rack; the goal of scoring in this dimension is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding cross-rack data transmission between tasks, which can effectively improve the network transmission efficiency of the job.
In the embodiments of this application, when scheduling the target job, that is, when allocating resources to the target job, the multiple tasks included in the target job can be placed as far as possible on one or more nodes managed by the same rack, thereby minimizing the network transmission bandwidth occupied across racks when running the target job, shortening the running time of the target job, and improving the running efficiency of the target job.
With reference to the second aspect, in some implementations of the second aspect, the higher the affinity between the n tasks, the higher the network transmission performance score, and the processing unit is specifically configured to:
confirm the type of the m-th task; when the type of the m-th task is a work node task, judge whether the candidate node needs to place other work node tasks or parameter node tasks among the n tasks; if so, add points to the network transmission performance score of the candidate node;
when the type of the m-th task is a parameter node task, judge whether the candidate node in the m-th candidate node set needs to place a work node task among the n tasks; if so, add points to the network transmission performance score of the candidate node; and judge whether the candidate node in the m-th candidate node set needs to place other parameter node tasks among the n tasks; if so, deduct points from the network transmission performance score of the candidate node.
It should be understood that the tasks include work node tasks and parameter node tasks; a work node task is used to perform the iterative operations of the neural network; the neural network model involves input parameters and output parameters, and the parameter node is used to manage the input parameters and output parameters of the work nodes.
The network transmission performance score of the candidate node may be determined according to the affinity between the different types of tasks among the n tasks. The goal of scoring by this affinity is to place the work node tasks and parameter node tasks of the same job on one node as far as possible, so as to ensure that the internal data transmission of the job occurs within the same node as much as possible; and, at the same time, to avoid concentrating multiple parameter node tasks of the same job on the same node, so that when that node fails, the multiple parameter node tasks are not all stopped, which would leave the input parameters and output parameters of the multiple work node tasks of the same job without effective management.
It should be noted that affinity means that if two applications, application A and application B, interact frequently, it is necessary to use affinity to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity, which means that when an application is deployed with multiple replicas, it is necessary to use anti-affinity to spread the application instances across the nodes to improve reliability. Therefore, affinity should be increased between work node tasks, and between work node tasks and parameter node tasks, so that these tasks are as close as possible, for example placed on the same node, while affinity between parameter node tasks should be reduced (that is, anti-affinity increased), so that parameter node tasks are placed on as many different nodes as possible.
With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to:
confirm the cross-node count of a candidate node in the m-th candidate node set when it processes other jobs in the running state; when the n tasks can all be placed on the candidate node, the larger the cross-node count, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the cross-node count, the smaller the bonus range;
when the n tasks cannot all be placed on the candidate node, the larger the cross-node count, the smaller the bonus range of the candidate node's network transmission performance score, and the smaller the cross-node count, the larger the bonus range.
It should be noted that, when scoring the performance of candidate nodes, the other jobs in the running state processed by the candidate nodes in the m-th candidate node set are considered; jobs that have finished running no longer occupy network transmission load and are therefore not considered.
It should be understood that the performance score of the candidate node may be determined according to the cross-node degree of the n tasks, where the goal of scoring by the cross-node degree of the n tasks is to take into account how the allocated jobs occupy the bandwidth between nodes.
In the embodiments of this application, when scheduling the target job, that is, when allocating resources to the target job, the occupation of inter-node transmission bandwidth by jobs that have already been allocated resources can be evaluated, so that when allocating resources to the target job, not only the demand information of the target job but also the network transmission information is considered, which can improve the network transmission performance when running the target job, thereby shortening the running time of the target job and improving the running efficiency of the target job.
When the n tasks can all be placed on one candidate node in the m-th candidate node set, a larger cross-node count of a candidate node indicates that the other jobs running on it frequently exchange data with other nodes; selecting this candidate node as the target node of the current task guarantees that, after the current task is assigned to the target node, the candidate node need not increase its number of interactions with other nodes. Therefore, increasing the bonus range of this candidate node's performance score ensures that it is preferentially selected as the target node; conversely, a smaller cross-node count indicates that the candidate node's running jobs interact rarely with other nodes, so reducing the bonus range of its performance score ensures that it is not preferentially selected as the target node.
When the n tasks cannot all be placed on a candidate node in the m-th candidate node set, a larger cross-node count indicates that the candidate node's running jobs frequently exchange data with other nodes; if this candidate node were selected as the target node of the current task, assigning the current task to it would make it further increase its number of interactions with other nodes, degrading its network performance. Therefore, reducing the bonus range of its performance score ensures that it is not preferentially selected as the target node; conversely, a smaller cross-node count indicates that the candidate node's running jobs interact rarely with other nodes, and increasing the bonus range of its performance score ensures that it is preferentially selected as the target node. After a task of the target job is assigned to it, its number of interactions with other nodes can be moderately increased, thereby optimizing allocation efficiency.
With reference to the second aspect, in some implementations of the second aspect, the cross-node degree of the n tasks is determined according to the number of different candidate nodes to which the n tasks are allocated.
For example, when sensing the network competition of cross-node jobs, the network transmission load of a node can be determined based on the number of cross-node connections.
With reference to the second aspect, in some implementations of the second aspect, the cross-node degree of the n tasks is determined by monitoring the real-time bandwidth used on the network.
In a possible implementation, the cross-node count of the n tasks may be obtained from a smoothed value of the real-time bandwidth used, obtained by monitoring the network.
Optionally, a monitoring system may be used to monitor the smoothed value of the real-time bandwidth used on the network links by the allocated jobs, denoted B; the current node is scored on this basis as score = 1 + 1/(B + 1). The larger the cross-node count, the more bandwidth is already occupied and the lower the score, so placing new jobs on that node should be avoided.
For example, a data packet can be obtained, and the task ID corresponding to the data packet determined by checking the IP address of the packet; from the task ID it can be determined whether the corresponding job is running. The more jobs are running, the larger the real-time network bandwidth in use, indicating a larger cross-node degree of the n tasks.
Exemplarily, the smoothed value of the real-time bandwidth used may refer to the bandwidth load at a given moment, or it may refer to the bandwidth load obtained by smoothing the bandwidth used at multiple moments within a preset time period, where smoothing may refer to data processing methods such as taking the average, the maximum, or the minimum.
With reference to the second aspect, in some implementations of the second aspect, the smaller the node idleness, the higher the network transmission performance score, and the processing unit is specifically configured to:
judge whether the hardware resources for job training of a candidate node in the m-th candidate node set are in use; if so, add points to the network transmission performance score of the candidate node.
It should be understood that the performance score of the candidate node may be determined by node idleness, where the goal of scoring by node idleness is to keep nodes whose hardware resources for job training are completely idle in reserve, so as to cope with subsequent large-scale tasks, allowing large-scale tasks to be placed on the same node as far as possible and avoiding resource fragmentation. Therefore, by adding points to the performance score of a candidate node whose hardware resources for job training are in use, that candidate node is preferentially selected as the target node, while candidate nodes whose hardware resources for job training are unused are not preferentially selected as target nodes; candidate nodes with unused training hardware thus remain idle, and candidate nodes whose training hardware is already in use become fully used, which avoids resource fragmentation.
Optionally, the hardware resources include a graphics processing unit and a central processing unit.
With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to:
confirm the allocation rate of the hardware resources for job training among the candidate nodes in the m-th candidate node set; and add points to the network transmission performance score of the candidate node according to the allocation rate, where the larger the allocation rate, the larger the bonus range of the candidate node's network transmission performance score, and the smaller the allocation rate, the smaller the bonus range.
When it is confirmed that the hardware resources for job training of a candidate node are already in use, the degree of hardware resource usage, that is, the allocation rate of the hardware resources, is further judged. The higher the allocation rate, the more fully the candidate node's hardware resources are used; in this case it is desirable to assign the task to this candidate node so that it can make full use of its own hardware resources, so the bonus range of its performance score is increased; conversely, the bonus range of its performance score is reduced.
结合第二方面,在第二方面的某些实现方式中,该目标作业的每个任务均携带有硬件资源需求,该处理单元具体用于:
根据每个任务携带的硬件资源需求,分别在该节点集群中进行节点筛选,得到该n个候选节点集合,其中,该n个候选节点集合中每个候选节点集合的硬件资源分别与对应的 任务携带的硬件资源需求匹配。
结合第二方面,在第二方面的某些实现方式中,该目标作业包括人工智能模型的训练作业。
应理解,上述目标作业是指运行时具备网络传输负载需求的作业;目标作业可以是指人工智能模型的训练作业,或者也可以是指其它作业;本申请对此不作任何限定。
According to a third aspect, a job scheduling apparatus is provided, including: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory. When the program stored in the memory is executed, the processor is configured to: receive a target job, where the target job includes n tasks; perform node screening in a node cluster separately according to the n tasks of the target job, to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and select, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between the n tasks, the cross-node degree of the n tasks, and the node idleness degree, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
In a possible implementation, the processor included in the job scheduling apparatus is further configured to perform the method in the first aspect or any implementation of the first aspect.
It should be understood that the extensions, limitations, explanations, and descriptions of related content in the first aspect also apply to the same content in the third aspect.
In the embodiments of this application, node screening is performed in the node cluster separately according to the n tasks of the target job to obtain n candidate node sets; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between the n tasks, the cross-node degree of the n tasks, and the node idleness degree. In the embodiments of this application, when resources are allocated to the target job, not only the requirement information of the target job but also the network transmission performance of the multiple tasks of the same job can be considered, which improves the network transmission speed of the target nodes when running the target job, shortens the running time of the target job, and improves its running efficiency.
According to a fourth aspect, a computer storage medium is provided. The computer storage medium stores program code, and the program code includes instructions for performing the steps of the job scheduling method in the first aspect or any implementation of the first aspect.
The storage medium may specifically be a non-volatile storage medium.
According to a fifth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the job scheduling method in the first aspect or any implementation of the first aspect.
Optionally, in an implementation, the chip may further include a memory, the memory stores instructions, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to perform the job scheduling method in the first aspect or any implementation of the first aspect.
Brief Description of Drawings
FIG. 1 is a schematic diagram of a typical fully connected network model according to an embodiment of this application;
FIG. 2 is a schematic diagram of a training procedure of a neural network model according to an embodiment of this application;
FIG. 3 is a schematic diagram of distributed training in the parameter node mode according to an embodiment of this application;
FIG. 4 is a schematic diagram of distributed training in the decentralized parameter synchronization mode according to an embodiment of this application;
FIG. 5 is a schematic diagram of a system architecture for AI training according to an embodiment of this application;
FIG. 6 is a schematic diagram of a physical architecture for AI training jobs according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a job scheduling method according to an embodiment of this application;
FIG. 8 is a schematic flowchart of a job scheduling method according to an embodiment of this application;
FIG. 9 is a schematic flowchart of a job scheduling method according to an embodiment of this application;
FIG. 10 is a schematic diagram of a job scheduling apparatus according to an embodiment of this application;
FIG. 11 is a schematic diagram of a job scheduling apparatus according to an embodiment of this application.
Description of Embodiments
The following describes the technical solutions in the embodiments of this application with reference to the accompanying drawings. Clearly, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
It should be understood that, in the embodiments of this application, "first", "second", "third", "fourth", and so on are merely used to refer to different objects and do not impose other limitations on the referenced objects.
Because the embodiments of this application involve many technical terms, for ease of understanding, the related terms and concepts that may be involved are introduced first.
1. Deep neural network
A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Divided by the positions of the layers, the layers inside a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron at layer i is connected to any neuron at layer i+1.
For example, FIG. 1 shows a typical fully connected network model, including an input layer 110, a hidden layer 120, a hidden layer 130, and an output layer 140. Data flows in from the input layer 110 and is computed layer by layer until the result is obtained at the output layer 140. Each intermediate layer holds several parameters and computes its output from the input of the previous layer. The model parameters need to be fitted on a large amount of data through model training to achieve the best model effect.
2. Training procedure of a neural network model
As an example, FIG. 2 is a schematic diagram of the training procedure of a neural network model according to an embodiment of this application. The training procedure includes steps S210 to S280, which are described in detail below.
S210. Load the network model for the first time.
S220. Input training data into the network model.
S230. Initialize the parameters of the network model based on the training data.
S240. Forward propagation.
The forward propagation algorithm performs a series of linear operations and activation operations using several weight matrices W, bias vectors b, and the input value vector x; that is, computation proceeds layer by layer from the input layer until the output layer is reached and the output result is obtained.
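As an illustration only and not part of the original disclosure, the following Python sketch shows such a layer-by-layer computation; the ReLU activation and the toy layer sizes are assumptions introduced for the example.

```python
import numpy as np

def forward(x, weights, biases):
    """Layer-by-layer forward pass: z = W @ a + b, a = activation(z).

    The ReLU activation and the sizes below are illustrative choices;
    the text above only requires linear operations plus activations.
    """
    a = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = w @ a + b
        # ReLU on hidden layers; the last layer is left linear here.
        a = np.maximum(z, 0.0) if i < len(weights) - 1 else z
    return a

# A toy network: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
print(forward(np.ones(3), ws, bs))
```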
S250. Compute the loss based on the result.
For example, in the process of training a deep neural network, because it is hoped that the output of the deep neural network is as close as possible to the value that is actually to be predicted, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network is then updated based on the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to predict a lower value, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it.
Therefore, "how to compare the difference between the predicted value and the target value" needs to be predefined. This is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
S260. Back propagation.
For example, a neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss and aims to obtain the optimal parameters of the neural network model, for example, the weight matrices.
S270. Continuously update the network model parameters.
S280. Save the parameters or weights of the network model.
Because the above training process requires a large number of training iterations (tens of thousands) to obtain the final model parameter values that meet the task requirements, model training of a deep neural network is often a very time-consuming process.
3. Distributed AI model training
As network models become more and more complex and data volumes grow, the computation required for model training becomes extremely large; distributed training is therefore used to meet the timeliness requirements of model generation. Distributed training means cooperative training across the central processing units (CPUs) or GPU devices of multiple nodes. Currently, the mainstream distributed training modes include the centralized parameter node mode and the decentralized AllReduce mode. Distributed training on GPUs is described below; it should be understood that the CPU case is similar, except that only CPUs serve as the computing devices of the worker nodes.
FIG. 3 is a schematic diagram of the parameter node mode according to an embodiment of this application.
As shown in FIG. 3, the mode may include a parameter node 310 (parameter server, PS), a worker node 320, and a worker node 330.
The parameter node and the worker nodes may be implemented by servers. A server implementing the parameter node may include at least one CPU, and a server implementing a worker node may include at least one CPU and at least one GPU, where the at least one GPU is used for job training.
As an example, the parameter node 310 is the central synchronization node of the model during machine learning model training. It maintains the parameters of the model, updates them during iterative training, and distributes the parameters to different devices to update the model and continue training. Every GPU participating in the training holds an identical copy of the neural network model; the GPUs may be on different nodes, and the CPU of each node (for example, the worker node 320 or the worker node 330) issues instructions to invoke the GPUs for the computation of the model. In each iteration, different GPUs process different batches of data; after an iteration completes, they need to synchronize parameters with the parameter node 310 to ensure that the parameters on different GPUs remain consistent during model training.
FIG. 4 is a schematic diagram of the decentralized parameter synchronization mode according to an embodiment of this application.
Unlike the parameter node mode shown in FIG. 3, in this mode multiple worker nodes (for example, the worker node 401 to the worker node 405) can directly exchange synchronization parameters or gradient values with one another, without synchronizing parameters or gradient values through a parameter node (also called a parameter server).
In both the distributed training methods shown in FIG. 3 and FIG. 4, a large number of model parameters, for example ranging from the MB level to the GB level, need to be transferred between nodes in every iteration; the distributed training process therefore places high demands on the network transmission bandwidth between nodes.
4. AI training job scheduling
In the cloud data center scenario, that is, a resource pool shared by multiple users rather than dedicated resources enjoyed by a single user, a dedicated scheduler is needed to schedule the jobs of different users and to select suitable nodes for running the different tasks of each job. On the one hand, the hardware and software environment requirements of the jobs need to be satisfied; on the other hand, resource utilization needs to be improved to achieve the core goal of resource sharing: time-division multiplexing. In other words, for AI training jobs using distributed training, different jobs may compete for network resources on the same links; the scheduler then needs to schedule resources for the different jobs of multiple users and select suitable nodes and GPUs for placing their tasks.
Currently, distributed training is usually used to meet the timeliness requirements of jobs with network transmission requirements, for example, AI training jobs. With distributed training, different jobs may compete for the same hardware resources; a scheduler is therefore needed to schedule hardware resources for the different jobs of multiple users and to allocate suitable nodes (for example, servers) for running the tasks included in each job. Current schedulers usually allocate nodes with suitable hardware resources based on the hardware resource requirements of the tasks, while ignoring the network performance requirements of AI training jobs. For example, in AI training, network transmission requirements exist between the multiple tasks of the same job; the prior art ignores these requirements, resulting in low running efficiency of AI training jobs.
In view of this, this application proposes a job scheduling method and a job scheduling apparatus. Node screening is performed in a node cluster separately according to the n tasks of a target job to obtain n candidate node sets; from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score is selected as the target node of the m-th task, the target node of the m-th task is used to process the m-th task, and the network transmission performance score is determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between the n tasks, the cross-node degree of the n tasks, and the node idleness degree. In the embodiments of this application, when resources are allocated to the target job, not only the requirement information of the target job but also the network transmission performance of the multiple tasks of the same job can be considered, which improves the network transmission speed of the target nodes when running the target job, shortens the running time of the target job, and improves its running efficiency.
FIG. 5 is a schematic diagram of a system architecture for AI training according to an embodiment of this application.
As shown in FIG. 5, the system architecture may include a graphical user interface/client 510, an AI job management server 520, a resource management server 530, and hardware infrastructure 540.
Schematically, the graphical user interface/client 510 may be used to receive AI training jobs from different users. The AI job management server 520 may be used to manage and submit the AI training jobs received from different users. The resource management server 530 may include resource management and a scheduler, where the resource management may be used to bind and release resources, and the scheduler may schedule resources for jobs according to the requirements of different jobs. The hardware infrastructure 540 may refer to CPUs, memory, networks, GPUs, and remote direct memory access (RDMA).
As an example, a user may submit an AI training job through the graphical user interface/client 510. After receiving the request, the AI job management server 520 may parse the job and submit a resource request to the resource management server 530. After receiving the request, the resource management server 530 may use the scheduler to select suitable nodes for placing the job from the managed hardware infrastructure 540, that is, the underlying physical resources. After the scheduler completes node selection, the corresponding AI training job is started on the corresponding nodes, and this part of the resources is occupied by the job until the job ends and the resources are released.
The physical architecture of a data center used for AI training jobs is described below with reference to FIG. 6.
FIG. 6 is a schematic diagram of the physical architecture of a data center used for AI training jobs according to an embodiment of this application.
As shown in FIG. 6, the physical architecture may include a first-level switch 610, a second-level switch 620, and a second-level switch 630. The first-level switch 610 may be used to manage the second-level switch 620 and the second-level switch 630; the second-level switch 620 may be used to manage a server 621 and a server 622; the second-level switch 630 may be used to manage a server 631 and a server 632.
As an example, the first-level switch 610 may be a core switch, and the second-level switch 620 and the second-level switch 630 may be top-of-rack switches. A top-of-rack switch may be connected to multiple servers, and each server in turn includes CPU and GPU resources; a server may correspond to a node in the embodiments of this application.
It should be noted that the physical architecture may also include one or more levels of switches; FIG. 6 uses two levels of switches, that is, the first-level switch and the second-level switches, as an example, which is not limited in the embodiments of this application.
It is worth noting that the second-level switch 620, the server 621, and the server 622 are arranged in the same rack, for example, rack 1, and the second-level switch 630, the server 631, and the server 632 are arranged in the same rack, for example, rack 2.
The job scheduling method in the embodiments of this application is described in detail below with reference to FIG. 7.
The job scheduling method shown in FIG. 7 may be performed by the scheduler shown in FIG. 5 and may be applied to the physical architecture shown in FIG. 6. The method 700 shown in FIG. 7 includes S710 to S730, which are described in detail below.
S710. Receive a target job, where the target job includes n tasks.
In an example, a resource request of the target job may be received. The resource request may be used to request resources for running the target job and may carry the requirement information of the target job; the target job is a job that has network transmission requirements when running.
For example, the hardware resource request carried by the job may be received. According to the hardware resource requirement carried by each task, the scheduler may perform node screening in the node cluster separately to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirement carried by the corresponding task.
As an example, the target job may be an AI training job, or another type of job with network transmission requirements.
In an example, resource requests of multiple target jobs may also be received; the resource requests of the multiple target jobs may come from different users or from the same user, and one of the multiple target jobs may include multiple target tasks.
S720. Perform node screening in the node cluster separately according to the n tasks of the target job, to obtain n candidate node sets.
Each candidate node set includes multiple candidate nodes.
As an example, the hardware resource request carried by the job may be received; according to the hardware resource requirement carried by each task, the scheduler may perform node screening in the node cluster separately to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirement carried by the corresponding task.
The hardware resource requirement may refer to screening out qualified nodes through port filtering, node label matching, and the like, for example, by the GPU type that a node contains.
For example, node port filtering may mean that the job can run on nodes other than those with a certain port number; node label matching may mean selecting the nodes for running the target job based on an IP address range.
The node screening method in step S720 may use the common methods of existing schedulers, which is not limited here.
S730. Select, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task.
The target node of the m-th task is used to process the m-th task; the network transmission performance score is determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between the n tasks, the cross-node degree of the n tasks, and the node idleness degree; n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
It should be understood that m is any positive integer from 1 to n. For example, the initial value of m may be set to 1, then to 2, 3, 4, ..., n, so that the n tasks and the n candidate node sets are traversed through m, and n target nodes are selected from the n candidate node sets respectively.
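As an illustration only and not part of the original disclosure, the following Python sketch shows this traversal; score(node, task) stands in for the combined network transmission performance score, whose composition is described below, and all names are assumptions introduced here.

```python
def schedule(tasks, candidate_sets, score):
    """Pick, for each task m = 1..n, the highest-scoring candidate node.

    `score(node, task)` is a placeholder for the combined network
    transmission performance score; its composition is configurable.
    """
    targets = []
    for task, candidates in zip(tasks, candidate_sets):
        best = max(candidates, key=lambda node: score(node, task))
        targets.append(best)  # this node will process the task
    return targets

# Toy usage with a score that simply prefers the node with more free GPUs.
tasks = ["worker-0", "worker-1"]
candidates = [[("n1", 2), ("n2", 6)], [("n2", 6), ("n3", 4)]]
print(schedule(tasks, candidates, lambda node, task: node[1]))
```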
In an embodiment, a higher same-rack aggregation degree of the n tasks indicates a higher network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest performance score as the target node of the m-th task includes:
determining whether all of the n tasks can be placed in the rack in which a candidate node in the m-th candidate node set is located; if yes, increasing the network transmission performance score of the candidate node; if no, decreasing the network transmission performance score of the candidate node.
It should be understood that the network transmission performance score of the candidate node may be determined based on the same-rack aggregation degree of the n tasks; scoring by the same-rack aggregation dimension aims to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding cross-rack data transmission between tasks and effectively improving the network transmission efficiency of the job.
As an example, as shown in FIG. 6, it is first determined whether the n tasks can be placed in the rack in which a candidate node in the m-th candidate node set is located. For instance, assuming that one candidate node in the m-th candidate node set is the server 621, it can be determined whether the n tasks can be placed on the servers connected to the second-level switch 620, that is, whether the n tasks can be placed on the server 621, or on the server 621 and the server 622. If the servers connected to the second-level switch 620 can hold the n tasks, the performance score of the server is increased; if they cannot, the performance score of the server is decreased.
For example, suppose the candidate node set includes candidate node 1 to candidate node 4, where candidate node 1 and candidate node 2 correspond to rack 1, and candidate node 3 and candidate node 4 correspond to rack 2. If none of the tasks of a job has been allocated, the placeability within a single rack is considered first; that is, if the resources of the candidate nodes managed by rack 1 can hold all the tasks of the job, the tasks are preferentially allocated to the resources in rack 1. If at least one task of the job has already been bound to resources, for example, one task of the job has been allocated to candidate node 1, the other tasks of the job are preferentially allocated to candidate node 1 or to candidate node 2, which corresponds to the same rack 1 as candidate node 1.
In the embodiments of this application, when the target job is scheduled, that is, when resources are allocated to the target job, the multiple tasks included in the target job can be placed, as far as possible, on one or more nodes managed by the same rack, thereby minimizing the cross-rack network transmission bandwidth occupied when the target job runs, which in turn shortens the running time of the target job and improves its running efficiency.
As an example, for the specific implementation of determining the performance score of a candidate node through the same-rack aggregation degree of the n tasks, refer to step S831 shown in FIG. 8; a code sketch is also given below.
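As an illustration only and not part of the original disclosure, the following Python sketch applies the rule above to the four-node, two-rack example; the Node and Job structures and all names are assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    rack: str
    free_gpus: int

@dataclass
class Job:
    total_gpus: int
    placed_nodes: list = field(default_factory=list)  # nodes already bound

def rack_score(node, job, all_nodes):
    """Same-rack aggregation scoring, sketching the rule described above."""
    if not job.placed_nodes:
        # No task bound yet: can the whole job fit inside this node's rack?
        free = sum(n.free_gpus for n in all_nodes if n.rack == node.rack)
        return 1 if free >= job.total_gpus else -1
    # Some tasks already bound: prefer the rack they landed in.
    placed_racks = {n.rack for n in job.placed_nodes}
    return 1 if node.rack in placed_racks else -1

nodes = [Node("n1", "rack1", 4), Node("n2", "rack1", 4),
         Node("n3", "rack2", 8), Node("n4", "rack2", 8)]
job = Job(total_gpus=10)
print([rack_score(n, job, nodes) for n in nodes])  # [-1, -1, 1, 1]
```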
In an embodiment, a higher affinity between the n tasks indicates a higher network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest performance score as the target node of the m-th task includes:
confirming the type of the m-th task; when the type of the m-th task is a worker node task, determining whether a candidate node in the m-th candidate node set needs to host another worker node task or a parameter node task of the n tasks; and if yes, increasing the network transmission performance score of the candidate node;
when the type of the m-th task is a parameter node task, determining whether a candidate node in the m-th candidate node set needs to host a worker node task of the n tasks; if yes, increasing the network transmission performance score of the candidate node; and determining whether the candidate node in the m-th candidate node set needs to host another parameter node task of the n tasks; if yes, decreasing the network transmission performance score of the candidate node.
It should be understood that the tasks include worker node tasks and parameter node tasks. A worker node task performs the iterative computation of a neural network; a neural network model involves input parameters and output parameters, and a parameter node manages the input parameters and output parameters of the worker nodes.
The network transmission performance score of the candidate node may be determined based on the affinity between tasks of different types among the n tasks. Scoring by this affinity dimension aims to co-locate the worker node tasks and parameter node tasks of the same job on one node as far as possible, so that the internal data transmission of the job happens within one node as much as possible; at the same time, it avoids concentrating multiple parameter node tasks of the same job on one node, so that a failure of that node does not stop multiple parameter node tasks at once and leave the input and output parameters of the job's worker node tasks without effective management.
As an example, the n tasks may include tasks of different types, for example, worker node tasks and parameter node tasks. As shown in FIG. 4, each of the multiple tasks is a worker node task. When the type of the m-th task is a worker node task, it is determined whether a candidate node in the m-th candidate node set already hosts another worker node task or a parameter node task of the n tasks; that is, as shown in FIG. 6, if the m-th task is a worker node task, it is determined whether a server already hosts another worker node task or a parameter node task of the n tasks; if it does, the performance score of that server is increased.
As an example, the n tasks may include tasks of different types, for example, worker node tasks and parameter node tasks. As shown in FIG. 3, the parameter node 310 may also be called a parameter server. When the type of the m-th task is a parameter node task, it is determined whether a candidate node in the m-th candidate node set already hosts a worker node task of the n tasks; that is, as shown in FIG. 6, if the m-th task is a parameter node task, it is determined whether a server already hosts a worker node task of the n tasks; if it does, the performance score of the server is increased. It is also determined whether the server already hosts another parameter node task of the n tasks; if it does, the performance score of the server is decreased.
It should be understood that, because worker node tasks and parameter node tasks exchange data frequently, worker node tasks and parameter node tasks can be co-located as far as possible in view of the network transmission load; and because parameter node tasks carry a large data volume, concentrating multiple parameter nodes on the same server is avoided.
It should be noted that affinity means that if application A and application B interact frequently, it is worthwhile to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity: when an application is deployed with multiple replicas, anti-affinity should be used to spread the application instances across nodes to improve reliability. Therefore, affinity should be increased between worker node tasks, and between worker node tasks and parameter node tasks, so that these tasks are as close as possible, for example, placed on the same node; affinity between parameter node tasks should be reduced (that is, anti-affinity should be increased), so that parameter node tasks are placed on as many different nodes as possible.
In the embodiments of this application, scoring by the affinity between tasks of different types among the n tasks takes into account the affinity of resource allocation across task types, so that worker-node-type tasks are placed together as far as possible, which in turn shortens the running time of the target job and improves its running efficiency.
A parameter node task may be responsible for maintaining the parameters of the model and distributing the parameters to different worker nodes after they are updated in iterative training; a worker node task may execute the data iteration of a certain batch. For example, as shown in FIG. 3, the parameter node and the worker nodes exchange data frequently; for instance, the parameter node may send initial parameters to a worker node, and the worker node, after updating the initial parameters, needs to send the updated parameters back to the parameter node.
As an example, for the specific implementation of determining the performance score of a candidate node through the affinity between the n tasks, refer to step S832 shown in FIG. 8; a code sketch is also given below.
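As an illustration only and not part of the original disclosure, the following Python sketch encodes the PS/worker affinity rule above; the +1 and -0.5 values follow step S832 of FIG. 8, while the function and parameter names are assumptions.

```python
def affinity_score(node_tasks, new_task_role):
    """PS/worker affinity scoring for one candidate node.

    `node_tasks` lists the roles ("worker" or "ps") of this job's tasks
    already placed on the node.
    """
    if new_task_role == "worker":
        # A worker gains from being near any task of the same job.
        return 1.0 if node_tasks else 0.0
    # new_task_role == "ps": pull toward workers, push away from other PS.
    score = 1.0 if "worker" in node_tasks else 0.0
    if "ps" in node_tasks:
        score -= 0.5  # anti-affinity between parameter node tasks
    return score

print(affinity_score(["worker"], "ps"))        # 1.0
print(affinity_score(["worker", "ps"], "ps"))  # 0.5
```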
In an embodiment, selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
confirming the cross-node quantity of the candidate node in the m-th candidate node set when the candidate node processes other jobs in the running state;
when all of the n tasks can be placed on a candidate node in the m-th candidate node set, a larger cross-node quantity leads to a larger increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a smaller increase; when not all of the n tasks can be placed on a candidate node in the m-th candidate node set, a larger cross-node quantity leads to a smaller increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a larger increase.
It should be noted that the performance score of the candidate node may be determined based on the cross-node degree of the n tasks; scoring by the cross-node degree aims to measure how much inter-node bandwidth the jobs that have already been allocated resources occupy.
It should be understood that, in either of the above cases, the added amount is always larger than the deducted amount. Jobs that do not need cross-node allocation are preferentially placed on candidate nodes with a large cross-node quantity, and jobs that need cross-node allocation are preferentially placed on candidate nodes with a small cross-node quantity.
It should also be understood that, when scoring the performance of a candidate node, only the other jobs in the running state processed by the candidate node in the m-th candidate node set are considered; jobs that have finished running no longer occupy network transmission load and are therefore ignored.
In the embodiments of this application, scoring by the cross-node degree of the n tasks takes into account the occupation of inter-node transmission bandwidth by jobs with allocated resources. When resources are allocated to the target job, not only the requirement information of the target job but also the network transmission information is therefore considered, which improves the network transmission performance when running the target job, shortens its running time, and improves its running efficiency.
When all of the n tasks can be placed on one candidate node in the m-th candidate node set, a larger cross-node quantity of the candidate node indicates that the other jobs running on it frequently exchange data with other nodes. If this candidate node is selected as the target node of the current task, assigning the current task to it is guaranteed not to add further interactions between the candidate node and other nodes; therefore, increasing the bonus applied to its performance score ensures that this candidate node is preferentially selected as the target node. Conversely, a smaller cross-node quantity indicates that the other jobs running on the candidate node rarely interact with other nodes; reducing the bonus ensures that this candidate node is not preferentially selected as the target node.
When not all of the n tasks can be placed on a candidate node in the m-th candidate node set, a larger cross-node quantity indicates that the other jobs running on the candidate node frequently exchange data with other nodes. If this candidate node were selected as the target node of the current task, assigning the current task to it would further increase its interactions with other nodes and degrade its network performance; therefore, reducing the bonus applied to its performance score ensures that this candidate node is not preferentially selected as the target node. Conversely, a smaller cross-node quantity indicates that the other jobs running on the candidate node rarely interact with other nodes; increasing the bonus ensures that this candidate node is preferentially selected as the target node. After the tasks of the target job are assigned to it, the number of its interactions with other nodes can rise moderately, optimizing allocation efficiency.
In a possible implementation, the cross-node degree of the n tasks is determined based on the number of different candidate nodes to which the n tasks are assigned.
For example, when sensing the network contention of cross-node jobs, the scheduler may record the number of network connections of cross-node jobs on each node.
In a possible implementation, the cross-node degree of the n tasks is determined by monitoring the real-time bandwidth usage of the network.
For example, a monitoring system may monitor the smoothed value of the real-time bandwidth used on network links by existing jobs, denoted B, and score the current node as score = 1 + 1/(B + 1). A larger cross-node quantity means more occupied bandwidth and a lower score, so new jobs should not be placed on that node.
As an example, the smoothed value of the real-time bandwidth usage may be the bandwidth load at a certain moment, or the bandwidth load obtained by smoothing the bandwidth usage at multiple moments within a preset time period, where the smoothing may be taking the average, the maximum, the minimum, or another data processing method.
For example, a data packet may be obtained, and the task ID corresponding to the packet can be determined from its IP address; the task ID in turn indicates whether the corresponding job is running. The more jobs are running, the larger the real-time bandwidth usage of the network, which indicates a larger cross-node degree of the n tasks.
It should be understood that, because the network transmission bandwidth of distributed AI training fluctuates very little, real-time bandwidth monitoring can well characterize the network transmission requirements of a job.
As an example, as shown in FIG. 6, if all of the n tasks can be placed on one server, a larger cross-node quantity of the server leads to a larger bonus to its performance score, where the cross-node quantity of a server may refer to the number of other servers with which it needs to exchange data, or the cross-node degree of the server may be indicated by real-time monitoring of its bandwidth usage. If the n tasks cannot all be placed on one server, a smaller cross-node quantity of the server leads to a larger bonus to its performance score. In other words, jobs that do not need to be placed across servers are preferentially placed on servers with a large cross-node quantity, and jobs that need to be placed across servers are preferentially placed on servers with a small cross-node quantity.
As an example, for the specific implementation of determining the performance score of a candidate node through the cross-node degree of the n tasks, refer to step S833 shown in FIG. 8; a code sketch is also given below.
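As an illustration only and not part of the original disclosure, the following Python sketch encodes the cross-node preference above; only the monotonic behavior is given by the text, so the exact curves are assumptions.

```python
def cross_node_score(num_cross_nodes_job, job_fits_on_one_node):
    """Cross-node degree preference, sketching the two cases above.

    `num_cross_nodes_job` is the per-node counter of network connections
    of running cross-node jobs (see S833 of FIG. 8).
    """
    c = num_cross_nodes_job
    if job_fits_on_one_node:
        # Job adds no inter-node traffic: prefer already-busy nodes.
        return c / (c + 1)
    # Job will span nodes: prefer nodes with few cross-node connections.
    return 1 / (c + 1)

print(cross_node_score(3, True), cross_node_score(3, False))  # 0.75 0.25
```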
In an embodiment, a smaller node idleness degree indicates a higher network transmission performance score, and selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task includes:
determining whether the hardware resources used for job training on the candidate node in the m-th candidate node set are in use; and if yes, increasing the network transmission performance score of the candidate node.
It should be understood that the performance score of the candidate node may be determined by the node idleness degree. Scoring by node idleness aims to keep nodes whose job-training hardware resources are completely idle in reserve for subsequent large-scale tasks, so that a large-scale task can be placed on a single node as far as possible and resource fragmentation is avoided. Therefore, adding to the performance score of a candidate node whose job-training hardware resources are in use ensures that this candidate node is preferentially selected as the target node, while candidate nodes whose job-training hardware resources are unused are not preferentially selected; candidate nodes with unused job-training hardware resources thus stay idle, candidate nodes whose job-training hardware resources are partially used become fully used, and resource fragmentation is avoided.
Optionally, the hardware resources include a graphics processing unit and a central processing unit.
In an embodiment, selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task further includes:
confirming the allocation rate of the hardware resources used for job training on the candidate node in the m-th candidate node set;
increasing the network transmission performance score of the candidate node according to the allocation rate, where a larger allocation rate leads to a larger increase in the network transmission performance score of the candidate node, and a smaller allocation rate leads to a smaller increase.
When it is confirmed that the job-training hardware resources of the candidate node are in use, the usage of those resources, that is, the hardware resource allocation rate, is further examined. A higher allocation rate indicates that the candidate node's hardware resources are used more fully; in that case it is desirable to assign the task to this candidate node so that the node makes full use of its own hardware resources, and the bonus to its performance score is therefore increased. Conversely, the bonus is reduced.
As an example, as shown in FIG. 6, if the GPU or CPU allocation rate of a server is larger, that is, fewer CPUs or GPUs are idle, the performance score of the server is increased; if the GPU or CPU allocation rate is smaller, that is, more CPUs or GPUs are idle, the performance score of the server is decreased.
In the embodiments of this application, scoring by node idleness keeps completely idle GPU hosts in reserve as far as possible so that large-scale tasks can be placed, which avoids resource fragmentation, improves the running efficiency of large-scale tasks, and improves the utilization of cluster resources.
As an example, for the specific implementation of scoring by node idleness, refer to steps S834 and S835 shown in FIG. 8; a code sketch is also given below.
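As an illustration only and not part of the original disclosure, the following Python sketch encodes the node idleness rule above; the exact values are assumptions, only the ordering they induce is given by the text.

```python
def idleness_score(gpus_allocated, gpus_total):
    """Node idleness scoring, sketching S834/S835 as described above."""
    if gpus_allocated == 0:
        return 0.0  # fully idle node: keep it in reserve for large tasks
    # 1 for "already in use", plus the allocation rate so fuller nodes win.
    return 1.0 + gpus_allocated / gpus_total

print(idleness_score(0, 8), idleness_score(3, 8), idleness_score(7, 8))
```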
The performance score of the candidate node may be determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between tasks of different types among the n tasks, the cross-node degree of the n tasks, and the node idleness degree.
For example, based on the strategies of the above dimensions, a user may enable or disable each one individually through configuration, or enable the strategies in combination and define a scheduling policy with different weight values.
As an example, the weight values corresponding to the different evaluation dimensions may be thresholds preset according to user requirements, and the weight values may be set according to the priorities of the evaluation dimensions. For example, if rack aggregation has the highest priority among the evaluation dimensions, the weight corresponding to rack aggregation may be configured as the largest value among the weights.
In the embodiments of this application, node screening is performed in the node cluster separately according to the n tasks of the target job to obtain n candidate node sets; the performance of every candidate node in the m-th candidate node set corresponding to the m-th task is scored, the candidate node with the highest score is selected as the target node of the m-th task, and the m-th task is assigned to the target node of the m-th task, where the performance may include one or any combination of the same-rack aggregation degree of the n tasks, the affinity between tasks of different types among the n tasks, the cross-node degree of the n tasks, and the node idleness degree. In the embodiments of this application, when resources are allocated to the target job, not only the requirement information of the target job but also the network transmission load can be considered, which improves the network transmission performance when running the target job, shortens its running time, and improves its running efficiency.
The job scheduling processes that use the evaluation strategies of the above dimensions are described in detail below with reference to FIG. 8 and FIG. 9.
FIG. 8 is a schematic flowchart of a job scheduling method according to an embodiment of this application. The method includes steps S810 to S870, which are described in detail below.
S810. Parse all tasks contained in the job and select target nodes for these tasks in turn.
The job may be an AI training job, or another job that has network transmission requirements when running.
For example, the scheduler may take a job from the job queue for scheduling according to a certain rule, such as the dominant resource fairness (DRF) algorithm or another algorithm. The scheduler parses all tasks contained in the job and schedules each task in turn, selecting a suitable node for binding; the bound node is used to execute the task.
S820. Perform node screening in the node cluster separately according to the hardware resource requirement carried by each task, to obtain n candidate node sets.
The hardware resource requirement may refer to screening out qualified nodes through port filtering, node label matching, and the like, for example, by the GPU type that a node contains.
For example, node port filtering may mean that the job can run on nodes other than those with a certain port number; node label matching may mean selecting the nodes for running the target job based on an IP address range. The node pre-selection method in S820 may use the common methods of existing schedulers, which is not limited here.
S830. Traverse all candidate nodes, evaluate the network transmission performance score of each candidate node by different dimensions, and finally obtain the candidate node with the highest network transmission performance score among all candidate nodes.
For example, all candidate nodes may be evaluated by different dimensions, with each evaluation multiplied by a weight; the pre-selected candidate nodes are then ranked, and the node to which a task will be bound is obtained.
Schematically, step S830 may include steps S831 to S835; that is, the network transmission performance scores of all candidate nodes may be evaluated from the dimension of the rack used for managing nodes, the affinity dimension, the cross-node dimension, the large-scale-task dimension, and the hardware resource quantity dimension.
It should be understood that the evaluations of the above different dimensions may be based mainly on network transmission bandwidth and on avoiding resource fragmentation. Considering network transmission bandwidth improves the running efficiency of AI jobs; avoiding resource fragmentation takes task placement into account, so that large blocks of resources are kept for subsequently placing large-scale tasks, improving the overall resource utilization.
S831. Evaluate the network transmission performance score of the candidate nodes by the rack dimension. The goal of evaluating by this dimension is to place the multiple tasks of a single job in the same rack as far as possible, thereby avoiding cross-rack data transmission between tasks and effectively improving the network transmission efficiency of the job.
For example, the weight value w1 of this dimension may be 10000, and the evaluation value is obtained as follows:
1. If none of the tasks of a job has been allocated, the placeability at the rack dimension is considered, that is, it is calculated whether the remaining candidate nodes of the rack can hold all the tasks of the job.
If the job to which the task belongs can be placed on the candidate node, the evaluation value of the management switch to which the candidate node belongs is 1, that is, score = 1; if the job to which the task belongs cannot be placed on the candidate node, the evaluation value of the management switch to which the candidate node belongs is -1, that is, score = -1.
2. If at least one task of the job has already been bound to resources, the affinity factor of task placement is considered.
Affinity means that if application A and application B interact frequently, it is worthwhile to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication. The opposite of affinity is anti-affinity, which means that when an application is deployed with multiple replicas, anti-affinity should be used to spread the application instances across nodes to improve reliability.
If a task of the same job has already been scheduled to a rack, then for another node: if the other node is managed by the same rack as the node where the scheduled task is located, the evaluation value of the other node is 1, that is, score = 1; if the other node is managed by a different rack from the node where the scheduled task is located, the evaluation value of the other node is -1, that is, score = -1.
For example, suppose the candidate node set includes candidate node 1 to candidate node 4, where candidate node 1 and candidate node 2 correspond to rack 1, and candidate node 3 and candidate node 4 correspond to rack 2. If none of the tasks of a job has been allocated, the placeability within a single rack is considered first; that is, if the resources of the candidate nodes in rack 1 can hold all the tasks of the job, the tasks are preferentially allocated to the resources in rack 1. If at least one task of the job has already been bound to resources, for example, one task of the job has been allocated to candidate node 1, the other tasks of the job are preferentially allocated to candidate node 1 or to candidate node 2, which corresponds to the same rack 1 as candidate node 1.
It should be understood that the resources of a rack may refer to the hardware resources of the servers, that is, the candidate nodes, included in the rack; for example, the hardware resources may be the CPUs, GPUs, or memory of a server.
S832. Evaluate the network transmission performance score of the candidate nodes by the affinity dimension between parameter node tasks (PS) and worker node tasks (worker), that is, the PS-worker affinity dimension. The goal of evaluating by this dimension is to make good use of the network transmission bandwidth between worker nodes by placing them together, and at the same time to avoid, as far as possible, concentrating PS tasks on the same node, which would make the PS a bottleneck.
It should be noted that the parameter node PS and the worker node worker may refer to different task types. For example, as shown in FIG. 3, if a node is the parameter node 310, that is, it is responsible for maintaining the parameters of the model, updating them during iterative training, and distributing the parameters to different devices to update the model, the node is a PS; if a node is a GPU in the node 320 or the node 330 used to execute the data iteration of a certain batch, the node is a worker; if a node is neither a PS nor a worker, the node is a resource that can be used for task scheduling.
For example, the weight value w2 of this dimension may be 1000, and the evaluation value is obtained as follows:
1. If the task is a worker node task (worker), and a traversed node already has another worker node task allocated for this job, the evaluation value of that node is 1, that is, score = 1.
It should be understood that placing multiple tasks of one job on the same node allows the multiple tasks of the same job to be placed together, which reduces the demand for transmission bandwidth between nodes and improves the running efficiency of the tasks.
2. If the task is a parameter node task (PS), and a traversed node already has a worker node task allocated for this job, the evaluation value of that node is 1, that is, score = 1; if the node already hosts another parameter node task, the evaluation value of that node is -0.5, that is, score = -0.5.
It should be understood that if both the PS and a worker are placed on the same node, the demand for transmission bandwidth when the worker synchronizes or shares parameters with the PS is reduced, improving the running efficiency of the tasks. In addition, because the PS carries a heavy computation load, placing the PS tasks of multiple jobs on the same node needs to be avoided; that is, the PS tasks of different jobs can be placed on different nodes, so that PS tasks do not concentrate on one node and become a bottleneck.
S833. Evaluate the network transmission performance score of the candidate nodes by the cross-node dimension. The goal of evaluating by this dimension is to evaluate how much inter-node bandwidth the jobs with allocated resources occupy.
For example, the weight value w3 of this dimension may be 100, and the evaluation value is obtained as follows:
Suppose the number of inter-node network transmission connections recorded by the scheduler is node.num_cross_nodes_job; when a job schedules GPU training tasks on two nodes at the same time, the network transmission connection count of both nodes is increased by 1, with a default count of 0.
1. If the number of tasks contained in the job equals 1, or the remaining resources of every traversed node are greater than or equal to the total resources required by the job, that is, it is determined that all tasks of the job can be scheduled to the same node: for jobs that do not need cross-node scheduling, nodes with more cross-node tasks are preferred; that is, tasks whose resource scheduling can be satisfied without crossing nodes can be preferentially deployed on nodes that are already bound to many cross-node tasks.
For example, the evaluation value may be:
[Formula (1), rendered only as an image in the original publication; from the surrounding text it is an evaluation value that increases with node.num_cross_nodes_job.]
2. If the number of tasks contained in the job is not equal to 1, or the remaining resources of every traversed node are less than the total resources required by the job: for jobs that need cross-node scheduling, nodes with fewer cross-node tasks are preferred.
For example, the evaluation value may be:
[Formula (2), rendered only as an image in the original publication; from the surrounding text it is an evaluation value that decreases as node.num_cross_nodes_job grows.]
As an example, formula (1) and formula (2) corresponding to the above evaluation values are illustrative; this application does not limit the parameters in the formulas.
It should be understood that, in case 1 above, if a job contains only 1 task or the remaining resources of every node can satisfy the job's resource requirements, the job is preferentially allocated to a node with a large network transmission connection count (also called network transmission load). When the job contains 1 task or the remaining resources of every node can satisfy its resource requirements, the job does not need to occupy cross-node transmission bandwidth, so it can be allocated to a node with a large network transmission connection count.
It should also be understood that, in case 2 above, because the number of tasks contained in the job is not equal to 1, or the remaining resources of every node cannot satisfy the job's resource requirements, the job may need cross-node allocation, so allocating it to a node with a small network transmission connection count is considered first. Cross-node jobs need to occupy cross-node transmission bandwidth; therefore, to improve the running efficiency of the job, it is preferentially allocated to a node with a small network transmission connection count.
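Because formulas (1) and (2) appear only as images in the source, the following Python sketch substitutes one plausible pair of monotone evaluation functions consistent with the text; the exact expressions are assumptions, not the published formulas.

```python
def eval_no_cross_needed(num_cross):
    """Case 1 (formula (1)): job fits on one node, prefer busy nodes.

    Assumed form; the published formula is only known to increase
    with node.num_cross_nodes_job.
    """
    return num_cross / (num_cross + 1)

def eval_cross_needed(num_cross):
    """Case 2 (formula (2)): job spans nodes, prefer quiet nodes.

    Assumed form; the published formula is only known to decrease
    as node.num_cross_nodes_job grows.
    """
    return 1 / (num_cross + 1)

print(eval_no_cross_needed(3), eval_cross_needed(3))  # 0.75 0.25
```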
In an embodiment, when sensing the network contention of cross-node jobs, the number of network connections of cross-node distributed training jobs on each node may be recorded.
In another embodiment, when sensing the network contention of cross-node jobs, a monitoring system may be used, that is, the smoothed value of the real-time bandwidth used on network links by existing jobs is monitored and denoted B. The node is then scored as score = 1 + 1/(B + 1): the more bandwidth is occupied, the lower the score, and new distributed training jobs should not be placed on that node.
As an example, the smoothed value of the real-time bandwidth usage may be the bandwidth load at a certain moment, or the bandwidth load obtained by smoothing the bandwidth usage at multiple moments within a preset time period, where the smoothing may be taking the average, the maximum, the minimum, or another data processing method.
It should be noted that, because the network transmission bandwidth of distributed AI training fluctuates very little, real-time bandwidth monitoring can well characterize the network transmission requirements of a job.
In a possible implementation, the AI training job may also be another type of job with network transmission requirements; the network transmission requirement of a job may be identified automatically, or the job may manually submit a configuration file of its network connections, and scheduling is then performed through the network-transmission-load-aware scheduling mechanism of the embodiments of this application.
S834. Evaluate the network transmission performance score of the candidate nodes by the large-scale-task dimension. The goal of evaluating by this dimension is to keep completely idle hardware resources in reserve as far as possible so that large-scale tasks can be placed, avoiding resource fragmentation.
Optionally, the hardware resources include GPUs and CPUs.
For example, taking GPUs as an example, the weight value w4 of this dimension may be 10, and the evaluation value is obtained as follows:
1. For a node whose GPU allocation rate is 0, the evaluation value may be 0, that is, score = 0;
2. For a node whose GPU allocation rate is greater than 0, the evaluation value may be 1, that is, score = 1.
It should be noted that the GPU allocation rate may refer to the amount of GPU resources that have already been allocated to tasks; a GPU allocation rate of 0 means that all GPUs on the node are completely idle.
S835. Evaluate the network transmission performance score of the candidate nodes by the hardware resource dimension. The goal of evaluating by this dimension is to reduce resource fragmentation and to improve, as far as possible, the placement possibility of tasks that need large-scale hardware resources, by preferentially filling up candidate nodes with few remaining hardware resources.
For example, the weight value w5 of this dimension may be 1, and the evaluation value is obtained as follows:
score = GPU_allocated / GPU_total (formula reconstructed from the definitions below; the original publication renders it only as an image)
Here, GPU_allocated may represent the number of GPUs already occupied on the node, and GPU_total may represent the total number of GPUs on the node.
It should be noted that step S834 and step S835 may refer to the same dimension: both evaluate the network transmission performance score of the candidate nodes through node idleness, so that completely idle hardware resources are kept in reserve as far as possible for placing large-scale tasks and resource fragmentation is avoided.
It should be understood that the above dimension-based evaluations are described with the convention that a node with a larger evaluation value has a higher priority and is preferentially selected for task placement; similarly, a smaller evaluation value could equally be defined to indicate a higher priority.
It should also be understood that the weight values w1 to w5 may be thresholds preset according to user requirements, and the weight values of the evaluation dimensions may be set according to their priorities; for example, if the rack dimension has the highest priority among the evaluation dimensions, the weight w1 corresponding to the rack dimension may be configured as the largest value among w1 to w5.
S840. Multiply the evaluation values of all the above dimensions by their weights, and add them up to obtain the final score of the task on each candidate node; select the node with the largest score to place the task.
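As an illustration only and not part of the original disclosure, the following Python sketch shows the weighted combination of S840 with the example weights w1 to w5 given above; all names are assumptions introduced here.

```python
# Weights in priority order, as exemplified above (w1 > w2 > ... > w5).
WEIGHTS = {"rack": 10000, "affinity": 1000, "cross_node": 100,
           "large_task": 10, "hw_resource": 1}

def final_score(evals):
    """S840: weighted sum of the per-dimension evaluation values.

    `evals` maps dimension name -> evaluation value for one candidate node.
    """
    return sum(WEIGHTS[dim] * value for dim, value in evals.items())

# Example: a node in the right rack, hosting a worker of the same job,
# with 2 cross-node connections, partially used, 3 of 8 GPUs allocated.
print(final_score({"rack": 1, "affinity": 1, "cross_node": 2 / 3,
                   "large_task": 1, "hw_resource": 3 / 8}))
```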
S850. Determine whether suitable resources have been selected for all tasks contained in the job; if yes, perform S860; if no, perform S820.
S860. Deliver the job.
For example, after suitable resources are selected for the job, the job is delivered to the corresponding target nodes.
S870. Update the number of network transmission connections of jobs on the nodes.
For example, after all tasks have been selected and have obtained their corresponding resources, the network transmission connection count node.num_cross_nodes_job of each node is updated, and the job starts running.
It should be noted that the weights of the dimensions in the above job scheduling method can all be adjusted during ranking, as long as the overall goal is met; that is, the evaluations of the different dimensions may be based mainly on network transmission bandwidth and on avoiding resource fragmentation. Considering network transmission bandwidth improves the running efficiency of AI jobs; avoiding resource fragmentation takes task placement into account, so that large blocks of resources are kept for subsequently placing large-scale tasks, improving the overall resource utilization.
It should be understood that the above examples are intended to help a person skilled in the art understand the embodiments of this application, not to limit the embodiments of this application to the specific values or scenarios illustrated. A person skilled in the art can clearly make various equivalent modifications or changes based on the above examples, and such modifications or changes also fall within the scope of the embodiments of this application.
FIG. 8 above evaluates the candidate nodes by multiple dimensions in parallel; similarly, the candidate nodes may also be evaluated by the above dimensions serially. The procedure for evaluating the candidate nodes serially by the above dimensions is described in detail below.
FIG. 9 is a schematic flowchart of a job scheduling method according to an embodiment of this application. The method includes steps S901 to S911, which are described in detail below.
S901. Parse all tasks contained in the job and select nodes for these tasks in turn.
The job may be an AI training job, or another job that has network transmission requirements when running.
For example, the scheduler may take a job from the job queue for scheduling according to a certain rule, such as the dominant resource fairness (DRF) algorithm or another algorithm. The scheduler parses all tasks contained in the job and schedules each task in turn, selecting a suitable node for binding; the bound node is used to execute the task.
S902. Select a task and pre-select resources according to the task requirements, screening out a candidate node set N1 that satisfies the conditions.
As an example, qualified nodes may be screened out through node port filtering, node label matching, and the like, for example, by the GPU type that a node contains.
For example, node port filtering may mean that the job can run on nodes other than those with a certain port number; node label matching may mean selecting the nodes for running the target job based on an IP address range.
The node pre-selection method in step S902 may use the common methods of existing schedulers, which is not limited here.
S903. Determine the set of racks to which the candidate nodes in the candidate node set N1 belong.
For example, as shown in FIG. 6, the second-level switches may be top-of-rack switches, and the servers (also called nodes) under the multiple top-of-rack switches can be networked.
S904. Evaluate by the rack dimension. The goal of evaluating by this dimension is to place the multiple tasks contained in a single job in the same rack as far as possible, thereby improving network transmission efficiency.
As an example, the racks may be sorted according to a certain rule, and the nodes managed by each rack are then traversed in order.
The rack sorting rule may be as follows: if none of the tasks of a job has been allocated resources, the placeability of the job on the nodes managed by a rack is considered, that is, racks whose nodes can hold all the tasks of the job are sorted first and the others later; if some tasks of the job have already completed resource allocation, the racks to which the nodes hosting those tasks belong are sorted first and the others later.
It should be noted that for the specific implementation of step S904, refer to step S831 shown in FIG. 8; details are not repeated here.
S905. Evaluate by the affinity dimension between parameter node (PS) tasks and worker node tasks, that is, the PS-worker affinity dimension. The goal of evaluating by this dimension is to make good use of the network transmission bandwidth between worker nodes by placing them together, and at the same time to avoid, as far as possible, concentrating PS tasks on the same node, which would make the PS a bottleneck.
It should be understood that the parameter node PS and the worker node worker may refer to different task types. For example, as shown in FIG. 3, if a node is the parameter node 310, that is, it is responsible for maintaining the parameters of the model, updating them during iterative training, and distributing the parameters to different devices to update the model, the node is a PS; if a node is a GPU in the node 320 or the node 330 used to execute the data iteration of a certain batch, the node is a worker; if a node is neither a PS nor a worker, the node is a resource that can be used for task scheduling.
The above affinity means that if application A and application B interact frequently, it is worthwhile to place the two applications as close as possible, even on one node, to reduce the performance loss caused by network communication.
As an example, the nodes belonging to the sorted racks may be traversed in turn, and the nodes are sorted into K1, K2, and K3 according to the affinity rule.
Sorting the nodes by the affinity rule may mean: if a node hosts a worker-type task contained in the job, the node is put into set K1; if a node hosts a PS-type task contained in the job, the node is put into set K2; the other nodes are put into set K3.
It should be noted that for the specific implementation of step S905, refer to step S832 shown in FIG. 8; details are not repeated here.
S906. Evaluate by the cross-node network transmission load dimension. The goal of evaluating by this dimension is to evaluate how much inter-node bandwidth the jobs with allocated resources occupy.
As an example, the nodes in Ki (for example, K1, K2, and K3) are traversed in turn, and the nodes are divided into sets T1 and T2 according to whether the current node can hold all the tasks of the job.
For example, nodes with the same load may be merged according to the number of cross-node jobs, forming sets G1, G2, ..., Gn.
In an embodiment, if the number of nodes in a certain Ki is 0, the procedure returns to step S905.
For example, if the number of nodes in K1 is 0, that is, no node hosts a worker-type task contained in the job, the procedure returns to traverse the nodes in K2 and queries whether K2 contains a node hosting a PS-type task contained in the job; if the nodes in each Ki are traversed in turn and the count in every Ki is 0, the procedure ends.
S907. Evaluate by the cross-node network transmission load dimension. The goal of evaluating by this dimension is to evaluate how much inter-node bandwidth the jobs with allocated resources occupy.
As an example, the nodes in Ti (for example, T1 and T2) may be traversed in turn and sorted by the network transmission load on the current node, for example, the number of cross-node jobs; in addition, nodes with the same load may be merged to form sets G1, G2, ..., Gn.
For example, if the network transmission connection counts of node 1, node 2, and node 3 are 3, 3, and 2 respectively, each node may be evaluated dimension by dimension, that is, node 1 to node 3 may be evaluated in turn; alternatively, node 1 and node 2, which have the same network transmission connection count, may be merged, that is, nodes with the same load are merged into a set and then evaluated together, which improves the accuracy of the evaluation.
In an embodiment, if the number of nodes in a certain Ti is 0, the procedure returns to step S906.
For example, if T1 contains 5 nodes, the 5 nodes in T1 are traversed to determine whether any node can hold all the tasks of the job; if T1 is empty, the procedure returns to step S906 and traverses the nodes in T2 to find whether a node can hold all the tasks of the job; if the nodes in each Ti are traversed in turn and no Ti contains a node that can hold all the tasks of the job, the procedure ends.
It should be noted that for the specific implementation of steps S906 and S907, refer to step S833 shown in FIG. 8; details are not repeated here.
S908. Evaluate by the large-scale-task dimension. The goal of evaluating by this dimension is to keep completely idle GPU hosts in reserve as far as possible so that large-scale tasks can be placed, avoiding resource fragmentation.
As an example, the nodes in Gi may be traversed in turn and sorted by the number of GPUs already allocated on the current node.
For example, a task may be placed on the node with the most allocated GPUs, thereby avoiding resource fragmentation.
It should be noted that for the specific implementation of step S908, refer to steps S834 and S835 shown in FIG. 8; details are not repeated here.
S909. Determine whether suitable resources have been selected for all tasks contained in the job; if yes, perform step S910; if no, perform step S902.
S910. Deliver the job.
For example, after suitable resources are selected for the job, the job is delivered to the corresponding nodes.
S911. Update the number of network transmission connections of jobs on the nodes.
For example, after all tasks have been selected and have obtained their corresponding resources, the network transmission connection count of each node is updated, and the job starts running.
It should be understood that the job scheduling method shown in FIG. 8 makes a multi-dimensional judgment on every candidate node in the candidate node set, whereas the job scheduling method shown in FIG. 9 first selects a first part of the candidate nodes by the first dimension, then selects a subset of the first part, that is, a second part of the candidate nodes, by the second dimension, and then selects a subset of the second part, that is, a third part of the nodes, by the third selection dimension; similarly, the multiple selection dimensions are traversed in turn, as the code sketch below illustrates.
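As an illustration only and not part of the original disclosure, the following Python sketch shows the serial narrowing of FIG. 9; the stage functions are placeholders for the dimension-specific selections of S904 to S908.

```python
def serial_filter(nodes, stages):
    """FIG. 9-style serial selection over ordered dimensions.

    Each stage takes the surviving nodes and returns the subset preferred
    at that dimension; an empty result skips the stage, mirroring the
    fallbacks of S905 to S907.
    """
    survivors = list(nodes)
    for keep_preferred in stages:
        kept = keep_preferred(survivors)
        if kept:  # never let one stage empty the pool
            survivors = kept
    return survivors

# Toy usage: nodes are (name, rack, cross_node_count) tuples.
nodes = [("n1", "rack1", 3), ("n2", "rack1", 0), ("n3", "rack2", 1)]
same_rack = lambda ns: [n for n in ns if n[1] == "rack1"]
quietest = lambda ns: [n for n in ns if n[2] == min(x[2] for x in ns)]
print(serial_filter(nodes, [same_rack, quietest]))  # [('n2', 'rack1', 0)]
```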
It should be noted that the weights of the dimensions in the above job scheduling method can all be adjusted during ranking, as long as the overall goal is met; that is, the evaluations of the different dimensions may be based mainly on network transmission bandwidth and on avoiding resource fragmentation. Considering network transmission bandwidth improves the running efficiency of AI jobs; avoiding resource fragmentation takes task placement into account, so that large blocks of resources are kept for subsequently placing large-scale tasks, improving the overall resource utilization.
It should be understood that the above examples are intended to help a person skilled in the art understand the embodiments of this application, not to limit the embodiments of this application to the specific values or scenarios illustrated. A person skilled in the art can clearly make various equivalent modifications or changes based on the above examples, and such modifications or changes also fall within the scope of the embodiments of this application.
The job scheduling method in the embodiments of this application has been described in detail above with reference to FIG. 1 to FIG. 9; the apparatus embodiments of this application are described in detail below with reference to FIG. 10 and FIG. 11. It should be understood that the job scheduling apparatus in the embodiments of this application can perform the various job scheduling methods of the foregoing embodiments of this application; for the specific working processes of the following products, refer to the corresponding processes in the foregoing method embodiments.
FIG. 10 is a schematic block diagram of a job scheduling apparatus 1000 according to an embodiment of this application.
It should be understood that the job scheduling apparatus 1000 can perform the steps of the job scheduling methods shown in FIG. 7 to FIG. 9; to avoid repetition, details are not described here again. The job scheduling apparatus 1000 includes a receiving unit 1010 and a processing unit 1020.
The receiving unit 1010 is configured to receive a target job, where the target job includes n tasks. The processing unit 1020 is configured to: perform node screening in a node cluster separately according to the n tasks of the target job, to obtain n candidate node sets, where each candidate node set includes multiple candidate nodes; and select, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task, where the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one or any combination of the same-rack aggregation degree of the n tasks, the affinity between the n tasks, the cross-node degree of the n tasks, and the node idleness degree, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
Optionally, in an embodiment, a higher same-rack aggregation degree of the n tasks indicates a higher network transmission performance score, and the processing unit 1020 is specifically configured to:
determine whether all of the n tasks can be placed in the rack in which a candidate node in the m-th candidate node set is located;
if yes, increase the network transmission performance score of the candidate node;
if no, decrease the network transmission performance score of the candidate node.
Optionally, in an embodiment, a higher affinity between the n tasks indicates a higher network transmission performance score, and the processing unit 1020 is specifically configured to:
confirm the type of the m-th task;
when the type of the m-th task is a worker node task, determine whether the candidate node in the m-th candidate node set needs to host another worker node task or a parameter node task of the n tasks; and if yes, increase the network transmission performance score of the candidate node;
when the type of the m-th task is a parameter node task, determine whether the candidate node in the m-th candidate node set needs to host a worker node task of the n tasks; if yes, increase the network transmission performance score of the candidate node; and determine whether the candidate node in the m-th candidate node set needs to host another parameter node task of the n tasks; if yes, decrease the network transmission performance score of the candidate node.
Optionally, in an embodiment, the processing unit 1020 is specifically configured to:
confirm the cross-node quantity of the candidate node in the m-th candidate node set when the candidate node processes other jobs in the running state;
when all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a larger increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a smaller increase;
when not all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a smaller increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a larger increase.
Optionally, in an embodiment, a smaller node idleness degree indicates a higher network transmission performance score, and the processing unit 1020 is specifically configured to:
determine whether the hardware resources used for job training on the candidate node in the m-th candidate node set are in use; and if yes, increase the network transmission performance score of the candidate node.
Optionally, in an embodiment, the processing unit 1020 is further configured to:
confirm the allocation rate of the hardware resources used for job training on the candidate node in the m-th candidate node set;
increase the network transmission performance score of the candidate node according to the allocation rate, where a larger allocation rate leads to a larger increase in the network transmission performance score of the candidate node, and a smaller allocation rate leads to a smaller increase.
Optionally, in an embodiment, each task of the target job carries a hardware resource requirement, and the processing unit 1020 is specifically configured to:
perform node screening in the node cluster separately according to the hardware resource requirement carried by each task, to obtain the n candidate node sets, where the hardware resources of each of the n candidate node sets match the hardware resource requirement carried by the corresponding task.
Optionally, in an embodiment, the target job includes a training job of an artificial intelligence model.
It should be understood that the job scheduling apparatus 1000 here is embodied in the form of functional units. The term "unit" here may be implemented in the form of software and/or hardware, which is not specifically limited.
For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that implements the above functions. The hardware circuit may include an application-specific integrated circuit (ASIC), an electronic circuit, a processor (for example, a shared processor, a dedicated processor, or a group processor) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components supporting the described functions.
Therefore, the units of the examples described in the embodiments of this application can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
FIG. 11 is a schematic diagram of the hardware structure of a job scheduling apparatus according to an embodiment of this application.
The job scheduling apparatus 1100 shown in FIG. 11 may include a memory 1101, a processor 1102, a communication interface 1103, and a bus 1104. The memory 1101, the processor 1102, and the communication interface 1103 are communicatively connected to one another through the bus 1104.
The memory 1101 may be a read-only memory (ROM), a static storage device, or a random access memory (RAM). The memory 1101 may store a program; when the program stored in the memory 1101 is executed by the processor 1102, the processor 1102 and the communication interface 1103 are configured to perform the steps of the job scheduling method in the embodiments of this application, for example, the steps of the job scheduling methods shown in FIG. 7 to FIG. 9.
The processor 1102 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions that the units of the job scheduling apparatus shown in FIG. 10 need to perform, or to perform the job scheduling method of the method embodiments of this application.
The processor 1102 may also be an integrated circuit chip with a signal processing capability. In an implementation process, the steps of the job scheduling method in the embodiments of this application may be completed by integrated logic circuits of hardware in the processor 1102 or by instructions in the form of software.
The processor 1102 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1101, and the processor 1102 reads the information in the memory 1101 and, in combination with its hardware, completes the functions that the units included in the job scheduling apparatus of the embodiments of this application need to perform, or performs the job scheduling method of the method embodiments of this application.
For example, the processor 1102 may correspond to the processing unit 1020 in the job scheduling apparatus shown in FIG. 10.
The communication interface 1103 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the job scheduling apparatus 1100 and other devices or communication networks.
For example, the communication interface 1103 may correspond to the receiving unit 1010 in the job scheduling apparatus 1000 shown in FIG. 10, and the resource request of the target job may be received through the communication interface 1103.
The bus 1104 may include a path for transferring information between the components of the job scheduling apparatus 1100 (for example, the memory 1101, the processor 1102, and the communication interface 1103).
It should be noted that, although only the memory, the processor, and the communication interface are shown for the job scheduling apparatus 1100 above, in a specific implementation process a person skilled in the art should understand that the job scheduling apparatus 1100 may also include other components necessary for normal operation. At the same time, according to specific needs, a person skilled in the art should understand that the job scheduling apparatus 1100 may also include hardware components implementing other additional functions. In addition, a person skilled in the art should understand that the job scheduling apparatus 1100 may also include only the components necessary for implementing the embodiments of this application, and does not have to include all the components shown in FIG. 11.
An embodiment of this application further provides a chip. The chip includes a transceiver unit and a processing unit, where the transceiver unit may be an input/output circuit or a communication interface, and the processing unit is a processor, a microprocessor, or an integrated circuit integrated on the chip. The chip can perform the job scheduling method in the above method embodiments.
An embodiment of this application further provides a computer-readable storage medium that stores instructions; when the instructions are executed, the job scheduling method in the above method embodiments is performed.
An embodiment of this application further provides a computer program product containing instructions; when the instructions are executed, the job scheduling method in the above method embodiments is performed.
It should be understood that the processor in the embodiments of this application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the memory in the embodiments of this application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of random access memory (RAM) are available, such as a static random access memory (static RAM, SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented by software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or a data center containing one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
It should be understood that the term "and/or" in this document describes only an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects before and after it, but may also indicate an "and/or" relationship; refer to the context for details.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" or a similar expression means any combination of these items, including a single item or any combination of plural items. For example, at least one of a, b, or c may indicate a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or multiple.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation process of the embodiments of this application.
A person of ordinary skill in the art may notice that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division into units is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
When the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of this application essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of variations or replacements within the technical scope disclosed in this application, and they shall all be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (18)

  1. A job scheduling method, comprising:
    receiving a target job, wherein the target job comprises n tasks;
    performing node screening in a node cluster separately according to the n tasks of the target job, to obtain n candidate node sets, wherein each candidate node set comprises a plurality of candidate nodes; and
    selecting, from an m-th candidate node set corresponding to an m-th task of the n tasks, a candidate node with a highest network transmission performance score as a target node of the m-th task, wherein the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one or any combination of a same-rack aggregation degree of the n tasks, an affinity between the n tasks, a cross-node degree of the n tasks, and a node idleness degree, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
  2. The method according to claim 1, wherein a higher same-rack aggregation degree of the n tasks indicates a higher network transmission performance score, and the selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task comprises:
    determining whether all of the n tasks can be placed in a rack in which a candidate node in the m-th candidate node set is located;
    if yes, increasing the network transmission performance score of the candidate node; and
    if no, decreasing the network transmission performance score of the candidate node.
  3. The method according to claim 1, wherein a higher affinity between the n tasks indicates a higher network transmission performance score, and the selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task comprises:
    confirming a type of the m-th task;
    when the type of the m-th task is a worker node task, determining whether a candidate node in the m-th candidate node set needs to host another worker node task or a parameter node task of the n tasks, and if yes, increasing the network transmission performance score of the candidate node; and
    when the type of the m-th task is a parameter node task, determining whether a candidate node in the m-th candidate node set needs to host a worker node task of the n tasks, and if yes, increasing the network transmission performance score of the candidate node; and determining whether the candidate node needs to host another parameter node task of the n tasks, and if yes, decreasing the network transmission performance score of the candidate node.
  4. The method according to claim 1, wherein the selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task comprises:
    confirming a cross-node quantity of a candidate node in the m-th candidate node set when the candidate node processes other jobs in a running state;
    when all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a larger increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a smaller increase; and
    when not all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a smaller increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a larger increase.
  5. The method according to claim 1, wherein a smaller node idleness degree indicates a higher network transmission performance score, and the selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task comprises:
    determining whether hardware resources used for job training on a candidate node in the m-th candidate node set are in use, and if yes, increasing the network transmission performance score of the candidate node.
  6. The method according to claim 5, wherein the selecting, from the m-th candidate node set corresponding to the m-th task of the n tasks, the candidate node with the highest network transmission performance score as the target node of the m-th task further comprises:
    confirming an allocation rate of the hardware resources used for job training on the candidate node in the m-th candidate node set; and
    increasing the network transmission performance score of the candidate node according to the allocation rate, wherein a larger allocation rate leads to a larger increase in the network transmission performance score of the candidate node, and a smaller allocation rate leads to a smaller increase.
  7. The method according to any one of claims 1 to 6, wherein each task of the target job carries a hardware resource requirement, and the performing node screening in the node cluster separately according to the n tasks of the target job, to obtain n candidate node sets, comprises:
    performing node screening in the node cluster separately according to the hardware resource requirement carried by each task, to obtain the n candidate node sets, wherein hardware resources of each of the n candidate node sets match the hardware resource requirement carried by the corresponding task.
  8. The method according to any one of claims 1 to 7, wherein the target job comprises a training job of an artificial intelligence model.
  9. A job scheduling apparatus, comprising:
    a receiving unit, configured to receive a target job, wherein the target job comprises n tasks; and
    a processing unit, configured to: perform node screening in a node cluster separately according to the n tasks of the target job, to obtain n candidate node sets, wherein each candidate node set comprises a plurality of candidate nodes; and select, from an m-th candidate node set corresponding to an m-th task of the n tasks, a candidate node with a highest network transmission performance score as a target node of the m-th task, wherein the target node of the m-th task is used to process the m-th task, the network transmission performance score is determined by one or any combination of a same-rack aggregation degree of the n tasks, an affinity between the n tasks, a cross-node degree of the n tasks, and a node idleness degree, n is an integer greater than or equal to 1, and m is any positive integer from 1 to n.
  10. The job scheduling apparatus according to claim 9, wherein a higher same-rack aggregation degree of the n tasks indicates a higher network transmission performance score, and the processing unit is specifically configured to:
    determine whether all of the n tasks can be placed in a rack in which a candidate node in the m-th candidate node set is located;
    if yes, increase the network transmission performance score of the candidate node; and
    if no, decrease the network transmission performance score of the candidate node.
  11. The job scheduling apparatus according to claim 9, wherein a higher affinity between the n tasks indicates a higher network transmission performance score, and the processing unit is specifically configured to:
    confirm a type of the m-th task;
    when the type of the m-th task is a worker node task, determine whether a candidate node in the m-th candidate node set needs to host another worker node task or a parameter node task of the n tasks, and if yes, increase the network transmission performance score of the candidate node; and
    when the type of the m-th task is a parameter node task, determine whether a candidate node in the m-th candidate node set needs to host a worker node task of the n tasks, and if yes, increase the network transmission performance score of the candidate node; and determine whether the candidate node needs to host another parameter node task of the n tasks, and if yes, decrease the network transmission performance score of the candidate node.
  12. The job scheduling apparatus according to claim 9, wherein the processing unit is specifically configured to:
    confirm a cross-node quantity of a candidate node in the m-th candidate node set when the candidate node processes other jobs in a running state;
    when all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a larger increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a smaller increase; and
    when not all of the n tasks can be placed on the candidate node, a larger cross-node quantity leads to a smaller increase in the network transmission performance score of the candidate node, and a smaller cross-node quantity leads to a larger increase.
  13. The job scheduling apparatus according to claim 9, wherein a smaller node idleness degree indicates a higher network transmission performance score, and the processing unit is specifically configured to:
    determine whether hardware resources used for job training on a candidate node in the m-th candidate node set are in use, and if yes, increase the network transmission performance score of the candidate node.
  14. The job scheduling apparatus according to claim 13, wherein the processing unit is further configured to:
    confirm an allocation rate of the hardware resources used for job training on the candidate node in the m-th candidate node set; and
    increase the network transmission performance score of the candidate node according to the allocation rate, wherein a larger allocation rate leads to a larger increase in the network transmission performance score of the candidate node, and a smaller allocation rate leads to a smaller increase.
  15. The job scheduling apparatus according to any one of claims 9 to 14, wherein each task of the target job carries a hardware resource requirement, and the processing unit is specifically configured to:
    perform node screening in the node cluster separately according to the hardware resource requirement carried by each task, to obtain the n candidate node sets, wherein hardware resources of each of the n candidate node sets match the hardware resource requirement carried by the corresponding task.
  16. The job scheduling apparatus according to any one of claims 9 to 15, wherein the target job comprises a training job of an artificial intelligence model.
  17. A job scheduling apparatus, comprising a processor, a memory, and a communication interface, wherein the memory is configured to store computer-executable instructions, and when the apparatus runs, the processor executes the computer-executable instructions stored in the memory to perform the job scheduling method according to any one of claims 1 to 8.
  18. A computer-readable storage medium, comprising a computer program, wherein when the computer program runs on a computer, the computer is enabled to perform the job scheduling method according to any one of claims 1 to 8.
PCT/CN2020/129971 2019-12-09 2020-11-19 Job scheduling method and job scheduling apparatus WO2021115082A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20899083.8A EP4057142A4 (en) 2019-12-09 2020-11-19 TASK SCHEDULING METHOD AND TASK SCHEDULING APPARATUS
US17/835,143 US20220300323A1 (en) 2019-12-09 2022-06-08 Job Scheduling Method and Job Scheduling Apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201911253271 2019-12-09
CN201911253271.7 2019-12-09
CN202010407994.4 2020-05-14
CN202010407994.4A CN113037800B (zh) 2019-12-09 2020-05-14 Job scheduling method and job scheduling apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/835,143 Continuation US20220300323A1 (en) 2019-12-09 2022-06-08 Job Scheduling Method and Job Scheduling Apparatus

Publications (1)

Publication Number Publication Date
WO2021115082A1 true WO2021115082A1 (zh) 2021-06-17

Also Published As

Publication number Publication date
US20220300323A1 (en) 2022-09-22
EP4057142A1 (en) 2022-09-14
EP4057142A4 (en) 2022-12-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 20899083; Country of ref document: EP; Kind code of ref document: A1.
ENP Entry into the national phase. Ref document number: 2020899083; Country of ref document: EP; Effective date: 20220607.
NENP Non-entry into the national phase. Ref country code: DE.