WO2016206564A1 - Job scheduling method, apparatus and distributed system - Google Patents

Job scheduling method, apparatus and distributed system

Info

Publication number
WO2016206564A1
WO2016206564A1 (PCT/CN2016/086102)
Authority
WO
WIPO (PCT)
Prior art keywords
task
control node
slice
run
running
Prior art date
Application number
PCT/CN2016/086102
Other languages
English (en)
French (fr)
Inventor
才华
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2016206564A1
Priority to US15/852,786 (published as US10521268B2)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/48 — Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 — Task transfer initiation or dispatching
    • G06F 9/4843 — Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 — Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 — Partitioning or combining of resources
    • G06F 9/5066 — Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/52 — Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • the present application belongs to the field of data processing technologies, and in particular, to a job scheduling method, apparatus, and distributed system.
  • In a traditional distributed system, a distributed job includes a finite number of tasks with certain dependencies between them, and each task is divided into multiple task slices (instances); when the job runs, data processing is performed by the task slices of each task.
  • Job scheduling is usually performed uniformly by the central node (master). For example, suppose a job has two tasks, task1 and task2, where task2's input is task1's output. When the job runs, the central node looks for a task without a predecessor, here task1, and runs it first; only after all task slices of task1 have finished does the central node schedule task2 to run.
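To make the barrier in this prior-art scheme concrete, the following is a minimal sketch of a central node that only dispatches task2 after every slice of task1 has returned; the names run_slice and central_schedule are illustrative and not taken from the patent:

```python
from concurrent.futures import ThreadPoolExecutor

def run_slice(task_name, slice_id):
    # Stand-in for the real work performed on a compute node.
    return f"{task_name}-slice{slice_id}-output"

def central_schedule(job):
    """job: list of (task_name, slice_count) in dependency order."""
    outputs = {}
    with ThreadPoolExecutor() as pool:
        for task_name, slice_count in job:
            # Barrier: the next task is not scheduled until ALL slices of the
            # previous task have finished.
            futures = [pool.submit(run_slice, task_name, i) for i in range(slice_count)]
            outputs[task_name] = [f.result() for f in futures]
    return outputs

print(central_schedule([("task1", 3), ("task2", 2)]))
```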
  • the technical problem to be solved by the present application is to provide a job scheduling method, a device, and a distributed system, which improve scheduling efficiency and improve resource utilization.
  • The present application discloses a job scheduling method applied to a distributed system, where the distributed system includes at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node; the central node assigns the tasks of the job to the individual control nodes; each control node schedules the task slices of the task assigned to it to run in the computing nodes connected to it; the method comprises:
  • when at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
  • the second control node acquires the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task;
  • the second control node schedules at least one task slice of the second task to run to process the running data.
  • The second control node scheduling at least one task slice of the second task to run and processing the running data includes:
  • scheduling, in the second task, the task slice whose required data version matches the running data version to run and process the running data.
  • When the required data version of the first task slice in the second task does not match the running data, the method further includes:
  • the second control node requests the first control node to re-run the task slice corresponding to the running data version in the first task;
  • the first control node schedules the task slice corresponding to the running data version in the first task to re-run and, after the re-run ends, notifies the second control node to acquire the regenerated running data;
  • the second control node acquires the regenerated running data and schedules the first task slice to run when the regenerated running data version matches the required data version of the first task slice; otherwise it requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice.
  • After the second control node schedules any task slice of the second task to run and process the running data, the method further includes:
  • when a second task slice in the second task fails to process the running data, the second control node requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run;
  • the first control node schedules the task slice corresponding to the running data version in the first task to re-run; and after that task slice finishes re-running, notifies the second control node to acquire the regenerated running data;
  • the second control node acquires the regenerated running data and schedules the second task slice of the second task to run to process the regenerated running data.
  • The second control node requesting the first control node to schedule the task slice corresponding to the running data version in the first task to re-run includes:
  • the second control node raising the required data version to request the first control node to re-run the task slice corresponding to the running data version in the first task.
  • a distributed system includes a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes connected to each control node;
  • the central node is configured to allocate a task of a job to each control node
  • the computing node is configured to run a task slice allocated by a control node connected thereto;
  • The first control node among the control nodes is configured to, when at least one task slice of the first task assigned to it finishes running, notify the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
  • the second control node is configured to acquire the running data generated by the at least one task slice of the first task, allocate it to each task slice of the second task, and schedule at least one task slice of the second task to run to process the running data.
  • The second control node scheduling each task slice of the second task to run and processing the running data includes:
  • scheduling, in the second task, the task slice whose required data version matches the running data version to run and process the running data.
  • the second control node is further configured to:
  • when the required data version of the first task slice in the second task does not match the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the running data version regenerated by the first control node, and schedule the first task slice to run when the regenerated running data version matches the required data version of the first task slice; otherwise request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice;
  • the first control node is further configured to, when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run, and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
  • the second control node is further configured to, when the second task slice in the second task fails to process the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the regenerated running data version, and schedule the task slice of the second task to run to process the re-run data version;
  • the first control node is further configured to, when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run; and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
  • The second control node requesting the first control node to schedule the task slice corresponding to the running data version in the first task to re-run is specifically:
  • raising the required data version to request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run.
  • a job scheduling apparatus is applied to a control node of a distributed system, the distributed system including at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node;
  • the control node acquires a task of the job assigned by the central node, and schedules each task slice of the task to run in a computing node connected thereto, and the device includes:
  • a notification module configured to notify a control node that schedules the second task to acquire operation data generated by at least one task slice operation of the first task when the at least one task slice operation of the first task ends;
  • so that the control node that schedules the second task allocates the acquired running data to each task slice of the second task, schedules each task slice of the second task to run, and processes the running data;
  • the first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • a job scheduling apparatus is applied to a control node of a distributed system, the distributed system including at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node;
  • the control node acquires a task of the job assigned by the central node, and schedules each task slice of the task to run in a computing node connected thereto, and the device includes:
  • an acquiring module configured to acquire, upon receiving a notification from the control node that schedules the first task, the running data generated by at least one task slice of the first task; the notification is sent by the control node that schedules the first task after the at least one task slice of the first task finishes running; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
  • An allocation module configured to allocate, to the obtained operation data generated by the at least one task slice of the first task, to each task slice of the second task;
  • a scheduling module configured to schedule at least one task slice operation of the second task to process the running data.
  • The distributed system provided by the present application is composed of a central node, control nodes, and computing nodes.
  • The central node performs task assignment, and the control nodes are responsible for task scheduling, thereby reducing the scheduling pressure on the central node and improving scheduling efficiency.
  • During job scheduling, after at least one task slice of the first task in the job finishes running, at least one task slice of the second task can already be scheduled to run and process the running data, without waiting for all task slices of the first task to finish.
  • The task slices of the second task can thus be scheduled for data processing, making full use of cluster resources, improving resource utilization and task concurrency, and reducing job running time.
  • FIG. 1 is a schematic structural diagram of a distributed system according to an embodiment of the present application.
  • FIG. 2 is a flow chart of an embodiment of a job scheduling method according to an embodiment of the present application.
  • FIG. 3 is a flow chart of still another embodiment of a job scheduling method according to an embodiment of the present application.
  • FIG. 4 is a flowchart of still another embodiment of a job scheduling method according to an embodiment of the present application.
  • FIG. 5 is a flowchart of still another embodiment of a job scheduling method according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of task dependencies in a practical application according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of a job scheduling apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another embodiment of a job scheduling apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of still another embodiment of a job scheduling apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of still another embodiment of a job scheduling apparatus according to an embodiment of the present application.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include volatile memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
  • In the prior art, job scheduling is performed uniformly by the central node: the central node must wait for all task slices of one task in the job to finish before scheduling the task slices of another task that depends on it.
  • This centralized control means the central node holds a huge amount of data information; in particular, once the job scale grows beyond a certain point, the central node's information volume explodes, severely affecting scheduling efficiency.
  • Moreover, when the tasks of a job run, all task slices of the preceding task in the dependency must finish before the task slices of the next task can be scheduled, which may leave cluster resources idle and therefore wasted.
  • To solve this technical problem, the embodiment of the present application provides a distributed system, as shown in FIG. 1.
  • the distributed system is composed of a central node 101, a plurality of control nodes 102 connected to the central node, and a plurality of computing nodes 103 connected to each control node 102.
  • The central node 101 assigns the tasks of the job to the individual control nodes.
  • Each control node 102 schedules the task slices of the task assigned to it to run in the computing nodes connected to it.
  • the computing node 103 runs a task slice allocated by the control node connected thereto;
  • the central node may first allocate a part of tasks to each control node, and when there are control nodes with idle processing resources, allocate other unallocated tasks.
  • Each control node controls the operation of a task of the job.
  • the scheduling of the task of the job is implemented by the control node, and the central node is not required to perform unified scheduling, thereby reducing the scheduling pressure of the central node, thereby improving scheduling efficiency.
  • Through the scheduling performed by the control nodes, after at least one task slice of the first task in the job finishes running, the task slices of the second task can already be scheduled to run and process the running data, without waiting for all task slices of the first task to finish; this makes full use of cluster resources and avoids waste.
  • In the distributed system of this embodiment, through interaction between control nodes, as soon as any task slice of any task finishes running, a task slice of the next task that depends on it can be scheduled to start running instead of remaining in a waiting state, which improves resource utilization and task concurrency and reduces job running time.
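The following is a rough Python sketch of the three roles just described; the class and method names (CentralNode, ControlNode, ComputeNode, assign, schedule, run) are assumptions for illustration and are not defined by the patent:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ComputeNode:
    node_id: int
    def run(self, task: str, slice_id: int) -> str:
        return f"{task}-slice{slice_id}-v0"          # running data, version 0

@dataclass
class ControlNode:
    node_id: int
    compute_nodes: List[ComputeNode]
    task: Optional[str] = None                       # the single task assigned to this node
    def schedule(self, slice_count: int) -> List[str]:
        # Spread the task's slices over the attached compute nodes.
        return [self.compute_nodes[i % len(self.compute_nodes)].run(self.task, i)
                for i in range(slice_count)]

@dataclass
class CentralNode:
    control_nodes: List[ControlNode]
    def assign(self, tasks: List[str]) -> None:
        # The central node only assigns whole tasks; slice scheduling is left
        # to each control node.
        for node, task in zip(self.control_nodes, tasks):
            node.task = task
```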
  • FIG. 2 is a flowchart of an embodiment of a job scheduling method according to an embodiment of the present application.
  • the technical solution is specifically applied to the distributed system shown in FIG. 1 , and the method may include the following steps:
  • The first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • the first control node refers to a control node that schedules the first task
  • the second control node refers to a control node that schedules the second task.
  • Which control node the first task and the second task are each assigned to for scheduling is determined in advance by the central node.
  • A task slice running specifically means that the control node controls the task slice to run in a computing node; for convenience, the embodiments of the present application simply describe this as the task slice running.
  • This application is applicable to application scenarios in which each task in the job has dependencies, such as a DAG (Directed Acyclic Graph) model, and the output data of the previous task is the input data of the next task.
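As a small illustration of such a dependency model, a job can be viewed as a DAG in which a task becomes schedulable once its predecessors have finished; the dictionary layout and the ready_tasks helper below are assumptions, not the patent's data structures:

```python
# An edge A -> B means B's input is A's output; task names are made up.
job_dag = {
    "task1": [],          # no predecessor, may start immediately
    "task2": ["task1"],   # depends on task1's output
}

def ready_tasks(dag, finished):
    """Tasks not yet finished whose predecessors have all finished."""
    return [t for t, deps in dag.items()
            if t not in finished and all(d in finished for d in deps)]

print(ready_tasks(job_dag, set()))         # ['task1']
print(ready_tasks(job_dag, {"task1"}))     # ['task2']
```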
  • the first task can be any one of the tasks in the job that is running.
  • the second task may refer to any task that depends on the first task.
  • Whenever a task slice of the first task finishes running and generates running data, the first control node notifies the second control node to acquire that running data.
  • The second control node acquires the running data generated by at least one task slice of the first task and allocates it to each task slice of the second task.
  • After receiving the notification of the first control node, the second control node acquires, from the first control node, the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task.
  • the second control node schedules at least one task slice operation of the second task to process the operation data.
  • After at least one task slice of the first task finishes running, the second control node acquires the running data it generated and allocates it to the task slices of the second task, so the task slices of the second task can be scheduled to run and process the running data.
  • In this embodiment, after at least one task slice of the first task in the job finishes running, the first control node notifies the second control node, and the second control node can schedule the task slices of the second task to run and process the running data.
  • There is no need to wait for all task slices of the first task to finish; the task slices of the second task can already be scheduled for data processing, making full use of cluster resources and avoiding waste.
  • the scheduling process is controlled by the control node, and the central node is not required to perform unified scheduling.
  • the central node is only responsible for the assignment of tasks, thereby reducing the scheduling pressure of the central node and improving the scheduling efficiency.
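A hedged sketch of the notify/acquire/schedule exchange between the two control nodes just described; the class names and the fetch, on_notify and slice_finished methods are illustrative assumptions rather than the patent's protocol:

```python
class FirstControlNode:
    """Schedules the first task; pushes a notification per finished slice."""
    def __init__(self):
        self.outputs = {}

    def fetch(self, slice_id):
        return self.outputs[slice_id]

    def slice_finished(self, slice_id, data, downstream):
        self.outputs[slice_id] = data
        downstream.on_notify(self, slice_id)          # notify as soon as one slice ends

class SecondControlNode:
    """Schedules the second task; reacts to each notification at once."""
    def on_notify(self, first_node, slice_id):
        data = first_node.fetch(slice_id)             # acquire the running data
        # Schedule a slice of the second task immediately, without waiting for
        # the remaining slices of the first task.
        print(f"scheduling a task2 slice to process {data!r}")

first, second = FirstControlNode(), SecondControlNode()
first.slice_finished("M1", "M1-output-v0", second)
```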
  • A task slice may fail to run when its computing node has insufficient processing resources or other conditions that affect normal operation arise.
  • Therefore, the second control node scheduling at least one task slice of the second task to run and process the running data may be: scheduling the task slices of the second task whose processing resources meet a preset condition to process the running data; task slices whose processing resources do not meet the preset condition can wait until the condition is met before being scheduled.
  • In addition, when the running data version of a task slice of the first task does not match the required data version of a task slice of the second task, that task slice of the second task cannot run either.
  • The data version corresponds to the number of times the task slice has run: the data obtained at the end of the first run has version 0; on a re-run, the data obtained at the end of the second run has version 1, the third run produces version 2, and so on.
  • As shown in FIG. 3, which is a flowchart of still another embodiment of a job scheduling method according to an embodiment of the present application, the method may include the following steps:
  • When at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task.
  • The first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • the second control node acquires running data generated by at least one task slice of the first task, and allocates to each task slice of the second task.
  • the second control node schedules, in the second task, a task slice that matches a version of the required data with the running data version, and processes the running data.
  • That is, only when the running data version of the at least one task slice of the first task matches the required data version of a task slice of the second task can the second control node schedule that task slice of the second task to run and process the running data.
  • Since insufficient processing resources also affect whether a task slice can run, specifically the second control node schedules, in the second task, the task slices whose processing resources meet the preset condition and whose required data version matches the running data version, to run and process the running data.
  • In addition, if the running data version of the at least one task slice of the first task does not match the required data version of some task slice of the second task, call it the first task slice, the second control node may also request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run.
  • After receiving the request of the second control node, the first control node schedules the task slice corresponding to the running data version in the first task to re-run and, after the re-run ends, notifies the second control node to acquire the regenerated running data.
  • After receiving the notification of the first control node, the second control node acquires the regenerated running data and, when the regenerated running data version matches the required data version of the first task slice, schedules the first task slice of the second task to run; otherwise it requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice.
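One possible way to express this re-run loop in code, assuming hypothetical current_output and rerun calls on the first control node (neither is named in the patent):

```python
def wait_for_matching_version(first_node, slice_id, required_version, max_retries=10):
    """Keep requesting re-runs until the regenerated running data version matches."""
    for _ in range(max_retries + 1):
        version, data = first_node.current_output(slice_id)
        if version == required_version:
            return data                   # the downstream slice may now be scheduled
        first_node.rerun(slice_id)        # otherwise ask the first control node to re-run
    raise RuntimeError("no running data version matching the requirement was obtained")
```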
  • As shown in FIG. 4, in still another embodiment of the job scheduling method provided by an embodiment of the present application, the method may include the following steps:
  • When at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task.
  • The first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • The first task slice of the first task refers to any task slice in the first task.
  • the second control node acquires running data generated by at least one task slice of the first task, and allocates to each task slice of the second task.
  • the second control node determines whether the required data version of the first task slice of the second task matches the running data version of the first task slice of the first task; if yes, execute step 409, if no, execute Step 404.
  • the first task slice of the second task may refer to any one of the second tasks.
  • the first task slice of the first task may refer to any one of the task slices in the first task that has finished running.
  • the second control node requests the first control node to schedule the first task slice of the first task to re-run.
  • the request may carry the required data version of the second control node.
  • the first control node schedules the first task slice of the first task to re-run.
  • After receiving the request of the second control node, the first control node schedules the first task slice of the first task to re-run.
  • the first control node notifies the second control node to acquire the regenerated operational data after the first task slice rerun of the first task ends.
  • the second control node acquires the regenerated operational data.
  • After receiving the notification of the first control node, the second control node acquires the running data regenerated by the first task slice of the first task.
  • The second control node determines whether the regenerated running data version matches the required data version of the first task slice of the second task; if yes, step 409 is performed; if no, the process returns to step 404.
  • the second control node schedules a first task slice operation of the second task, and processes operation data regenerated by the first task slice of the first task.
  • If the required data version of the first task slice of the second task does not match the running data version of the first task slice of the first task, the first task slice of the first task may be scheduled to keep re-running until the required data version of the first task slice of the second task matches the running data version of the first task slice of the first task.
  • When scheduling the first task slice of the first task to re-run, if the required data version of the first task slice of the second task is lower than the running data version of the first task slice of the first task, then since the first task cannot regenerate data of a lower version, the first task slice of the first task cannot be scheduled to run.
  • Therefore, as yet another embodiment, step 405 may specifically be: the first control node schedules the first task slice of the first task to re-run when the required data version of the first task slice of the second task is higher than the current running data version of the first task slice of the first task.
  • the success rate of the task slicing operation is improved by the judgment of the required data version and the running data version.
  • When the required data version of a task slice of the second task does not match the running data version, the task slice of the first task may be scheduled to re-run until a running data version matching the required data version is obtained, which improves the data processing success rate.
  • This avoids the prior-art problem in which the task slices of the next task can only be scheduled after all task slices of one task have finished, so that once the running data of a task slice is wrong, the subsequent task fails because of the erroneous input data after it starts running.
  • Each control node may maintain a data version table for each task slice and store the obtained running data version in that table; only once the stored running data version matches the required data version in the table is the task slice scheduled to run.
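A minimal sketch of such a per-slice data version table; the dictionary layout and the slice names R1/R2 are assumptions for illustration, not the patent's data structure:

```python
version_table = {
    # slice: {"required": version this slice demands, "running": version received so far}
    "R1": {"required": 0, "running": None},
    "R2": {"required": 0, "running": None},
}

def record_running_version(table, slice_id, version):
    table[slice_id]["running"] = version

def runnable(table, slice_id, resources_ok=True):
    entry = table[slice_id]
    return resources_ok and entry["running"] == entry["required"]

record_running_version(version_table, "R1", 0)
print(runnable(version_table, "R1"))   # True: versions match, slice can be scheduled
print(runnable(version_table, "R2"))   # False: still waiting for its input data
```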
  • When the task slices of the second task are scheduled to run, any task slice of the second task may fail while processing the running data, for example because the running data cannot be read successfully.
  • To keep the job running normally and improve the self-repair capability against task failures, when any task slice of the second task fails to process the running data, the task slice of the first task corresponding to that running data version may also be scheduled to re-run.
  • As shown in FIG. 5, which is a flowchart of an embodiment of a job scheduling method provided by an embodiment of the present application, the method may include the following steps:
  • When at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the second task slice of the first task.
  • The first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • The second task slice of the first task may be any task slice in the first task.
  • the second control node acquires running data generated by at least one task slice of the first task, and allocates to each task slice of the second task.
  • the second control node schedules at least one task slice operation of the second task to process the running data.
  • the at least one task slice may be a task slice in which the required data version matches the running data version and the processing resource satisfies the preset condition.
  • The second control node determines whether the second task contains a second task slice that failed to process the running data generated by the second task slice of the first task; if yes, step 505 is performed, and if not, the process ends.
  • the second task slice of the second task may refer to any task slice in the second task that fails to process the running data.
  • the second task slice of the first task may be any one of the task slices that has been run in the first task.
  • the second control node requests the first control node to schedule the second task slice of the first task to re-run.
  • the first control node schedules the second task slice of the first task to be re-run.
  • the first control node notifies the second control node to acquire the regenerated operational data after the second task slice rerun of the first task ends.
  • the second control node acquires the regenerated running data, and schedules the second task slice running of the second task.
  • Since each re-run of a task slice can only generate running data of a higher version than the previous run, the fact that a task slice can run indicates that its current required data version matches the running data version.
  • Therefore, in order to schedule the second task slice of the first task to re-run, the second control node may raise the requested data version to trigger the re-run of the second task slice of the first task.
  • That is, the second control node requests the first control node to re-run the task slice corresponding to the running data version in the first task specifically by raising the required data version.
  • In other words, the second control node raises the required data version and carries the raised required data version in the request sent to the first control node.
  • The first control node triggers the task slice to re-run when the required data version in the request is higher than its current running data version.
  • In the embodiment of the present application, when data processing fails, the task slice can be scheduled to re-run, which improves resource utilization and also improves the job's self-repair capability against failures.
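The raise-the-required-version handshake can be sketched as follows; UpstreamSlice and handle_rerun_request are hypothetical names, and the version bookkeeping is simplified to a single counter:

```python
class UpstreamSlice:
    """A task slice of the first task, as seen by its control node."""
    def __init__(self):
        self.version = 0                          # version produced by the last run

    def handle_rerun_request(self, required_version):
        # Re-run only when the requested version is higher than the current one;
        # each re-run can only produce a higher-versioned output.
        if required_version > self.version:
            self.version += 1
        return self.version

m1 = UpstreamSlice()
required = 0                    # the downstream slice required version 0 and failed to read it
required += 1                   # so it raises its required data version
print(m1.handle_rerun_request(required) == required)   # True: the slice re-ran, matching again
```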
  • Suppose a job includes two dependent tasks, task1 and task2, where task2's input is task1's output; as shown in FIG. 6, task1 includes three task slices M1, M2 and M3, and task2 includes two task slices R1 and R2.
  • the central node assigns task1 to the first control node taskmaster1 scheduling, and assigns task2 to the second control node taskmaster2 scheduling.
  • Taskmaster1 and taskmaster2 can be any two control nodes in the control node.
  • taskmaster1 can provide the running data of each task slice of task1 to taskmaster2.
  • taskmaster2 assigns the running data to the task slices on each computing node; taskmaster2 can also request from taskmaster1 the data of the version each task slice requires.
  • Assuming task1's input data is ready and it can run normally, the central node triggers taskmaster1 to schedule task1 to run.
  • taskmaster2 may maintain a data version list for each task slice of task2, containing the running data version and the required data version.
  • the task slice of task2 can be scheduled to run only when the running data version and the required data version are consistent, and the processing resources meet the preset conditions.
  • the initial demand data version of each task slice is the same.
  • After any task slice of task1 finishes running, taskmaster1 notifies taskmaster2 to acquire the generated running data.
  • taskmaster2 assigns the acquired running data to each task slice of task2.
  • If the required data version of some task slice of task2, say R1, does not match the running data version of some task slice of task1, say M1, then R1 of task2 waits for M1's running data.
  • Meanwhile taskmaster2 can request from taskmaster1 the data of its required version, and taskmaster1 can schedule task1's task slice M1 to re-run when the required version requested by taskmaster2 is higher than the current running data version.
  • After the re-run ends, taskmaster1 can notify taskmaster2 to obtain the data generated by M1's re-run.
  • After receiving the notification of taskmaster1, taskmaster2 obtains the data generated by M1's re-run and assigns it to R1.
  • When the required data version of R1 matches the running data version of M1, R1 starts running and processes M1's running data.
  • If the requested required version is lower than the version of the running data, task slice M1 will not be scheduled to run.
  • If the processing resources for some task slice of task2, say R2, do not meet the preset condition, R2 keeps waiting; once the processing resources of its computing node meet the preset condition and its required data version matches the running data of the received task1 slice, say M2, it can run and process M2's running data.
  • When a task slice of task2 runs and fails to process the running data of a task1 slice, for example when R1 fails to read M1's running data, taskmaster2 may request taskmaster1 to schedule M1 to re-run.
  • Since each run of M1 can only generate data of a higher version, taskmaster2 can request the re-run by raising its required data version; taskmaster1 can trigger M1 to re-run when taskmaster2's required data version is higher than M1's running data version.
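Tying the example together, a toy simulation (all names and helpers are illustrative, and resource checks are omitted) of taskmaster1 notifying taskmaster2 as soon as M1 finishes, so that R slices whose required version matches can start without waiting for M2 or M3:

```python
task1_outputs = {}                             # running data versions held by taskmaster1
task2_required = {"R1": 0, "R2": 0}            # required data versions held by taskmaster2

def m_slice_finishes(slice_id, version=0):
    task1_outputs[slice_id] = version
    # taskmaster1 notifies taskmaster2; every R slice whose requirement matches
    # the new running data version can be scheduled right away.
    for r_slice, required in task2_required.items():
        if required == version:
            print(f"{r_slice} may start processing {slice_id}'s version-{version} data")

m_slice_finishes("M1")   # R slices can start as soon as M1 ends; M2 and M3 need not be finished
```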
  • the embodiment of the present application further provides a distributed system, where the distributed system includes a central node 101, a plurality of control nodes 102 connected to the central node 101, and a connection of each control node. a plurality of computing nodes 103;
  • the central node 101 is configured to allocate tasks of the job to each control node, and each control node schedules one of the tasks of the job.
  • the computing node 103 is configured to run a task slice allocated by a control node connected thereto;
  • A first control node among the control nodes 102 is configured to, when at least one task slice of the first task assigned to it finishes running, notify the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
  • a second control node among the control nodes 102 is configured to acquire the running data generated by the at least one task slice of the first task from the first control node, allocate it to each task slice of the second task, and schedule at least one task slice of the second task to run to process the running data.
  • the first control node and the second control node are different, and may be any two control nodes of the plurality of control nodes, which are allocated by the central node, and respectively schedule the first task and the second task.
  • the scheduling of the tasks in the job is realized by the control node, and the central node is not required to perform unified scheduling, and the central node is only responsible for the assignment of the tasks, thereby reducing the scheduling pressure of the central node and improving the scheduling efficiency.
  • During job scheduling, after at least one task slice of the first task in the job finishes running, the first control node notifies the second control node, and the second control node can schedule the task slices of the second task to run and process the running data.
  • There is no need to wait for all task slices of the first task to finish; the task slices of the second task can already be scheduled for data processing, making full use of cluster resources and avoiding waste.
  • The second control node scheduling each task slice of the second task to run and processing the running data may specifically be:
  • scheduling, in the second task, the task slices whose required data version matches the running data version to run and process the running data.
  • In the embodiment of the present application, the second control node may also request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run. Therefore, as a further embodiment, the second control node is further configured to:
  • when the required data version of the first task slice in the second task does not match the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the running data regenerated by the first control node, and schedule the first task slice to run when the regenerated running data version matches the required data version of the first task slice; otherwise request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice.
  • the first control node is further configured to:
  • when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run; and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
  • the success rate of the task slicing operation is further improved.
  • When the required data version of a task slice of the second task does not match the running data version, the task slice of the first task may be scheduled to re-run until a running data version matching the required data version is obtained, which further improves the data processing success rate.
  • This avoids the prior-art problem in which the task slices of the next task can only be scheduled after all task slices of one task have finished, so that once the running data of a task slice is wrong, the subsequent task fails because of the erroneous input data.
  • To improve the self-repair capability against task failures, the second control node is further configured to: when a second task slice in the second task fails to process the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the regenerated running data, and schedule the second task slice of the second task to run and process it;
  • the first control node is further configured to: when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run; and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
  • The second control node requests the first control node to re-run the task slice corresponding to the running data version in the first task specifically by raising the required data version.
  • That is, the second control node raises the required data version and carries the raised required data version in the request sent to the first control node.
  • The first control node triggers the task slice to re-run when the required data version in the request is higher than its current running data version.
  • By scheduling the task slice to re-run when data processing fails, resource utilization is improved and the job's self-repair capability against failures is enhanced.
  • the embodiment of the present application further provides a job scheduling apparatus.
  • As shown in FIG. 7, which is a schematic structural diagram of the job scheduling apparatus, the apparatus is specifically applied to a control node in the distributed system shown in FIG. 1.
  • the device can include:
  • the notification module 701 is configured to notify the control node that schedules the second task to acquire the running data generated by the at least one task slice running of the first task, at the end of the at least one task slice running of the scheduled first task; And the control node that schedules the second task allocates the acquired operation data to each task slice of the second task; schedules each task slice of the second task to run, and processes the operation data;
  • The first task is any running task in the job; the second task is any task in the job that depends on the first task.
  • FIG. 8 is a schematic structural diagram of another embodiment of a job scheduling apparatus according to an embodiment of the present disclosure, where the apparatus is specifically applied to a control node in the distributed system shown in FIG.
  • the device can include:
  • The obtaining module 801 is configured to acquire, upon receiving a notification from the control node that schedules the first task, the running data generated by at least one task slice of the first task; the notification is sent by the control node that schedules the first task after the at least one task slice of the first task finishes running; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
  • the distribution module 802 is configured to allocate, to the obtained operation data generated by the at least one task slice of the first task, each task slice of the second task;
  • the scheduling module 803 is configured to schedule at least one task slice operation of the second task to process the running data.
  • the scheduling module 803 is specifically configured to schedule, in the second task, a task slice running that matches a version of the required data with the running data version, and processes the running data.
  • the apparatus may further include:
  • the first requesting module 901 is configured to: when the required data version of the first task slice in the second task does not match the running data, request the control node that schedules the first task to schedule the first task The task slice corresponding to the running data version is re-run;
  • the obtaining module 801 is further configured to obtain a running data version regenerated by the task slice corresponding to the running data version in the first task.
  • the distribution module 802 is further configured to send the acquired regenerated running data version to the first task slice of the second task.
  • the scheduling module 803 is further configured to: when the regenerated running data version matches the required data version of the first task slice, schedule the first task slice to run; otherwise, request the first control node to schedule The task slice corresponding to the running data version in the first task is re-run until the regenerated running data version matches the required data version of the first task slice.
  • the apparatus may further include:
  • the second requesting module 1001 is configured to, when the second task slice in the second task fails to process the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run;
  • the obtaining module 801 is further configured to obtain a running data version regenerated by the task slice corresponding to the running data version in the first task.
  • the allocating module 802 is further configured to allocate the regenerated running data version to the second task slice of the second task.
  • the scheduling module 803 is further configured to schedule the second task slice of the second task to run and process the regenerated running data.
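A hedged sketch of how the modules in FIGS. 8-10 might be decomposed on the second control node; the class and method names, and the fetch/rerun calls on the first control node, are assumptions rather than the patent's implementation:

```python
class JobSchedulingApparatus:
    """Runs inside the second control node."""
    def acquire(self, first_node, slice_id):                   # acquiring module 801
        return first_node.fetch(slice_id)

    def allocate(self, data, task2_slices):                    # allocation module 802
        return {s: data for s in task2_slices}

    def schedule(self, assignments):                           # scheduling module 803
        for s, data in assignments.items():
            print(f"run {s} on its compute node with {data!r}")

    def request_rerun(self, first_node, slice_id, required_version):  # request modules 901/1001
        # Raise the required data version so the first control node re-runs the slice.
        return first_node.rerun(slice_id, required_version)
```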
  • If a first device is described as being coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to the second device through other devices or coupling means.
  • the description of the specification is intended to be illustrative of the preferred embodiments of the invention. The scope of protection of the application is subject to the definition of the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A job scheduling method, apparatus and distributed system. The distributed system includes at least a central node (101), a plurality of control nodes (102) connected to the central node (101), and a plurality of computing nodes (103) respectively connected to each control node (102); the central node (101) assigns the tasks of the job to the individual control nodes (102); each control node (102) schedules the task slices to run in the computing nodes (103) connected to it. The method includes: when at least one task slice of a first task finishes running, the first control node scheduling the first task notifies the second control node scheduling a second task to acquire the running data generated by the at least one task slice of the first task (201); the second control node acquires the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task (202); the second control node schedules at least one task slice of the second task to run and process the running data (203). The job scheduling method, apparatus and distributed system improve scheduling efficiency and resource utilization.

Description

Job scheduling method, apparatus and distributed system
This application claims priority to Chinese Patent Application No. 201510362989.5, filed on June 26, 2015 and entitled "Job scheduling method, apparatus and distributed system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the field of data processing technologies, and in particular relates to a job scheduling method, apparatus and distributed system.
Background
In a traditional distributed system, a distributed job includes a finite number of tasks with certain dependencies between them, and each task is further divided into multiple task slices (instances); when the job runs, data processing is performed by the task slices of each task.
In such a traditional distributed system, job scheduling is usually performed uniformly by the central node (master). For example, suppose a job has two tasks, task1 and task2, where task2's input is task1's output. Once the job starts running, the central node looks for a task without a predecessor, here task1, and runs it first; only after all task slices of task1 have finished does the central node schedule task2 to run.
With this scheduling approach, however, the central node processes and maintains a huge amount of data, which affects scheduling efficiency, and cluster resources are not fully utilized, resulting in wasted resources.
Summary of the Invention
In view of this, the technical problem to be solved by the present application is to provide a job scheduling method, apparatus and distributed system that improve scheduling efficiency and resource utilization.
To solve the above technical problem, the present application discloses a job scheduling method applied to a distributed system, the distributed system including at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node; the central node assigns the tasks of the job to the individual control nodes; each control node schedules the task slices of the task assigned to it to run in the computing nodes connected to it; the method includes:
when at least one task slice of a first task finishes running, the first control node that schedules the first task notifies the second control node that schedules a second task to acquire the running data generated by the at least one task slice of the first task; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
the second control node acquires the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task;
the second control node schedules at least one task slice of the second task to run and process the running data.
Preferably, the second control node scheduling at least one task slice of the second task to run and processing the running data includes:
scheduling, in the second task, the task slice whose required data version matches the running data version to run and process the running data.
Preferably, when the required data version of a first task slice in the second task does not match the running data, the method further includes:
the second control node requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run;
the first control node schedules the task slice corresponding to the running data version in the first task to re-run and, after the re-run ends, notifies the second control node to acquire the regenerated running data;
the second control node acquires the regenerated running data and, when the regenerated running data version matches the required data version of the first task slice, schedules the first task slice to run; otherwise it requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice.
Preferably, after the second control node schedules any task slice of the second task to run and process the running data, the method further includes:
when a second task slice in the second task fails to process the running data, the second control node requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run;
the first control node schedules the task slice corresponding to the running data version in the first task to re-run and, after that task slice finishes re-running, notifies the second control node to acquire the regenerated running data;
the second control node acquires the regenerated running data and schedules the second task slice of the second task to run and process the regenerated running data.
Preferably, the second control node requesting the first control node to schedule the task slice corresponding to the running data version in the first task to re-run includes:
the second control node raising the required data version to request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run.
A distributed system includes a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes connected to each control node;
the central node is configured to assign the tasks of a job to the individual control nodes;
the computing nodes are configured to run the task slices allocated by the control node connected to them;
a first control node among the control nodes is configured to, when at least one task slice of a first task assigned to it finishes running, notify the second control node that schedules a second task to acquire the running data generated by the at least one task slice of the first task; the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
the second control node is configured to acquire the running data generated by the at least one task slice of the first task, allocate it to each task slice of the second task, and schedule at least one task slice of the second task to run and process the running data.
Preferably, the second control node scheduling each task slice of the second task to run and processing the running data includes:
scheduling, in the second task, the task slice whose required data version matches the running data version to run and process the running data.
Preferably, the second control node is further configured to:
when the required data version of a first task slice in the second task does not match the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the running data version regenerated by the first control node and, when the regenerated running data version matches the required data version of the first task slice, schedule the first task slice to run; otherwise request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice;
the first control node is further configured to, when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run, and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
Preferably, the second control node is further configured to: when a second task slice in the second task fails to process the running data, request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run; acquire the regenerated running data version, and schedule the task slice of the second task to run and process the re-run data version;
the first control node is further configured to, when receiving the request of the second control node, schedule the task slice corresponding to the running data version in the first task to re-run, and after that task slice finishes re-running, notify the second control node to acquire the regenerated running data.
Preferably, the second control node requesting the first control node to schedule the task slice corresponding to the running data version in the first task to re-run is specifically:
raising the required data version to request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run.
A job scheduling apparatus is applied in a control node of a distributed system, the distributed system including at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node; the control node acquires a task of the job assigned by the central node and schedules the task slices of the task to run in the computing nodes connected to it; the apparatus includes:
a notification module, configured to notify, when at least one task slice of a first task finishes running, the control node that schedules a second task to acquire the running data generated by the at least one task slice of the first task, so that the control node that schedules the second task allocates the acquired running data to each task slice of the second task, schedules each task slice of the second task to run, and processes the running data;
where the first task is any running task in the job and the second task is any task in the job that depends on the first task.
A job scheduling apparatus is applied in a control node of a distributed system, the distributed system including at least a central node, a plurality of control nodes connected to the central node, and a plurality of computing nodes respectively connected to each control node; the control node acquires a task of the job assigned by the central node and schedules the task slices of the task to run in the computing nodes connected to it; the apparatus includes:
an acquiring module, configured to acquire, upon receiving a notification from the control node that schedules a first task, the running data generated by at least one task slice of the first task; the notification is sent by the control node that schedules the first task after the at least one task slice of the first task finishes running; the first task is any running task in the job and the second task is any task in the job that depends on the first task;
an allocation module, configured to allocate the acquired running data generated by the at least one task slice of the first task to each task slice of the second task;
a scheduling module, configured to schedule at least one task slice of the second task to run and process the running data.
Compared with the prior art, the present application can obtain the following technical effects:
The distributed system provided by the present application is composed of a central node, control nodes and computing nodes; the central node performs task assignment and the control nodes are responsible for task scheduling, which reduces the scheduling pressure on the central node and improves scheduling efficiency. During job scheduling, after at least one task slice of a first task in the job finishes running, at least one task slice of a second task can already be scheduled to run and process the running data, without waiting for all task slices of the first task to finish; this makes full use of cluster resources, improves resource utilization and task concurrency, and reduces job running time.
Of course, any product implementing the present application does not necessarily need to achieve all of the above technical effects at the same time.
Brief Description of the Drawings
The drawings described here are provided for further understanding of the present application and constitute a part of this application; the illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an undue limitation on it. In the drawings:
FIG. 1 is a schematic structural diagram of a distributed system according to an embodiment of the present application;
FIG. 2 is a flowchart of an embodiment of a job scheduling method according to an embodiment of the present application;
FIG. 3 is a flowchart of another embodiment of a job scheduling method according to an embodiment of the present application;
FIG. 4 is a flowchart of still another embodiment of a job scheduling method according to an embodiment of the present application;
FIG. 5 is a flowchart of still another embodiment of a job scheduling method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of task dependencies in a practical application according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a job scheduling apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another embodiment of a job scheduling apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of still another embodiment of a job scheduling apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of still another embodiment of a job scheduling apparatus according to an embodiment of the present application.
Detailed Description of Embodiments
The implementation of the present application is described in detail below with reference to the accompanying drawings and embodiments, so that how the present application applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface and memory.
The memory may include volatile memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash RAM. Memory is an example of a computer readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
In the prior art, job scheduling is performed uniformly by the central node: the central node must wait for all task slices of one task in the job to finish before scheduling the task slices of another task that depends on the previous task. This centralized control means the central node holds a huge amount of data information; in particular, once the job scale grows beyond a certain point, the central node's information volume explodes, severely affecting scheduling efficiency. Moreover, when the tasks of the job run, all task slices of the preceding task in the dependency must finish before the task slices of the next task can be scheduled, which may leave cluster resources idle and therefore wasted.
To solve this technical problem, an embodiment of the present application provides a distributed system, as shown in FIG. 1.
The distributed system is composed of a central node 101, a plurality of control nodes 102 connected to the central node, and a plurality of computing nodes 103 connected to each control node 102.
The central node 101 assigns the tasks of the job to the individual control nodes.
Each control node 102 schedules the task slices of the task assigned to it to run in the computing nodes connected to it.
Each computing node 103 runs the task slices allocated by the control node connected to it.
When the number of tasks in the job is greater than the number of control nodes, the central node may first allocate a portion of the tasks to the control nodes and allocate the remaining unallocated tasks once control nodes with idle processing resources become available.
Each control node controls the running of one task of the job.
Through the embodiments of the present application, the scheduling of the job's tasks is implemented by the control nodes, without requiring unified scheduling by the central node, which reduces the scheduling pressure on the central node and thus improves scheduling efficiency. Moreover, through the control nodes' scheduling, after at least one task slice of the first task in the job finishes running, the task slices of the second task can already be scheduled to run and process the running data, without waiting for all task slices of the first task to finish; this makes full use of cluster resources and avoids waste. In the distributed system of the embodiments, through interaction between control nodes, as soon as any task slice of any task finishes running, a task slice of the next task that depends on it can be scheduled to start running instead of remaining in a waiting state, which improves resource utilization and task concurrency and reduces job running time.
The technical solution of the present application is described in detail below with reference to the accompanying drawings.
FIG. 2 is a flowchart of an embodiment of a job scheduling method provided by an embodiment of the present application; the technical solution is specifically applied in the distributed system shown in FIG. 1, and the method may include the following steps:
201: when at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task.
The first task is any running task in the job; the second task is any task in the job that depends on the first task.
The first control node is the control node that schedules the first task, and the second control node is the control node that schedules the second task. Which control node the first task and the second task are each assigned to for scheduling is determined in advance by the central node.
In the embodiments of the present application, a task slice running specifically means that the control node controls the task slice to run in a computing node; for convenience of description, the embodiments simply describe this as the task slice running.
The present application is applicable to scenarios in which the tasks of a job have dependencies, for example a DAG (Directed Acyclic Graph) model, in which the output data of the previous task is the input data of the next task.
The first task can be any running task in the job.
There may be multiple tasks that depend on the first task; the second task may be any one of them.
It should be noted that "first" in "first task" and "second" in "second task" do not indicate an order; they are merely used to distinguish different tasks in the description.
Whenever a task slice of the first task finishes running and generates running data, the first control node notifies the second control node to acquire that running data.
202: the second control node acquires the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task.
After receiving the notification of the first control node, the second control node acquires, from the first control node, the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task.
203: the second control node schedules at least one task slice of the second task to run and process the running data.
After at least one task slice of the first task finishes running, the second control node acquires the running data it generated and allocates it to the task slices of the second task, so the task slices of the second task can be scheduled to run and process the running data.
In this embodiment of the present application, after at least one task slice of the first task in the job finishes running, the first control node notifies the second control node, and the second control node can schedule the task slices of the second task to run and process the running data, without waiting for all task slices of the first task to finish; this makes full use of cluster resources and avoids waste. Moreover, the scheduling process is controlled by the control nodes; the central node does not perform unified scheduling and is only responsible for task assignment, which relieves the central node's scheduling pressure and improves scheduling efficiency.
A task slice may fail to run when the computing node corresponding to it has insufficient processing resources or other conditions that affect normal operation arise.
Therefore, the second control node scheduling at least one task slice of the second task to run and process the running data may be: scheduling the task slices of the second task whose processing resources meet a preset condition to process the running data.
Task slices whose processing resources do not meet the preset condition can wait until their processing resources meet the condition before being scheduled to run.
In addition, when the running data version of a task slice of the first task does not match the required data version of a task slice of the second task, that task slice of the second task cannot run either.
The data version corresponds to the number of times the task slice has run: the data obtained at the end of the first run has version 0; on a re-run, the data obtained at the end of the second run has version 1, the third run produces version 2, and so on.
As another embodiment, as shown in FIG. 3, which is a flowchart of another embodiment of a job scheduling method provided by an embodiment of the present application, the method may include the following steps:
301: when at least one task slice of the first task finishes running, the first control node that schedules the first task notifies the second control node that schedules the second task to acquire the running data generated by the at least one task slice of the first task.
The first task is any running task in the job; the second task is any task in the job that depends on the first task.
302: the second control node acquires the running data generated by the at least one task slice of the first task and allocates it to each task slice of the second task.
303: the second control node schedules, in the second task, the task slice whose required data version matches the running data version to run and process the running data.
That is, only when the running data version of the at least one task slice of the first task matches the required data version of a task slice of the second task can the second control node schedule that task slice of the second task to run and process the running data. Since insufficient processing resources also affect whether a task slice can run, specifically the second control node schedules, in the second task, the task slices whose processing resources meet the preset condition and whose required data version matches the running data version, to run and process the running data.
In addition, if the running data version of the at least one task slice of the first task does not match the required data version of some task slice of the second task, call it the first task slice, then in this embodiment the second control node may also request the first control node to schedule the task slice corresponding to the running data version in the first task to re-run.
After receiving the request of the second control node, the first control node schedules the task slice corresponding to the running data version in the first task to re-run and, after the re-run ends, notifies the second control node to acquire the regenerated running data;
After receiving the notification of the first control node, the second control node acquires the regenerated running data and, when the regenerated running data version matches the required data version of the first task slice, schedules the first task slice of the second task to run; otherwise it requests the first control node to schedule the task slice corresponding to the running data version in the first task to re-run, until the regenerated running data version matches the required data version of the first task slice.
因此,作为又一个实施例,如图4所示,在本申请实施例提供的一种作业调度方法又一个实施例的中,该方法可以包括以下几个步骤:
401:调度第一任务的第一控制节点在所述第一任务的至少一个任务切片运行结束时,通知调度第二任务的第二控制节点获取所述第一任务的至少一个任务切片生成的运行数据。
其中,所述第一任务为所述作业中任一运行的任务;所述第二任务为所述作业中依赖所述第一任务的任一任务。
所述第一任务的第一任务切片是指第一任务中的任一个任务切片、
402:第二控制节点获取所述第一任务的至少一个任务切片生成的运行数据,并分配给所述第二任务的每一个任务切片。
403:第二控制节点判断所述第二任务的第一任务切片的需求数据版本是否与所述第一任务的第一任务切片的运行数据版本匹配;如果是,执行步骤409,如果否,执行步骤404。
其中,第二任务的第一任务切片可以是指第二任务中的任一个任务切片。
第一任务的第一任务切片可以是指第一任务中已经运行结束的任一个任务切片。
404:第二控制节点请求所述第一控制节点调度所述第一任务的第一运任务切片重新运行。
其中,该请求中可以携带第二控制节点的需求数据版本。
405:第一控制节点调度所述第一任务的第一任务切片重新运行。
第一控制节点接收到第二控制节点的请求之后,即调度所述第一任务的第一任务切片重新运行。
406:第一控制节点在所述第一任务的第一任务切片重新运行结束之后,通知所述第二控制节点获取所述重新生成的运行数据。
407:第二控制节点获取所述重新生成的运行数据。
第二控制节点接收到第一控制节点的通知之后,即获取所述述第一任务的第一任务切片重新生成的运行数据。
408:第二控制节点判断所述重新生产的运行数据版本与所述第二任务的第一任务切片的需求数据版本是否匹配,如果是,执行步骤409,如果否,返回步骤404继续执行。
409: The second control node schedules the first task slice of the second task to run and process the run data regenerated by the first task slice of the first task.
If the required data version of the first task slice of the second task is inconsistent with the run data version of the first task slice of the first task, the first task slice of the first task may be scheduled to keep re-running until the required data version of the first task slice of the second task matches the run data version of the first task slice of the first task.
When scheduling the first task slice of the first task to re-run, if the required data version of the first task slice of the second task is lower than the run data version of the first task slice of the first task, the first task can no longer generate data of a lower version, and the first task slice of the first task therefore cannot be scheduled to run.
Therefore, as yet another embodiment, step 405 may specifically be that the first control node schedules the first task slice of the first task to re-run when the required data version of the first task slice of the second task is higher than the current run data version of the first task slice of the first task.
In the embodiments of the present application, checking the required data version against the run data version improves the success rate of task slice runs. Moreover, when the required data version of a task slice of the second task does not match the run data version, the task slice of the first task can be scheduled to re-run until a run data version matching the required data version is obtained, which improves the data processing success rate. This avoids the problem in the prior art where the task slices of the next task can be scheduled only after all task slices of a task have finished running, so that once the run data of a task slice is erroneous, the successor task fails because of erroneous input data once it starts running.
Each control node may keep a data version table for each task slice and store the run data versions it acquires in that table; the task slice is scheduled to run only when the run data version matches the required data version in the data version table.
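One possible reading of such a per-slice data version table is sketched below; the class name, field names, and the way versions are recorded are assumptions for illustration only.

```python
# Hypothetical per-slice data version table kept by a control node.
class VersionTable:
    def __init__(self, required_version=0):
        self.required_version = required_version
        self.received = {}                        # upstream slice name -> run data version

    def record(self, upstream_slice, run_version):
        self.received[upstream_slice] = run_version

    def runnable(self, upstream_slice):
        # The slice is scheduled only when the recorded run data version
        # matches the required data version in the table.
        return self.received.get(upstream_slice) == self.required_version

table = VersionTable(required_version=0)
table.record("task1-slice-0", 0)
print(table.runnable("task1-slice-0"))            # True: versions match, can schedule
```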
When the task slices of the second task are scheduled to run, any task slice of the second task may fail while processing the run data, for example by failing to read the run data. To ensure that the job runs normally and to improve the ability to self-repair task failures, when any task slice of the second task fails to process the run data, the task slice of the first task corresponding to the run data version may also be scheduled to re-run.
FIG. 5 is a flowchart of an embodiment of a job scheduling method provided by an embodiment of the present application; the method may include the following steps:
501: When at least one task slice of a first task finishes running, a first control node that schedules the first task notifies a second control node that schedules a second task to acquire the run data generated by a second task slice of the first task.
Here, the first task is any running task in the job, and the second task is any task in the job that depends on the first task.
The second task slice of the first task may be any task slice of the first task.
502: The second control node acquires the run data generated by the at least one task slice of the first task and assigns it to each task slice of the second task.
503: The second control node schedules at least one task slice of the second task to run and process the run data.
The at least one task slice may be those task slices of the second task whose required data version matches the run data version and whose processing resources satisfy the preset condition.
504: The second control node determines whether the second task contains a second task slice that has failed to process the run data generated by the second task slice of the first task; if so, step 505 is performed; if not, the flow ends.
The second task slice of the second task may refer to any task slice of the second task that has failed to process the run data.
The second task slice of the first task may be any task slice of the first task that has already finished running.
505: The second control node requests the first control node to schedule the second task slice of the first task to re-run.
506: The first control node schedules the second task slice of the first task to re-run.
507: After the second task slice of the first task finishes re-running, the first control node notifies the second control node to acquire the regenerated run data;
508: The second control node acquires the regenerated run data and schedules the second task slice of the second task to run.
Each time a task slice re-runs, it can only generate run data of a version higher than the previous one, and the fact that a task slice can run indicates that its current required data version matches the run data version.
Therefore, in order to have the second task slice of the first task scheduled to re-run, the second control node may raise the required data version so as to trigger the re-run of the second task slice of the first task.
That is, the second control node requests the first control node to schedule the task slice of the first task corresponding to the run data version to re-run specifically by raising the required data version and then making that request to the first control node.
In other words, the second control node raises the required data version and carries the raised required data version in the request sent to the first control node, so that the first control node triggers the re-run of the task slice when the required data version in the request is higher than its current run data version.
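The trigger mechanism just described might be sketched as follows; the two classes model only the version comparison on each side of the request, and all names as well as the in-process call are assumptions of the sketch.

```python
# Hypothetical sketch of triggering a re-run by raising the required data version.
class FirstNodeSide:
    def __init__(self):
        self.current_run_version = 0

    def handle_rerun_request(self, required_version):
        # The first control node re-runs the slice only when the required data
        # version carried in the request is higher than the current run version.
        if required_version > self.current_run_version:
            self.current_run_version += 1         # the re-run produces a higher version
            return self.current_run_version
        return None                               # otherwise the request is not acted on

class SecondNodeSide:
    def __init__(self, first_node):
        self.first_node = first_node
        self.required_version = 0

    def request_rerun(self):
        self.required_version += 1                # raise the required data version
        return self.first_node.handle_rerun_request(self.required_version)

second_side = SecondNodeSide(FirstNodeSide())
print(second_side.request_rerun())                # 1: the re-run was triggered
```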
In the embodiments of the present application, when data processing fails, the task slice can be scheduled to re-run, which improves resource utilization and also improves the job's ability to self-repair after failures.
The technical solutions of the present application are described in detail below with reference to a practical example.
Assume a job includes two tasks with a dependency relationship, task1 and task2, where the input of task2 is the output of task1. As shown in the diagram of FIG. 6, assume task1 includes three task slices M1, M2 and M3, and task2 includes two task slices R1 and R2.
The central node assigns task1 to a first control node, taskmaster1, for scheduling, and assigns task2 to a second control node, taskmaster2, for scheduling.
taskmaster1 and taskmaster2 may be any two of the control nodes.
taskmaster1 can provide the run data of each task slice of task1 to taskmaster2.
taskmaster2 assigns the run data to the task slices on each computing node; taskmaster2 may also request from taskmaster1 the data of the version required by each task slice.
Assume the input data of task1 is ready and task1 can run normally. The central node triggers taskmaster1 to schedule task1 to run.
taskmaster2 may maintain a data version list for each task slice of task2, which includes the run data version and the required data version; a task slice of task2 can be scheduled to run only when the run data version and the required data version are consistent and the processing resources satisfy the preset condition. The initial required data version of every task slice is the same.
After any task slice of task1 finishes running, taskmaster1 notifies taskmaster2 to acquire the generated run data.
taskmaster2 assigns the acquired run data to each task slice of the second task.
If the required data version of a certain task slice of task2, say R1, does not match the run data version of a certain task slice of task1, say M1, then R1 of task2 stays in a state of waiting to process the run data of M1.
Meanwhile, taskmaster2 may request from taskmaster1 the data of the version it requires, and taskmaster1 may schedule the task slice M1 of task1 to re-run when the data version requested by taskmaster2 is higher than that run data version.
After the re-run finishes, taskmaster1 may notify taskmaster2 to acquire the data regenerated by the re-run of M1; after receiving the notification from taskmaster1, taskmaster2 acquires the data regenerated by the re-run of M1 and assigns it to R1.
When the required data version of R1 matches the run data version of M1, R1 starts to run and processes the run data of M1.
If the required data version is lower than the version of the run data, the task slice M1 will not be scheduled to run.
If the processing resources of a certain task slice of task2, say R2, do not satisfy the preset condition, R2 keeps waiting; once the processing resources of its computing node satisfy the preset condition and its required data version matches the run data of the received task slice of task1, say M2, it can run and process the run data of M2.
While a task slice of task2 runs and processes the run data of a task slice of task1, reading the run data may fail and cause the data processing to fail. Suppose R1 fails to read the run data of M1; taskmaster2 may then request taskmaster1 to schedule M1 to re-run.
Since each run of M1 can only generate data of a higher version, the re-run of M1 can be requested by raising the required data version: taskmaster1 triggers the re-run of M1 when the required data version of taskmaster2 is higher than its run data version.
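A compact, purely illustrative simulation of this example is given below. The Slice class, its version bookkeeping, and the simulated read failure are assumptions made for the sketch and do not reflect the actual implementation of taskmaster1 and taskmaster2.

```python
# Illustrative simulation of the M1/R1 interaction described above.
class Slice:
    def __init__(self, name):
        self.name, self.version = name, -1

    def run(self):
        self.version += 1                 # each run produces the next data version
        return {"from": self.name, "version": self.version}

def simulate():
    m1 = Slice("M1")
    r1_required = 0
    data = m1.run()                       # M1 finishes; taskmaster1 notifies taskmaster2
    read_failed = True                    # assume R1 fails to read M1's run data once
    if read_failed:
        r1_required += 1                  # taskmaster2 raises R1's required data version
        data = m1.run()                   # taskmaster1 re-runs M1, producing version 1
    if data["version"] == r1_required:    # versions match, so R1 can run
        print("R1 processes", data)

simulate()
```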
As shown in FIG. 1, an embodiment of the present application further provides a distributed system, which includes a central node 101, multiple control nodes 102 connected to the central node 101, and multiple computing nodes 103 connected to each control node;
The central node 101 is configured to assign the tasks of the job to the control nodes, each control node scheduling one of the tasks of the job.
The computing nodes 103 are configured to run the task slices assigned by the control node connected to them;
A first control node among the control nodes 102 is configured to, when at least one task slice of a first task assigned to it finishes running, notify a second control node that schedules a second task to acquire the run data generated by the running of the at least one task slice of the first task; here, the first task is any running task in the job, and the second task is any task in the job that depends on the first task;
A second control node among the control nodes 102 is configured to acquire, from the first control node, the run data generated by the running of the at least one task slice of the first task, assign it to each task slice of the second task, and schedule at least one task slice of the second task to run and process the run data.
The first control node and the second control node are different and may be any two of the multiple control nodes; they are assigned by the central node to schedule the first task and the second task, respectively.
With the distributed system of the embodiments of the present application, the tasks of a job are scheduled by the control nodes, so that no unified scheduling by the central node is required; the central node is only responsible for assigning the tasks, which relieves the scheduling pressure on the central node and can improve scheduling efficiency. During job scheduling, after at least one task slice of the first task in the job finishes running, the first control node notifies the second control node, which can then schedule a task slice of the second task to run and process the run data. There is no need to wait for all task slices of the first task to finish running before the task slices of the second task can be scheduled for data processing, which makes full use of cluster resources and avoids wasting them.
As yet another embodiment, the second control node scheduling each task slice of the second task to run and process the run data may specifically be:
scheduling those task slices of the second task whose required data version matches the version of the run data to run and process the run data.
If the run data version of the at least one task slice of the first task does not match the required data version of any task slice of the second task, then in the embodiments of the present application the second control node may further request the first control node to schedule the task slice of the first task corresponding to that run data version to re-run. Therefore, as yet another embodiment, the second control node is further configured to:
when the required data version of a first task slice of the second task does not match the run data, request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run; acquire the run data regenerated by the first control node; when the version of the regenerated run data matches the required data version of the first task slice, schedule the first task slice to run; otherwise request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run, until the version of the regenerated run data matches the required data version of the first task slice.
The first control node is further configured to:
upon receiving the request from the second control node, schedule the task slice of the first task corresponding to the run data version to re-run; and after the task slice of the first task corresponding to the run data version finishes re-running, notify the second control node to acquire the regenerated run data.
Checking the required data version against the run data version further improves the success rate of task slice runs. Moreover, when the required data version of a task slice of the second task does not match the run data version, the task slice of the first task can be scheduled to re-run until a run data version matching the required data version is obtained, further improving the data processing success rate. This avoids the problem in the prior art where the task slices of the next task can be scheduled only after all task slices of a task have finished running, so that once the run data of a task slice is erroneous, the successor task fails because of erroneous input data once it starts running.
When the task slices of the second task are scheduled to run, any task slice of the second task may fail while processing the run data, for example by failing to read the run data. Therefore, as yet another embodiment, the second control node is further configured to:
when a second task slice of the second task fails to process the run data, request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run; acquire the regenerated run data, and schedule the task slice of the second task to run and process the regenerated run data;
The first control node is further configured to:
upon receiving the request from the second control node, schedule the task slice of the first task corresponding to the run data version to re-run; and after the task slice of the first task corresponding to the run data version finishes re-running, notify the second control node to acquire the regenerated run data.
In addition, the second control node may request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run by raising the required data version and then making the request to the first control node.
In other words, the second control node raises the required data version and carries the raised required data version in the request sent to the first control node, so that the first control node triggers the re-run of the task slice when the required data version in the request is higher than its current run data version.
When data processing fails, scheduling the task slice to re-run improves resource utilization and also improves the job's ability to self-repair after failures.
An embodiment of the present application further provides a job scheduling apparatus. FIG. 7 is a schematic structural diagram of the job scheduling apparatus, which is specifically applied to a control node of the distributed system shown in FIG. 1.
The apparatus may include:
a notification module 701, configured to, when at least one task slice of a scheduled first task finishes running, notify a control node that schedules a second task to acquire the run data generated by the running of the at least one task slice of the first task, so that the control node that schedules the second task assigns the acquired run data to each task slice of the second task and schedules each task slice of the second task to run and process the run data;
where the first task is any running task in the job, and the second task is any task in the job that depends on the first task.
FIG. 8 shows a schematic structural diagram of another embodiment of a job scheduling apparatus provided by an embodiment of the present application; the apparatus is specifically applied to a control node of the distributed system shown in FIG. 1.
The apparatus may include:
an acquisition module 801, configured to, upon receiving a notification from a control node that schedules a first task, acquire the run data generated by the running of at least one task slice of the first task, where the notification is sent by the control node that schedules the first task after the at least one task slice of the first task finishes running; the first task is any running task in the job; the second task is any task in the job that depends on the first task;
an assignment module 802, configured to assign the acquired run data generated by the at least one task slice of the first task to each task slice of the second task;
a scheduling module 803, configured to schedule at least one task slice of the second task to run and process the run data.
The scheduling module 803 is specifically configured to schedule those task slices of the second task whose required data version matches the run data version to run and process the run data.
As shown in FIG. 9, as yet another embodiment, the job scheduling apparatus shown in FIG. 8 may further include:
a first request module 901, configured to, when the required data version of a first task slice of the second task does not match the run data, request the control node that schedules the first task to schedule the task slice of the first task corresponding to the run data version to re-run;
the acquisition module 801 is then further configured to acquire the run data regenerated by the task slice of the first task corresponding to the run data version;
the assignment module 802 is then further configured to send the acquired regenerated run data to the first task slice of the second task;
the scheduling module 803 is then further configured to schedule the first task slice to run when the version of the regenerated run data matches the required data version of the first task slice, and otherwise to request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run, until the version of the regenerated run data matches the required data version of the first task slice.
As shown in FIG. 10, as yet another embodiment, the job scheduling apparatus shown in FIG. 8 may further include:
a second request module 1001, configured to, when a second task slice of the second task fails to process the run data, request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run;
the acquisition module 801 is further configured to acquire the run data regenerated by the task slice of the first task corresponding to the run data version;
the assignment module 802 is further configured to assign the regenerated run data to the second task slice of the second task;
the scheduling module 803 is further configured to schedule the second task slice of the second task to run and process the regenerated run data.
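For illustration only, one possible way the acquisition, assignment and scheduling modules of FIGS. 8-10 could be composed inside a single control node is sketched below; the class names, the notification format and the helper _Slice class are assumptions of the sketch, not this application's implementation.

```python
# Hypothetical composition of modules 801-803 inside one control node.
class _Slice:
    def __init__(self, name):
        self.name, self.inputs = name, []

    def run(self):
        print(self.name, "processing", self.inputs)

class AcquisitionModule:                  # cf. acquisition module 801
    def acquire(self, notification):
        return notification["run_data"]

class AssignmentModule:                   # cf. assignment module 802
    def assign(self, run_data, slices):
        for s in slices:
            s.inputs.append(run_data)

class SchedulingModule:                   # cf. scheduling module 803
    def schedule(self, slices):
        for s in slices:
            if s.inputs:                  # version/resource checks omitted in this sketch
                s.run()

slices = [_Slice("R1"), _Slice("R2")]
notification = {"run_data": {"from": "M1", "version": 0}}
run_data = AcquisitionModule().acquire(notification)
AssignmentModule().assign(run_data, slices)
SchedulingModule().schedule(slices)
```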
Certain terms are used in the specification and claims to refer to particular components. Those skilled in the art will appreciate that hardware manufacturers may use different names to refer to the same component. The specification and claims do not distinguish components by differences in name but by differences in function. "Comprising", as used throughout the specification and claims, is an open-ended term and should be construed as "including but not limited to". "Approximately" means within an acceptable error range in which a person skilled in the art can solve the technical problem and substantially achieve the technical effect. In addition, the term "coupled" here includes any direct or indirect means of electrical coupling; therefore, if a first device is described as being coupled to a second device, the first device may be directly electrically coupled to the second device, or indirectly electrically coupled to the second device through other devices or coupling means. The subsequent description of the specification sets out preferred modes for carrying out the present application; the description is intended to illustrate the general principles of the present application and is not intended to limit its scope. The scope of protection of the present application shall be as defined by the appended claims.
It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a product or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a product or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the product or system that includes that element.
The above description shows and describes several preferred embodiments of the present application. However, as stated above, it should be understood that the present application is not limited to the forms disclosed herein, should not be regarded as excluding other embodiments, and may be used in various other combinations, modifications and environments, and may be modified within the scope of the inventive concept described herein by the above teachings or by the skill or knowledge of the related art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present application shall all fall within the scope of protection of the claims appended to the present application.

Claims (12)

  1. A job scheduling method, applied to a distributed system, the distributed system comprising at least a central node, multiple control nodes connected to the central node, and multiple computing nodes respectively connected to each control node; the central node assigning the tasks of a job to the control nodes; each control node scheduling the task slices of the task assigned to it to run on the computing nodes connected to it, the method comprising:
    when at least one task slice of a first task finishes running, notifying, by a first control node that schedules the first task, a second control node that schedules a second task to acquire run data generated by the at least one task slice of the first task, wherein the first task is any running task in the job and the second task is any task in the job that depends on the first task;
    acquiring, by the second control node, the run data generated by the at least one task slice of the first task, and assigning the run data to each task slice of the second task; and
    scheduling, by the second control node, at least one task slice of the second task to run and process the run data.
  2. The method according to claim 1, wherein the scheduling, by the second control node, at least one task slice of the second task to run and process the run data comprises:
    scheduling those task slices of the second task whose required data version matches the version of the run data to run and process the run data.
  3. The method according to claim 1 or 2, wherein, when the required data version of a first task slice of the second task does not match the run data, the method further comprises:
    requesting, by the second control node, the first control node to schedule the task slice of the first task corresponding to the run data version to re-run;
    scheduling, by the first control node, the task slice of the first task corresponding to the run data version to re-run, and after the re-run finishes, notifying the second control node to acquire the regenerated run data; and
    acquiring, by the second control node, the regenerated run data, and scheduling the first task slice to run when the version of the regenerated run data matches the required data version of the first task slice; otherwise requesting the first control node to schedule the task slice of the first task corresponding to the run data version to re-run, until the version of the regenerated run data matches the required data version of the first task slice.
  4. The method according to claim 1 or 2, wherein, after the second control node schedules any task slice of the second task to run and process the run data, the method further comprises:
    requesting, by the second control node, when a second task slice of the second task fails to process the run data, the first control node to schedule the task slice of the first task corresponding to the run data version to re-run;
    scheduling, by the first control node, the task slice of the first task corresponding to the run data version to re-run, and after the task slice of the first task corresponding to the run data version finishes re-running, notifying the second control node to acquire regenerated run data; and
    acquiring, by the second control node, the regenerated run data, and scheduling the second task slice of the second task to run and process the regenerated run data.
  5. The method according to claim 4, wherein the requesting, by the second control node, the first control node to schedule the task slice of the first task corresponding to the run data version to re-run comprises:
    requesting, by the second control node, by raising the required data version, the first control node to schedule the task slice of the first task corresponding to the run data version to re-run.
  6. A distributed system, comprising a central node, multiple control nodes connected to the central node, and multiple computing nodes connected to each control node;
    the central node being configured to assign the tasks of a job to the control nodes;
    the computing nodes being configured to run the task slices assigned by the control node connected to them;
    a first control node among the control nodes being configured to, when at least one task slice of a first task assigned to it finishes running, notify a second control node that schedules a second task to acquire run data generated by the running of the at least one task slice of the first task, wherein the first task is any running task in the job and the second task is any task in the job that depends on the first task; and
    the second control node being configured to acquire the run data generated by the running of the at least one task slice of the first task, assign the run data to each task slice of the second task, and schedule at least one task slice of the second task to run and process the run data.
  7. The system according to claim 6, wherein the second control node scheduling each task slice of the second task to run and process the run data comprises:
    scheduling those task slices of the second task whose required data version matches the version of the run data to run and process the run data.
  8. The system according to claim 6 or 7, wherein the second control node is further configured to:
    when the required data version of a first task slice of the second task does not match the run data, request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run; acquire the version of the run data regenerated by the first control node, and schedule the first task slice to run when the version of the regenerated run data matches the required data version of the first task slice; otherwise request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run, until the version of the regenerated run data matches the required data version of the first task slice; and
    the first control node is further configured to, upon receiving the request from the second control node, schedule the task slice of the first task corresponding to the run data version to re-run, and after the task slice of the first task corresponding to the run data version finishes re-running, notify the second control node to acquire the regenerated run data.
  9. The system according to claim 6 or 7, wherein the second control node is further configured to: when a second task slice of the second task fails to process the run data, request the first control node to schedule the task slice of the first task corresponding to the run data version to re-run; acquire the regenerated run data, and schedule the task slice of the second task to run and process the regenerated run data; and
    the first control node is further configured to, upon receiving the request from the second control node, schedule the task slice of the first task corresponding to the run data version to re-run, and after the task slice of the first task corresponding to the run data version finishes re-running, notify the second control node to acquire the regenerated run data.
  10. The system according to claim 9, wherein the second control node requesting the first control node to schedule the task slice of the first task corresponding to the run data version to re-run is specifically:
    requesting, by raising the required data version, the first control node to schedule the task slice of the first task corresponding to the run data version to re-run.
  11. A job scheduling apparatus, applied to a control node of a distributed system, the distributed system comprising at least a central node, multiple control nodes connected to the central node, and multiple computing nodes respectively connected to each control node; the control node obtaining a task of a job assigned by the central node and scheduling the task slices of the task to run on the computing nodes connected to it, the apparatus comprising:
    a notification module, configured to, when at least one task slice of a first task finishes running, notify a control node that schedules a second task to acquire run data generated by the running of the at least one task slice of the first task, so that the control node that schedules the second task assigns the acquired run data to each task slice of the second task and schedules each task slice of the second task to run and process the run data;
    wherein the first task is any running task in the job and the second task is any task in the job that depends on the first task.
  12. A job scheduling apparatus, applied to a control node of a distributed system, the distributed system comprising at least a central node, multiple control nodes connected to the central node, and multiple computing nodes respectively connected to each control node; the control node obtaining a task of a job assigned by the central node and scheduling the task slices of the task to run on the computing nodes connected to it, the apparatus comprising:
    an acquisition module, configured to, upon receiving a notification from a control node that schedules a first task, acquire run data generated by the running of at least one task slice of the first task, wherein the notification is sent by the control node that schedules the first task after the at least one task slice of the first task finishes running; the first task is any running task in the job; the second task is any task in the job that depends on the first task;
    an assignment module, configured to assign the acquired run data generated by the at least one task slice of the first task to each task slice of the second task; and
    a scheduling module, configured to schedule at least one task slice of the second task to run and process the run data.
PCT/CN2016/086102 2015-06-26 2016-06-17 Job scheduling method, device and distributed system WO2016206564A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/852,786 US10521268B2 (en) 2015-06-26 2017-12-22 Job scheduling method, device, and distributed system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510362989.5 2015-06-26
CN201510362989.5A CN106293893B (zh) 2015-06-26 2015-06-26 Job scheduling method, device and distributed system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/852,786 Continuation US10521268B2 (en) 2015-06-26 2017-12-22 Job scheduling method, device, and distributed system

Publications (1)

Publication Number Publication Date
WO2016206564A1 true WO2016206564A1 (zh) 2016-12-29

Family

ID=57586566

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/086102 WO2016206564A1 (zh) 2015-06-26 2016-06-17 作业调度方法、装置及分布式系统

Country Status (3)

Country Link
US (1) US10521268B2 (zh)
CN (1) CN106293893B (zh)
WO (1) WO2016206564A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783634A (zh) * 2019-11-06 2021-05-11 长鑫存储技术有限公司 任务处理系统、方法及计算机可读存储介质

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107589985B (zh) * 2017-07-19 2020-04-24 山东大学 一种面向大数据平台的两阶段作业调度方法及系统
KR102477404B1 (ko) * 2017-08-31 2022-12-13 캠브리콘 테크놀로지스 코퍼레이션 리미티드 칩 장치 및 관련 제품
CN108614460B (zh) * 2018-06-20 2020-11-06 东莞市李群自动化技术有限公司 分布式多节点控制系统及方法
CN109815011A (zh) * 2018-12-29 2019-05-28 东软集团股份有限公司 一种数据处理的方法和装置
US20220276893A1 (en) * 2019-08-27 2022-09-01 Microsoft Technology Licensing, Llc Workflow-based scheduling and batching in multi-tenant distributed systems
CN110597608B (zh) * 2019-09-12 2023-08-22 创新先进技术有限公司 任务处理方法和装置、分布式系统以及存储介质
CN113239028B (zh) * 2021-05-10 2023-03-14 成都新潮传媒集团有限公司 一种数据仓库调度的数据修复方法、装置和可读存储介质
CN113268337B (zh) * 2021-07-20 2021-10-22 杭州朗澈科技有限公司 Kubernetes集群中Pod调度的方法和系统
CN113535405A (zh) * 2021-07-30 2021-10-22 上海壁仞智能科技有限公司 云端服务系统及其操作方法
CN114064609A (zh) * 2021-11-12 2022-02-18 中交智运有限公司 一种数据仓库任务调度方法、装置、系统及存储介质
CN114090266B (zh) * 2021-12-01 2022-12-09 中科三清科技有限公司 空气质量预报生成方法和装置


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102004670B (zh) * 2009-12-17 2012-12-05 华中科技大学 一种基于MapReduce的自适应作业调度方法
US9552206B2 (en) * 2010-11-18 2017-01-24 Texas Instruments Incorporated Integrated circuit with control node circuitry and processing circuitry
CN102111337B (zh) * 2011-03-14 2013-05-15 浪潮(北京)电子信息产业有限公司 任务调度方法和系统
US9529596B2 (en) * 2011-07-01 2016-12-27 Intel Corporation Method and apparatus for scheduling instructions in a multi-strand out of order processor with instruction synchronization bits and scoreboard bits
CN202565304U (zh) * 2012-05-21 2012-11-28 成都因纳伟盛科技股份有限公司 分布式计算任务调度及执行系统
CN103810029A (zh) * 2014-02-08 2014-05-21 南开大学 一种基于虚拟机出租通用计算能力的系统及方法
US10061577B2 (en) * 2014-10-14 2018-08-28 Electric Cloud, Inc. System and method for optimizing job scheduling within program builds

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060112388A1 (en) * 2004-11-22 2006-05-25 Masaaki Taniguchi Method for dynamic scheduling in a distributed environment
US8015564B1 (en) * 2005-04-27 2011-09-06 Hewlett-Packard Development Company, L.P. Method of dispatching tasks in multi-processor computing environment with dispatching rules and monitoring of system status
CN102387173A (zh) * 2010-09-01 2012-03-21 中国移动通信集团公司 一种MapReduce系统及其调度任务的方法和装置
CN102541640A (zh) * 2011-12-28 2012-07-04 厦门市美亚柏科信息股份有限公司 一种集群gpu资源调度系统和方法
CN102567312A (zh) * 2011-12-30 2012-07-11 北京理工大学 一种基于分布式并行计算框架的机器翻译方法
CN103279385A (zh) * 2013-06-01 2013-09-04 北京华胜天成科技股份有限公司 一种云计算环境中集群任务调度方法及系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783634A (zh) * 2019-11-06 2021-05-11 长鑫存储技术有限公司 任务处理系统、方法及计算机可读存储介质
CN112783634B (zh) * 2019-11-06 2022-04-26 长鑫存储技术有限公司 任务处理系统、方法及计算机可读存储介质

Also Published As

Publication number Publication date
CN106293893A (zh) 2017-01-04
CN106293893B (zh) 2019-12-06
US20180121240A1 (en) 2018-05-03
US10521268B2 (en) 2019-12-31

Similar Documents

Publication Publication Date Title
WO2016206564A1 (zh) Job scheduling method, device and distributed system
CN107580023B (zh) 一种动态调整任务分配的流处理作业调度方法及系统
CN107491351B (zh) 一种基于优先级的资源分配方法、装置和设备
EP3567829B1 (en) Resource management method and apparatus
US11275622B2 (en) Utilizing accelerators to accelerate data analytic workloads in disaggregated systems
US8943353B2 (en) Assigning nodes to jobs based on reliability factors
CN107515786B (zh) 资源分配方法、主装置、从装置和分布式计算系统
WO2015196931A1 (zh) 基于磁盘io的虚拟资源分配方法及装置
US9092272B2 (en) Preparing parallel tasks to use a synchronization register
US9189381B2 (en) Managing CPU resources for high availability micro-partitions
WO2018108001A1 (en) System and method to handle events using historical data in serverless systems
US10860385B2 (en) Method and system for allocating and migrating workloads across an information technology environment based on persistent memory availability
CN110597614B (zh) 一种资源调整方法及装置
US9244825B2 (en) Managing CPU resources for high availability micro-partitions
US9158470B2 (en) Managing CPU resources for high availability micro-partitions
US20150172095A1 (en) Monitoring file system operations between a client computer and a file server
WO2022111466A1 (zh) 任务调度方法、控制方法、电子设备、计算机可读介质
CN115858667A (zh) 用于同步数据的方法、装置、设备和存储介质
Sumalatha et al. CLBC-Cost effective load balanced resource allocation for partitioned cloud system
US20240061698A1 (en) Managing the assignment of virtual machines to non-uniform memory access nodes
CN115080199A (zh) 任务调度方法、系统、设备、存储介质及程序产品
JP2015170270A (ja) 情報処理装置、及び、そのリソースアクセス方法、並びに、リソースアクセスプログラム
JP2009211604A (ja) 情報処理装置、情報処理方法、プログラム、及び、記憶媒体

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16813680

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16813680

Country of ref document: EP

Kind code of ref document: A1