WO2021212967A1 - Task scheduling for distributed data processing - Google Patents

Task scheduling for distributed data processing

Info

Publication number
WO2021212967A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
suspended
node
running
resources
Prior art date
Application number
PCT/CN2021/075572
Other languages
English (en)
French (fr)
Inventor
Venkata Ramana GOLLAMUDI
Raghunandan SUBRAMANYA
Ravindra PESALA
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180025709.5A (publication CN115362434A)
Publication of WO2021212967A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/461 Saving or restoring of program or task context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/485 Task life-cycle, e.g. stopping, restarting, resuming execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present subject matter relates, in general, to distributed data processing, and in particular, the present subject matter relates to scheduling of tasks for distributed data processing.
  • a distributed system is a group of computing devices inter-connected over a network and is useful for processing large amounts of data in parallel, such as for big data analytics.
  • Data processing tasks are distributed across the devices to increase reliability and scalability, provide faster data processing, and improve response time.
  • one of the devices acts as a master device that distributes and schedules the tasks amongst the remaining devices, called slave devices.
  • Distributed systems such as Apache Spark, MapReduce, Hive, and the like, which are used for big data processing, are data intensive as well as memory intensive systems, where task scheduling mechanisms are extensively used.
  • aspects of the present invention provide methods and devices for task scheduling to improve the resource utilization of a distributed system.
  • Distributed systems are also referred to as distributed data processing systems.
  • a method for task scheduling in a distributed data processing system comprises receiving, by a selected node, a first task to be executed, a priority indication of the first task, and a first instruction to suspend a running task being executed by the node.
  • the running task is suspended by the node and task resources associated with the running task are saved.
  • the first task is executed by the node and resources from the suspended running task are sequentially released based on resource requirement of the first task during its execution.
  • an acknowledgement regarding a completion of the first task is sent by the node to a master device.
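The node-side method summarized in the bullets above can be sketched as follows. This is a minimal illustration only: the single "memory" resource model and all names (`Node`, `receive_first_task`, etc.) are assumptions for clarity, not drawn from the disclosure.

```python
# Minimal sketch of the node-side suspend-and-resume flow: suspend the
# running task, save its resources, execute the first task while
# releasing suspended-task resources only as needed, then acknowledge.
# All names and the memory-only resource model are illustrative.

class Node:
    def __init__(self, free_memory):
        self.free_memory = free_memory
        self.running = None        # (task_id, memory_held) of the running task
        self.suspended = None      # saved resources of a suspended task
        self.acks = []             # acknowledgements destined for the master

    def receive_first_task(self, task_id, needed_memory, suspend_running):
        # Step 1: suspend the running task and save its task resources.
        if suspend_running and self.running is not None:
            self.suspended = {"task": self.running[0],
                              "memory": self.running[1]}
            self.running = None
        # Step 2: execute the first task, sequentially releasing resources
        # from the suspended task only when the requirement demands it.
        if self.suspended and needed_memory > self.free_memory:
            release = min(needed_memory - self.free_memory,
                          self.suspended["memory"])
            self.suspended["memory"] -= release
            self.free_memory += release
        # Step 3: on completion, acknowledge to the master device.
        self.acks.append(("completed", task_id))
```

For example, a node with 2 GB free that suspends a task holding 4 GB would release only the 3 GB shortfall for a 5 GB first task, leaving 1 GB of the suspended task's resources saved and intact.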
  • a method for task scheduling in a distributed data processing system comprises determining, by a master device, a node from a plurality of nodes for execution of a first task; and sending, by the master device, the first task to be executed, a priority indication of the first task, and a first instruction.
  • the first instruction comprises instructions to suspend a running task, save task resources associated with the running task, and sequentially release resources from the suspended running task based on resource requirement of the first task.
  • a node to execute a task scheduled in a distributed data processing system, comprising a task execution module configured to receive a first task to be executed, a priority indication of the first task, and a first instruction to suspend a running task; suspend the running task and save task resources associated with the running task; and execute the first task and sequentially release resources from the suspended running task based on the resource requirement of the first task.
  • a resource management module is provided in the node to monitor a resource availability based on the resource requirement and a memory management module is provided to control and store data of task resources associated with a suspended task.
  • a master device to schedule a task in a distributed data processing system comprising a scheduling module configured to determine a node from a plurality of nodes for execution of a first task.
  • the master device also includes a sending module configured to send a first task to be executed, a priority indication of the first task, and a first instruction, wherein the first instruction comprises instructions for suspending the running task, saving task resources associated with the running task, and sequentially releasing resources from the suspended running task based on the resource requirement of the first task.
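The master-side step described above amounts to selecting a node and packaging the first instruction. The sketch below is a hedged illustration: the lowest-priority selection rule is a stand-in for the real criteria (data locality, resource availability), and all field names are assumed.

```python
# Illustrative master-side scheduling step: determine a node and form
# the first instruction. The selection rule and field names are
# assumptions, not taken from the disclosure.

def schedule_first_task(nodes, first_task):
    # Stand-in criterion: pick the node running the lowest-priority task.
    selected = min(nodes, key=lambda n: n["running_priority"])
    first_instruction = {
        "suspend_running_task": True,
        "save_task_resources": True,
        "sequentially_release_resources": True,
    }
    # Send the task, its priority indication, and the first instruction.
    return selected["id"], (first_task["id"],
                            first_task["priority"],
                            first_instruction)
```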
  • Fig. 1 illustrates a distributed data processing system for task scheduling, in accordance with an example of the present subject matter.
  • Fig. 2 illustrates a block diagram of a master device to schedule a task in a distributed data processing system, in accordance with an example of the present subject matter.
  • Fig. 3 illustrates a block diagram of a node to execute a task scheduled in a distributed data processing system, in accordance with an example of the present subject matter.
  • Fig. 4 illustrates a schematic representation of a waiting suspended task, in accordance with an example of the present subject matter.
  • Fig. 5 illustrates a schematic representation of a waiting suspended task with partial spill of memory resources, in accordance with an example of the present subject matter.
  • Fig. 6 illustrates a data check-pointing method, in accordance with an example of the present subject matter.
  • Fig. 7 illustrates a task scheduling method in a distributed data processing system, in accordance with an example of the present subject matter.
  • Fig. 8 illustrates a task scheduling method in a distributed data processing system, implemented by a master device, in accordance with an example of the present subject matter.
  • Fig. 9 illustrates a task scheduling method in a distributed data processing system, implemented by a node, in accordance with an example of the present subject matter.
  • resources are reserved for a user for a particular job to be executed to ensure a quick response.
  • the reserving of resources does not allow resource sharing between jobs even if some resources are free.
  • the non-availability of resources for sharing leads to resource wastage, thereby affecting the overall resource utilization of the system.
  • the incoming job has to wait until the resources that are to be reserved for it become free.
  • Fair and first-in-first-out (FIFO) type of task scheduling is implemented for equitably allocating resources.
  • the task scheduling process allows a high priority task to be executed on receipt by pre-empting a running low priority task. This ensures that high priority jobs/queries get a better share of resources, as the high priority tasks are scheduled more often, which improves their response time. However, it degrades the performance of low priority, long running jobs (such as data load jobs or full scan jobs), as in this method the pre-empted task has to restart. In one scenario, where a low priority task is under execution and has completed 95 percent of its execution, if a high priority task is scheduled, the low priority task is still pre-empted and rescheduled for execution after completion of the high priority task. The pre-empted task has to start executing from the beginning on resumption.
  • This method increases the cost involved in restarting the task that was pre-empted. Also, once the pre-empted task has been scheduled to restart, another high priority task may be received and may result in pre-empting the task again. Multiple such interruptions and re-starting the task on every interruption decreases the efficiency and response time of the system.
  • a task scheduling technique designed as a combination of the above-mentioned techniques, such as Fair or FIFO, along with pre-emption increases the complexity of balancing the response time of the system against resource utilization.
  • the present subject matter disclosed herein relates to a task scheduling method in distributed systems, such as systems that deal with big data.
  • the task scheduling method implements a suspend and resume method, where the method comprises suspending a long running low priority task so that a high priority task can be executed first. This process of suspending and resuming a task is performed with various fallback steps such that the cost of resuming a task is minimal.
  • the present subject matter implements a sequentially suspending resource approach in which, based on the resource requirement of the high priority task to be executed and the resources available for executing it, resources from the suspended task may be released in a sequential manner so that the high priority task can be executed completely. Further, the sequentially suspending resource approach is implemented along with the suspend and resume method, which overcomes the issues associated with the techniques mentioned above by effectively utilizing existing resources, avoiding resource wastage, and providing a faster response time.
  • This suspend and resume method of the present subject matter ensures safe suspending and resuming of long running low priority tasks with minimal impact on their execution. Further, the Service Level Agreement (SLA) of high priority tasks can be effectively ensured with minimal impact on the currently running jobs.
  • This method can be implemented on systems with planned workloads as well as systems with unplanned ad hoc workloads, such as query processing.
  • the suspend and resume method of the present subject matter provides better concurrency of execution and can also be implemented with other approaches like resource sharing and task scheduling, such as with Fair, FIFO, and the like.
  • FIG. 1 illustrates a distributed data processing environment 100 for task scheduling in a distributed data processing system, in accordance with an example of the present subject matter.
  • An example distributed data processing environment 100 comprises a master device 102 and a plurality of nodes 104-1, 104-2, 104-3... 104-n, individually referred to as node 104.
  • the master device 102 may receive various incoming jobs 106 from other devices (not shown in the figure) , such as user devices.
  • the user devices may run various applications that send the jobs 106 to the master device 102 over a network (not shown) .
  • the master device 102 may further be inter-connected over a network 110 to the plurality of nodes 104-1, 104-2....104-n, for data processing.
  • the nodes 104 may be slave devices.
  • the master device 102 and each of the plurality of nodes 104-1, 104-2.... 104-n, may be implemented as any computing device, such as a server, a desktop, a laptop, and the like.
  • the network 110 may be a wireless network or a combination of a wired and wireless network.
  • the network 110 can also include a collection of individual networks, interconnected with each other and functioning as a single large network, such as the Internet. Examples of such individual networks include, but are not limited to, Global System for Mobile Communication (GSM) network, Universal Mobile Telecommunications System (UMTS) network, Personal Communications Service (PCS) network, Time Division Multiple Access (TDMA) network, Code Division Multiple Access (CDMA) network, Next Generation Network (NGN), Public Switched Telephone Network (PSTN), Long Term Evolution (LTE), and Integrated Services Digital Network (ISDN).
  • the master device 102 may receive the set of incoming jobs 106 to be scheduled and executed. Every incoming job may be divided into a plurality of tasks by the master device 102. Each task may comprise a plurality of attributes. In one example, each task may comprise a priority indication as one of its attributes. A task may be selected by the master device 102 based on the attributes of the task and may be scheduled to be executed by one of the nodes 104. In one example, the task that may be selected to be executed first may be referred to as a first task.
  • the master device 102 may be configured to determine one or more nodes from the plurality of nodes 104-1, 104-2.... 104-n for sending the tasks, and thereby the jobs, for execution.
  • the determination of the nodes for task execution may be based on various parameters, such as a locality of data of the nodes, resource availability of the node, etc.
  • a node 104 selected from the plurality of nodes 104-1, 104-2.... 104-n for execution of the first task may be referred to as a selected node.
  • the master device 102 may send the first task to be executed, a priority indication of the first task, and a first instruction to the selected node.
  • the first instruction sent by the master device 102 may instruct the selected node 104 to suspend a running task.
  • a running task may be understood as a task that is currently being executed by the selected node.
  • the master device 102 may instruct the selected node 104 to save the position at which the running task was suspended.
  • the master device 102 may instruct the selected node 104 to save the task resources associated with the running task and sequentially release resources from the suspended running task, alternatively referred to as a suspended task, based on the resource requirement of the first task to be executed.
  • the selected node 104 may execute the first task scheduled by the master device 102 and on completion of the first task, the selected node 104 may send an acknowledgement to the master device 102.
  • the acknowledgement sent by the selected node 104 may comprise information regarding a completion of the first task.
  • the master device 102 may send a second instruction to the selected node 104 to resume the suspended task.
  • the selected node, on receiving the second instruction, may resume the suspended task from the state at which it was suspended and may execute the resumed task.
  • the suspended task may be safely resumed from the state at which it was suspended to complete its execution, once the first task has been completely executed, thereby imposing minimal cost for the suspend and resume process.
  • the time taken to complete low priority tasks may not be unduly affected by higher priority tasks.
  • Fig. 2 illustrates a block diagram of a master device to schedule a task in a distributed data processing system, in accordance with an example of the present subject matter.
  • the master device 102 may comprise a processor 200, various modules, including a scheduling module 202, and a memory 204.
  • the memory 204 may include any non-transitory computer-readable medium including, for example, volatile memory, such as RAM, or non-volatile memory, such as EPROM, flash memory, and the like.
  • the memory 204 may store the various modules and the processor 200 may be coupled to the memory 204 to fetch and execute the instructions corresponding to the modules.
  • the various module (s) may be coupled to the processor (s) directly.
  • the module (s) include routines, programs, objects, components, data structures, and the like, which perform particular tasks or implement particular abstract data types.
  • the processor (s) 200 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor (s) 200 may be coupled to the memory 204 to fetch and execute the instructions corresponding to the modules.
  • the functions of the various elements shown in the figure, including any functional blocks labeled as “processor (s) ” may be provided through the use of dedicated hardware as well as hardware capable of executing machine readable instructions.
  • the memory 204 may serve, amongst other things, as a repository or a storage device for storing data that may be fetched, processed, received, or generated by the module (s) .
  • although the memory 204 is shown internal to the master device 102, it may be understood that the data can reside in an external storage device, which may be coupled to the master device 102.
  • the scheduling module 202 of the master device 102 may be configured to receive the jobs 106 that may have to be scheduled and executed from various user devices.
  • the scheduling module 202 may comprise a main task scheduler.
  • the main task scheduler may be a distributed scheduler.
  • the main task scheduler may divide the incoming jobs 106 into a plurality of tasks and schedule these tasks to be executed to the plurality of nodes 104-1, 104-2.... 104-n.
  • the scheduling module 202 may schedule the first task to a particular node. For example, if there are five nodes in the distributed data processing system and the 4th node 104 is currently executing a task with a lower priority than the incoming high priority first task, then the scheduling module 202 of the master device 102 may schedule the high priority first task with the 4th node 104 for its execution. Thus, for the scheduling of the high priority first task, the scheduling module 202 may determine a node 104 from the plurality of nodes 104-1, 104-2.... 104-n and issue a first instruction to be sent to the selected node.
  • the scheduling module 202 of the master device 102 may be configured to send the first instruction to the node 104 that is selected from a plurality of nodes 104-1, 104-2.... 104-n to carry out the execution of the high priority task.
  • the first instruction that may be sent by the scheduling module 202 may instruct the selected node 104 to suspend the running task and save the task resources associated with the running task.
  • the node 104 selected to execute the first task may be instructed to release resources sequentially from the suspended running task, alternatively referred to as a suspended task, based on the resource requirement of the first task.
  • the node 104 may accordingly suspend the running task and execute the high priority first task while sequentially releasing resources based on the resources required by the first task.
  • the memory 204 of the master device 102 may be configured to store the scheduling technique that may be implemented by the main task scheduler and information that may be received by the master device 102, such as the status of the tasks being executed by the plurality of nodes 104-1, 104-2....104-n, the status information of the high priority tasks that may have been scheduled to the nodes, and the like.
  • the scheduling module 202 may receive an acknowledgement from the node 104 when the high priority first task is completed. The scheduling module 202 may then send a second instruction to the node 104 for instructing the node 104 to resume the suspended task. The second instruction may cause the node 104 to resume the suspended task from the point of suspension, as explained below.
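The acknowledgement handling just described can be caricatured in a few lines: on a completion acknowledgement, the master queues the second instruction for the node. The class and message fields below are illustrative assumptions.

```python
# Sketch of master-side acknowledgement handling: when the node reports
# completion of the high-priority first task, queue a second instruction
# telling it to resume the suspended task. Names are illustrative.

class MasterScheduler:
    def __init__(self):
        self.outbox = []   # (node_id, instruction) pairs queued for nodes

    def on_acknowledgement(self, node_id, ack):
        if ack.get("status") == "completed":
            self.outbox.append(
                (node_id, {"second_instruction": "resume_suspended_task"}))
```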
  • Fig. 3 illustrates a block diagram of a node to execute a task scheduled in a distributed data processing system, in accordance with an example of the present subject matter.
  • Each node 104 from the plurality of nodes 104-1, 104-2.... 104-n may comprise a processor 300, a task execution module 302, a resource management module 304 and a memory management module 306.
  • the node 104 may also include a memory 308.
  • the memory 308 may store the various modules and the processor 300 may be coupled to the memory 308.
  • the module (s) include routines, programs, objects, components, data structures, and the like, which perform particular tasks or implement particular abstract data types.
  • the memory 308 serves, amongst other things, as a repository or a storage device for storing data that may be fetched, processed, received, or generated by the module (s) . Although the memory 308 is shown internal to the node 104, it may be understood that the memory 308 can reside in an external storage device, which may be coupled to the node 104.
  • the processor (s) 300 may be implemented as microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor (s) 300 may be coupled to the memory 308 to fetch and execute the instructions corresponding to the modules.
  • the functions of the various elements shown in the figure, including any functional blocks labeled as “processor (s) ” may be provided through the use of dedicated hardware as well as hardware capable of executing machine readable instructions.
  • the task execution module 302 may be configured to receive the first task to be executed, the priority indication of the first task, and the first instruction to suspend the running task from the master device 102.
  • the task execution module 302 may comprise a node task scheduler. The node task scheduler may be implemented to schedule the tasks allocated to the node 104.
  • the node 104 is considered to be the selected node to which a high priority task is allocated and where the first instruction is received for executing the high priority task.
  • the first instruction received by the node 104 may instruct the node 104 to suspend a running task, where the running task is a task that is currently being executed by the node 104.
  • the task execution module 302, in addition to suspending the running task being executed by the selected node 104, may save all the task resources associated with the running task.
  • Execution of the first task by the task execution module 302 may be dependent on the resources available for the first task to be executed and the resources required for executing the first task.
  • the task execution module 302 may sequentially release resources from the suspended task based on resource requirement of the first task.
  • a sequentially suspending resource approach may be implemented. The sequential suspending resource approach may increase the resource availability as per the resource requirement of the first task.
  • the sequential suspending of resources helps to minimize the cost of resuming a suspended task to complete its execution, by making resources available for the first task incrementally on need basis.
  • the task execution module 302 may suspend the running task, which is alternatively referred to as the suspended task.
  • when the running task is suspended, initially only the CPU or processing resources are released, and all the other resources associated with the task, such as memory resources, open handles, file handles, and the like, may remain untouched. The status of these other resources is saved in the memory 308.
  • the released CPU resources may be allocated to the first task by the task execution module 302 of the selected node 104 and the first task may begin its execution.
  • the resource requirement of the first task may be monitored by the resource management module 304.
  • the CPU allocated to the first task along with available/free resources at the node 104 may be sufficient for the execution of the first task and the first task may be completely executed without requiring further resources related to the suspended task.
  • the resource management module 304 may monitor the resource requirements of the task being executed. When the high priority task being executed is resource intensive, then the resource management module 304 may perform one or more processes sequentially to ensure the execution of the high priority task is not affected. In one example implementation, in a first process the resource management module 304 may trigger a memory management module 306 to incrementally release the memory occupied by the suspended task by spilling a few objects of the memory into the hard disk.
  • the resource management module 304 may directly instruct the memory management module 306 to partially suspend the memory to increase resource availability based on the resource requirement of the first task being executed. In another example, the resource management module 304 may send a message to the master device 102 to instruct the memory management module 306 to partially suspend the memory to increase resource availability based on the resource requirement of the first task being executed.
  • the memory management module 306 may be inbuilt in the node 104. In another example, when a node does not comprise an inbuilt memory management module 306, the memory management module 306 may be implemented in the data processing system 100.
  • the memory management module 306 may be a custom memory management module.
  • the custom memory management module may be implemented to identify the various blocks of data stored by the suspended task. When the resource requirement of the high priority task increases, in one example, the custom memory management module may spill objects stored in the memory to the hard disk, which results in the release of additional resources for the high priority task to be completed, as explained in detail later.
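A custom memory management module of the kind described above might track suspended-task data as identifiable blocks and spill them on demand. The block/size model and names below are illustrative assumptions, not the disclosed implementation.

```python
# Sketch of a custom memory management module: track the data blocks of
# a suspended task and spill them to the (simulated) hard disk until
# enough memory is freed. Block/size model and names are illustrative.

class MemoryManager:
    def __init__(self, blocks):
        self.in_memory = dict(blocks)   # block_id -> size held in memory
        self.on_disk = {}               # blocks spilled to the hard disk

    def spill(self, amount_needed):
        """Move blocks to disk until at least amount_needed is freed
        (or nothing remains); return the amount actually freed."""
        freed = 0
        for block_id in list(self.in_memory):
            if freed >= amount_needed:
                break
            size = self.in_memory.pop(block_id)
            self.on_disk[block_id] = size
            freed += size
        return freed
```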
  • the resource management module 304 may check if a data checkpointing method may be performed for the suspended task.
  • in one example, a data checkpointing method may be performed for the suspended task.
  • in another example, the suspended task may be pre-empted.
  • in one example, an epoch-based checkpointing method may be implemented for the suspended task, and the suspended task may then be terminated.
  • the checkpointing method may be implemented as known in the art for both streaming data processing tasks as well as non-streaming data processing tasks, as will be explained in detail with respect to Fig. 6.
  • a streaming data processing system may be a Spark system.
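The idea behind epoch-based checkpointing, as referenced above, is that progress is recorded at the boundary of the last completed epoch, so a terminated task can later restart from that boundary rather than from the beginning. The toy below assumes record-counted epochs, which is an illustrative simplification.

```python
# Toy illustration of epoch-based checkpointing: record the boundary of
# the last completed epoch so a terminated task can restart from there.
# Record-counted epochs are an assumption made for illustration.

def checkpoint_at_epoch(processed_records, epoch_size):
    last_epoch = processed_records // epoch_size
    return {"epoch": last_epoch,
            "resume_from_record": last_epoch * epoch_size}
```

For instance, a task terminated after 950 records with 100-record epochs would resume from record 900 rather than record 0.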
  • the task execution module 302 may send an acknowledgement to the master device 102 and may receive a second instruction to resume the suspended task. Accordingly, the task execution module 302 may resume the suspended task from the point of suspension. For example, if only the CPU was released at the time of suspension without releasing any of the other resources, the task execution module 302 may allocate CPU resources back to the suspended task and use the information saved for the other resources to resume the execution of the suspended task. In another example, if, following the first process, partial memory resources had been spilled to the hard disk, the task execution module 302 may use the memory management module 306 to retrieve the spilled memory resources, allocate the CPU resources to the suspended task, and resume its execution based on other saved data, such as open file handles, etc.
  • the task execution module 302 may use the memory management module 306 to obtain the data checkpoint and then resume the suspended task.
  • if the suspended task had been pre-empted following the second process, then the suspended task is re-started from the beginning. Thus, the suspended task has to be re-started only in a few conditions and can be resumed from the point of suspension in other cases, for more efficient data processing.
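The resume cases walked through above can be collapsed into a small decision function. The state flags below are hypothetical labels for the conditions described, introduced purely for illustration.

```python
# Decision sketch for resuming a suspended task, covering the cases
# described above. The state flags are illustrative assumptions.

def resume_action(state):
    if state.get("checkpointed"):
        return "resume from data checkpoint"
    if state.get("preempted"):
        return "restart from beginning"
    if state.get("spilled"):
        return "retrieve spilled memory, reallocate CPU, resume"
    return "reallocate CPU, resume from saved state"
```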
  • Fig. 4 illustrates a schematic representation of a waiting suspended task, in accordance with an example of the present subject matter.
  • Every task that may be executed by the task execution module 302, when suspended, may periodically check a wait point 402.
  • the suspended task may periodically check the wait point 402 for an instruction to resume.
  • the instruction may be provided by the master device 102 as the second instruction.
  • the suspended task, on reaching the wait point, may either resume, be pre-empted, or continue to wait in the suspended state and monitor the status at the wait point 402 periodically.
  • the running task, or the task that is currently being executed, may also periodically check the wait point 402 for an instruction, such as the first instruction.
  • the instruction may be provided by the master device 102.
  • the running task, on reaching the wait point 402, may suspend itself or continue executing, depending on whether it receives the instruction or not.
  • the CPU of the suspended task may be given to the high priority task to be executed, by the operating system of that particular node 104.
  • the suspended task saves the task resources 404 associated with the running task, other than the CPU.
  • the resources 404 like open handles, file handles and the like may be saved in their current state when a task is suspended.
  • a memory associated with the suspended task 406 may be saved in its current state when the task is suspended.
  • the memory associated with the suspended task 406, and other resources 404 like open handles, file handles and the like may resume from the status at which they were saved during suspension.
  • the node 104 may receive the second instruction to resume the suspended task.
  • the second instruction may comprise a wakeup event 408, where the wakeup event 408 is to resume the suspended task.
  • the task execution module 302 of the node 104 continues to execute the suspended task.
  • the low priority long running job may be suspended again to execute the high priority task.
  • the low priority task may be suspended and resumed multiple times without having to restart every time.
  • overall responsiveness of the distributed data processing system 100 for the low priority task is improved.
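The suspend-without-restart behaviour described above can be illustrated with a small sketch. The `Task` class and its fields are hypothetical stand-ins; the point is only that saved state lets a task resume where it stopped instead of restarting.

```python
class Task:
    """Illustrative long-running task whose progress survives
    suspend/resume cycles instead of restarting from zero."""
    def __init__(self, total_items):
        self.total_items = total_items
        self.processed = 0
        self.saved_state = None   # stand-in for saved resources 404/406

    def run(self, budget):
        """Process up to `budget` items, restoring any saved state."""
        if self.saved_state is not None:
            self.processed = self.saved_state["processed"]
            self.saved_state = None
        while budget > 0 and self.processed < self.total_items:
            self.processed += 1
            budget -= 1
        return self.processed

    def suspend(self):
        """Save the current process state before yielding the CPU."""
        self.saved_state = {"processed": self.processed}
```

After `run(4)` and a `suspend()`, a later `run(3)` continues from item 5 rather than item 1, so repeated suspension adds no restart cost.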
  • Fig. 5 illustrates a schematic representation of a waiting suspended task with partial spill of memory resources, in accordance with an example of the present subject matter.
  • memory resources are additionally released to the high priority task for its execution.
  • the resource management module 304 monitors the resource requirement of the first task being executed and determines the resource availability for executing the first task.
  • the resource management module 304 may allocate resources to the first task for execution. In one example, the resource management module 304 first releases only the CPU associated with the suspended task as discussed above. The resource management module 304 may then monitor the resource requirement of the high priority task being executed and if the resource requirement increases, the resource management module 304 may trigger the memory management module 306 to partially release the memory associated with the suspended task 406 by partially spilling 502 the memory associated with the suspended task to a disk 504. In one example, the partial spilling 502 of the memory associated with the suspended task comprises spilling of a memory object. A memory object of the suspended task may be stored by the memory management module 306 in the form of data blocks, so that the memory management module 306 may identify the data blocks allocated to the suspended task. In one example, the disk 504 may be the hard-disk of the slave system.
  • the resource management module 304 may trigger the memory management module 306 to spill a particular amount of data, for example, 1 GB of data, onto the disk, and then the resource management module 304 may continue to monitor the resource requirement of the first task being executed. When it is observed that the first task being executed requires more resources, in this case more than 1 GB of memory, associated with the suspended task, the resource management module 304 may further trigger the memory management module 306 to spill additional 1GB of data into the disk. This sequential releasing of resources may be performed till the memory management module 306 sends a message to the resource management module 304, that the memory is full and further data cannot be spilled into the disk 504. The resource management module 304, in turn, informs the master device 102 of the same.
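The incremental spilling loop described above can be sketched as follows. The function name and parameters are assumptions made for this sketch; memory and disk are modelled as simple lists of block sizes (say, in GB).

```python
def release_memory_incrementally(suspended_blocks, disk, disk_capacity,
                                 required, available):
    """Spill the suspended task's data blocks to disk one at a time
    until the running task's memory requirement is met.

    suspended_blocks: sizes of blocks held in memory by the suspended
    task; disk: list receiving spilled blocks; required/available:
    memory needed by and free for the high priority task.
    Returns (available, disk_full)."""
    while available < required and suspended_blocks:
        block = suspended_blocks[0]
        if sum(disk) + block > disk_capacity:
            return available, True   # disk full: inform the master device
        suspended_blocks.pop(0)
        disk.append(block)           # partial spill of one data block
        available += block           # freed memory goes to the first task
    return available, False
```

Each iteration frees exactly one block, mirroring the 1 GB-at-a-time release, and the loop stops as soon as the requirement is met, so no more of the suspended task's memory is taken than needed.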
  • the master device 102 may check if a data checkpointing method can be implemented in the system. In one example, when the master device 102 determines that the data checkpointing method cannot be implemented, the master device 102 sends an instruction to the selected node 104 to pre-empt the task. In another example, when the master device 102 determines that the data checkpointing method can be implemented in the system, the data checkpointing method is performed.
  • Fig. 6 illustrates a data check-pointing method, in accordance with an example of the present subject matter.
  • data flow may be represented in a pipeline 604.
  • the data flow in the pipelines 602 and 604, in one example, may be streaming data for real-time analysis or, in another example, may be a non-streaming block of data applied in non-streaming processing systems to checkpoint the processed state.
  • a continuous churning of data takes place and hence, checkpointing of data at every stage is performed.
  • the stages and checkpoints may be predefined as part of the data processing steps and the results of the data processing may be stored at each of the checkpoints. In case of any interruption, any further processing can be resumed from the last completed checkpoint instead of having to start from the beginning of the process.
  • a pipeline 604 of the system may be counting a number of cards.
  • the card deck may comprise blue color cards and green color cards. Five blue cards and ten green cards may be counted in the same sequence until the total number of cards is 100.
  • a checkpoint 606 may store a state of a process.
  • the checkpoint 606a may be set for every 100 cards that have been processed.
  • the checkpoint 606b stores that up till this particular point 100 cards have been read, and contains an x value of blue cards and a y value of green cards, where x and y may be integer values from 1 to n.
  • the checkpoint 606b stores a count value as 100.
  • the next checkpoint 606c stores the count value of 200 and so on. So, at every checkpoint 606, the state of the process is stored.
  • the checkpoint 606 may store the process status in the same slave device.
  • the checkpoint 606 may store the process state in an external storage which may be shared by a multiple number of slave devices and the master device 102.
  • the process state may be the number of blue cards and the number of green cards counted in total.
  • the latest checkpoint may store the process state executed until that check point.
  • the process state stored is a count value of 100 cards.
  • Suspending a task by the data checkpointing method may be equivalent to pre-empting a task, except that the latest process state just before suspending the task is saved, and resuming the task can be done from that checkpoint.
  • resuming the suspended task can be done by re-executing the suspended task from the previous checkpoint 606b.
  • the resume checkpoint 606f will take the value of the checkpoint 606b. Processing is thus re-executed from the checkpoint 606b, and the process state value, the count value in this example, will be 100 cards. Hence the task will be resumed with the count value equal to 100.
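The card-counting example can be made concrete with a short sketch. All names here are hypothetical; the point is that a checkpoint stores the process state every 100 cards, and resuming replays only from the last stored checkpoint.

```python
def count_cards(deck, checkpoint_every=100, checkpoint=None):
    """Count blue and green cards with periodic checkpoints.

    deck: list of card colors ("blue" or "green"); checkpoint: a state
    previously returned in `checkpoints`, used to resume after an
    interruption. Returns (final_state, checkpoints)."""
    state = {"total": 0, "blue": 0, "green": 0}
    if checkpoint is not None:
        state = dict(checkpoint)             # resume from the saved state
    checkpoints = []
    for card in deck[state["total"]:]:       # skip cards already counted
        state[card] += 1
        state["total"] += 1
        if state["total"] % checkpoint_every == 0:
            checkpoints.append(dict(state))  # store the process state
    return state, checkpoints
```

With a 300-card deck dealt as five blue then ten green repeatedly, checkpoints are stored at 100, 200, and 300 cards; resuming from the checkpoint at 100 recounts only the remaining 200 cards yet reaches the same final state.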
  • the resuming of the suspended task by the data checkpointing method may be executed on the same node that was executing the task earlier. In another example, it may be executed on a different system connected to the master device 102 on the same distributed network. In one example, the master device 102 may determine the node 104 on which the suspended task may be resumed by the data checkpointing method. The resuming of the suspended task may be performed by sharing the checkpoint information with the node 104 determined to resume the task.
  • the data checkpointing method can be implemented in systems with systematic data processing, but in systems that process data by shuffling, iterative computations, and the like, the data checkpointing method may be complex to implement. However, sequential suspension of resources, by suspending only the CPU at the first instance or incrementally releasing the memory associated with the suspended task by spilling a few objects of the memory into the hard disk, may still be performed in such systems.
  • Fig. 7 illustrates a task scheduling method in a distributed data processing system, in accordance with an example of the present subject matter.
  • the master device 102 selects a node 104 from a plurality of nodes 104-1, 104-2, ..., 104-n over the network 110 to schedule a task.
  • Node 1 is selected by the master device 102 to execute the first task.
  • the master device has scheduled two high priority jobs on Node 1.
  • the two high priority jobs are represented as Job 3 and Job 2 which comprise the high priority tasks, Task 1 of Job 3 and Task 1 of Job 2.
  • a low priority job, represented as Job 1 and comprising a Task 1, was being executed by Node 1, represented as low priority Task 1 of Job 1.
  • a first instruction is sent to the Node 1 to suspend the low priority Task 1 of Job 1 and execute the high priority Task 1 of Job 3.
  • the Node 1 suspends the low priority running task and schedules the high priority Task 1 of Job 3 to be executed.
  • the resource management module 304 monitors the resource requirement of the high priority task and the resource availability for the first task to be executed as explained above and implements the sequentially suspending resource approach. Resources from the suspended task are released sequentially to provide resources to the high priority task, based on the resources required by the high priority task being executed. The sequential suspending of resources is also explained in detail with Fig. 9.
  • the master device 102 sends a second instruction to the Node 1, to resume the execution of the suspended low priority task 1 of Job 1.
  • the master device 102 may instruct the Node 1 to execute the high priority Task 1 of Job 2 after completely executing the high priority Task 1 of Job 3, before resuming the low priority Task 1 of Job 1.
  • Fig. 8 and 9 illustrate task scheduling methods in a distributed data processing system as implemented by a master device and a node respectively, in accordance with an example of the present subject matter.
  • the order in which the methods 800 and 900 are described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the methods 800 and 900 or an alternative method. Additionally, individual blocks may be deleted from the methods 800 and 900 without departing from the spirit and scope of the subject matter described herein.
  • the methods 800 and 900 may be implemented in any suitable hardware, computer readable instructions, firmware, or combination thereof. For discussion, the methods 800 and 900 are described with reference to the implementations illustrated in Figs. 1-7.
  • steps of the methods 800 and 900 can be performed by programmed computing devices.
  • some examples are also intended to cover program storage devices and non-transitory computer readable media, for example, digital data storage media, which are computer readable and encode computer-executable instructions, where said instructions perform some or all of the steps of the described methods.
  • the program storage devices may be, for example, digital memories, magnetic storage media, such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.
  • the method 800 comprises determining, by a master device, a node from a plurality of nodes for execution of a first task.
  • the method 800 comprises sending, by the master device, a first task to be executed, a priority indication of the first task, and a first instruction, wherein the first instruction comprises suspending the running task, saving task resources associated with the running task, and sequentially releasing resources from the suspended running task based on the resource requirement of the first task.
  • the priority indication of the first task may be indicated as a high priority.
  • the master device of method 800 may be implemented by the master device 102 of the distributed data processing system.
  • the master device 102 as discussed above may include a distributed task scheduler which schedules tasks to various nodes based on the priority of the tasks to be executed.
  • the node may be the node 104.
  • the first instruction that may be sent to the node 104 comprises suspending a task currently being executed by the selected node 104 and waiting for the completion of the first task.
  • when the master device 102 receives an acknowledgement regarding a completion of the first task by the node 104, the master device 102 sends a second instruction.
  • the second instruction comprises resuming the suspended task by sending a wakeup event, where the wakeup event is to resume the suspended task.
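An illustrative master-side flow of method 800 follows. The message contents, the `send` callable, and the `receive_ack` callable are assumptions made for this sketch; the patent does not prescribe a wire format or a node-selection policy.

```python
def schedule_first_task(nodes, task, send, receive_ack):
    """Sketch of method 800: pick a node, send the first task with a
    high priority indication and the first instruction, then send the
    second instruction once completion is acknowledged.

    nodes: [{"id": ..., "load": ...}]; send(node_id, message) delivers
    a message; receive_ack(node_id) blocks until the node replies."""
    node_id = min(nodes, key=lambda n: n["load"])["id"]  # choose a node
    send(node_id, {
        "task": task,
        "priority": "high",                              # priority indication
        "instruction": "suspend-running-task-and-release-sequentially",
    })
    if receive_ack(node_id) == "completed":              # first task done
        send(node_id, {"instruction": "resume", "event": "wakeup"})
    return node_id
```

Here the least-loaded node stands in for whatever selection criterion the distributed task scheduler applies, and the final message carries the wakeup event that resumes the suspended task.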
  • the method 900 comprises receiving, by a node, a first task to be executed, a priority indication of the first task, and a first instruction to suspend a running task.
  • the priority of the first task is higher than the priority of the running task.
  • the first instruction to the selected node 104 may be sent by the master device 102.
  • the method 900 comprises suspending, by the node, the running task and saving task resources associated with the running task.
  • all the resources associated with the suspended task may be saved at their current process state.
  • information regarding the suspension of the running task may be sent to the master device 102 as an acknowledgement.
  • the resources other than the CPU comprise a memory associated with the suspended task, open handles such as file handles, external links with the operating system or open sockets, and the like.
  • the process state of these open handles may be saved when the task is suspended.
  • for example, the process state of five files being open at the time of suspending the task is saved.
  • the information regarding the process state that is saved when a task is suspended may be utilized when the suspended task is resumed, so that the task can continue to be executed from the process state at which it was suspended, without re-executing the task from the beginning.
  • the method 900 comprises executing, by the node, the first task and sequentially releasing resources from the suspended running task based on a resource requirement of the first task.
  • the process of sequentially releasing resources from the suspended task comprises releasing only a CPU at a first instance, without releasing a resource other than the CPU associated with the suspended task.
  • the resources other than the CPU comprise a memory associated with the suspended task, open handles such as file handles, external links with the operating system or open sockets, and the like.
  • the resource management module 304 may be configured to release only the CPU at a first instance. Further, the resource requirement of the first task being executed is monitored by the resource management module 304. The resource availability for executing the first task is then determined by the resource management module 304, and resources are allocated to the first task for its execution.
  • sequentially releasing resources from the suspended task further comprises determining whether there is an increase in the resource requirement of the first task after releasing only the CPU, and suspending the memory associated with the suspended task by incrementally releasing it at a second instance, based on the determined resource requirement.
  • the resource management module 304 allocates resources to the first task by triggering a memory management module 306 to partially spill the memory associated with the suspended task to a disk. In one example, the memory management module 306 spills the memory related to the data stored by the suspended task.
  • the master device 102 in one example may decide to pre-empt the suspended task.
  • the resource management module 304 may then allocate all the resources of the suspended task to the first task to complete its execution.
  • the master device determines whether there is an increase in the resource requirement of the first task after releasing only the CPU and incrementally releasing the memory associated with the suspended task.
  • the master device determines whether a data check-pointing method can be implemented.
  • suspending the task by a data check pointing method may be implemented as discussed with respect to Fig. 6.
  • the suspended task may be pre-empted, and all the resources associated with the suspended task may be assigned to the first task to complete its execution.
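The escalation order in the bullets above, CPU first, then partial memory spill, then checkpoint-and-suspend or pre-emption as a last resort, can be captured as a small decision function. The stage names and arguments are hypothetical labels for this sketch.

```python
def next_release_step(stage, requirement_met, disk_full, checkpointing_ok):
    """Decide the next step in sequentially releasing a suspended
    task's resources to the first task.

    stage: the last release performed ("none" or "cpu");
    requirement_met: whether the first task now has enough resources;
    disk_full: whether no further memory spilling is possible;
    checkpointing_ok: whether a data checkpointing method applies."""
    if requirement_met:
        return "continue"                 # no further release needed
    if stage == "none":
        return "release_cpu"              # first instance: CPU only
    if stage == "cpu" and not disk_full:
        return "spill_memory"             # second instance: partial spill
    if checkpointing_ok:
        return "checkpoint_and_suspend"   # save state, free all resources
    return "preempt"                      # last resort: restart later
```

Each call resolves one decision point, so the node can re-evaluate after every release step as the first task's requirement changes.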
  • the method comprises sending, by the node, an acknowledgement regarding a completion of the first task to a master device.
  • the node 104 may receive, from the master device, a second instruction to resume the suspended task.
  • the second instruction may be sent by the master device based on a time taken for the complete execution of the first task.
  • the node 104 then resumes the suspended task, from the state at which the running task was suspended.
  • the memory management module 306 provides the suspended memory associated with the suspended task from a state at which the running task was suspended. In one example, the memory management module 306 provides the data that was partially spilled at the second instance back to the task once it resumes its execution.
  • the present subject matter thus provides a suspend and resume task scheduling method for distributed data processing systems with a sequential suspension of resources, which provides for better resource utilization and improves the response time of the data processing system analyzing big data.
  • the present subject matter also minimizes the overall cost incurred in restarting a pre-empted task by providing the sequential suspension of resources technique.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Retry When Errors Occur (AREA)
PCT/CN2021/075572 2020-04-24 2021-02-05 Task scheduling for distributed data processing WO2021212967A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180025709.5A CN115362434A (zh) 2020-04-24 2021-02-05 Task scheduling for distributed data processing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202041017714 2020-04-24
IN202041017714 2020-04-24

Publications (1)

Publication Number Publication Date
WO2021212967A1 true WO2021212967A1 (en) 2021-10-28

Family

ID=78270254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075572 WO2021212967A1 (en) 2020-04-24 2021-02-05 Task scheduling for distributed data processing

Country Status (2)

Country Link
CN (1) CN115362434A (zh)
WO (1) WO2021212967A1 (zh)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7721291B2 (en) * 2004-10-15 2010-05-18 International Business Machines Corporation Apparatus, system, and method for automatically minimizing real-time task latency and maximizing non-real time task throughput
CN103699445A * 2013-12-19 2014-04-02 北京奇艺世纪科技有限公司 Task scheduling method, apparatus and system
CN104951372A * 2015-06-16 2015-09-30 北京工业大学 Prediction-based dynamic memory resource allocation method for a Map/Reduce data processing platform
CN107168777A * 2016-03-07 2017-09-15 阿里巴巴集团控股有限公司 Method and apparatus for scheduling resources in a distributed system
CN109086135A * 2018-07-26 2018-12-25 北京百度网讯科技有限公司 Resource scaling method, apparatus, computer device and storage medium


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080254A * 2022-08-24 2022-09-20 北京向量栈科技有限公司 Method and system for adjusting computing task resources in a computing cluster
CN115080254B * 2022-08-24 2023-09-22 北京向量栈科技有限公司 Method and system for adjusting computing task resources in a computing cluster
CN117149440A * 2023-10-26 2023-12-01 北京趋动智能科技有限公司 Task scheduling method and apparatus, electronic device and storage medium
CN117149440B * 2023-10-26 2024-03-01 北京趋动智能科技有限公司 Task scheduling method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN115362434A (zh) 2022-11-18

Similar Documents

Publication Publication Date Title
US8914805B2 (en) Rescheduling workload in a hybrid computing environment
US11941434B2 (en) Task processing method, processing apparatus, and computer system
US6411982B2 (en) Thread based governor for time scheduled process execution
CA2785398C (en) Managing queries
US7996593B2 (en) Interrupt handling using simultaneous multi-threading
US9507631B2 (en) Migrating a running, preempted workload in a grid computing system
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
US7441240B2 (en) Process scheduling apparatus, process scheduling method, program for process scheduling, and storage medium recording a program for process scheduling
CA3168286A1 (en) Data flow processing method and system
US8689226B2 (en) Assigning resources to processing stages of a processing subsystem
CN109564528B (zh) 分布式计算中计算资源分配的系统和方法
US20130061220A1 (en) Method for on-demand inter-cloud load provisioning for transient bursts of computing needs
WO2021212967A1 (en) Task scheduling for distributed data processing
CN107515781B (zh) 一种基于多处理器的确定性任务调度及负载均衡系统
JPWO2009150815A1 (ja) マルチプロセッサシステム
Liu et al. Optimizing shuffle in wide-area data analytics
US10523746B2 (en) Coexistence of a synchronous architecture and an asynchronous architecture in a server
CN114237891A (zh) 资源调度方法、装置、电子设备及存储介质
CN116048756A (zh) 一种队列调度方法、装置及相关设备
JP2013152513A (ja) タスク管理システム、タスク管理サーバ、タスク管理方法、及びタスク管理プログラム
US11403138B2 (en) Method and electronic device for handling relative priority based scheduling procedure
CN112328359A (zh) 避免容器集群启动拥塞的调度方法和容器集群管理平台
CN115080199A (zh) 任务调度方法、系统、设备、存储介质及程序产品
CN117311957A (zh) 一种资源调度方法、装置及系统
CN115794449A (zh) 动态线程池构建方法、远程过程调用方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21792393

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21792393

Country of ref document: EP

Kind code of ref document: A1