WO2009113034A1 - Look-ahead task management - Google Patents

Look-ahead task management

Info

Publication number
WO2009113034A1
WO2009113034A1 PCT/IB2009/051035 IB2009051035W WO2009113034A1 WO 2009113034 A1 WO2009113034 A1 WO 2009113034A1 IB 2009051035 W IB2009051035 W IB 2009051035W WO 2009113034 A1 WO2009113034 A1 WO 2009113034A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
ready
tasks
pointer
finished
Prior art date
Application number
PCT/IB2009/051035
Other languages
French (fr)
Inventor
Andrei Sergeevich Terechko
Ghiath Al-Kadi
Marc Andre Georges Duranton
Magnus SJÄLANDER
Original Assignee
Nxp B.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nxp B.V. filed Critical Nxp B.V.
Priority to US12/921,573 priority Critical patent/US20110004881A1/en
Publication of WO2009113034A1 publication Critical patent/WO2009113034A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/484 Precedence

Definitions

  • the present application relates to a method comprising receiving tasks for execution on at least one processor, and processing at least one of the tasks within one processor.
  • the application further relates to a task management unit comprising input means for receiving tasks for execution on at least one processor, a microprocessor comprising a storage for storing task information, a system with a task management unit and a microprocessor, as well as a computer program comprising instructions operable to cause a task management unit to receive tasks for execution on at least one processor.
  • the current trend in computer architecture is to use more and more microprocessors, a.k.a. cores, within one chip for processing tasks in parallel to increase application performance.
  • the application performance is increased.
  • the code for managing task scheduling might become a bottleneck for a huge number of small tasks.
  • the code for managing tasks is generally simple, consisting of arithmetic operations such as addition, subtraction, comparing, branching, and atomic loads and stores.
  • the parallel processing requires checking dependencies of tasks, e.g., whether one task can be started or not depending on other tasks that might be necessary to be executed beforehand. Therefore, dependencies of tasks need to be updated for each finished task, such that other tasks can become ready to be executed. If the dependency check is executed after a task has finished and the dependencies have been updated, the current dependency state is known. This allows for verifying which tasks can be executed. However, the dependency check can introduce delays, since the check is performed before the next task can be executed.
  • architectures with task queues are known.
  • the execution of a task is followed by a piece of code for updating dependencies and checking for a task ready status or not.
  • Fig. 1 illustrates a commonly known dependency check with twelve different tasks 2, 4, 6, 8.
  • tasks 2a-2c are executed.
  • tasks 4a-4c are executed.
  • tasks 6a-6c are executed.
  • the tasks 8a-8c are executed.
  • twelve different tasks 2, 4, 6, 8 are executed on four separate cores 10-16.
  • a task dependency check 18 is executed.
  • In Fig. 1, for reasons of simplicity, it is assumed that each task is identical in execution time. As can be seen, the dependency check operation 18 consumes time, during which the cores 10-16 are not operative, i.e. do not process a particular task. For example, for a video decoder under the H.264 standard, it has been found that the dependency check operation 18 increases the overall task execution time by 9% on average. In the embodiment according to Fig. 1, this results in a requirement of one complete core for managing the dependency check for every eleven other cores in the architecture.
  • a method comprising receiving tasks for execution on at least one processor, processing at least one of the tasks within one processor, parallel to processing the at least one task, verifying readiness of at least one next task assuming the currently processed task is finished, preparing a ready-structure for the at least one task verified as ready, and starting the at least one task verified as ready using the ready-structure after the currently processed task is finished.
  • verifying the readiness of at least one next task may comprise checking task dependencies between the at least one received task, and the currently processed task. This allows for checking, as a look ahead technique, whether at least one of the received tasks may be ready for execution, once the currently processed task is finished, in parallel with the actual execution of the task. If the at least one received task, which is not executed yet, only depends on the currently processed task, it can be marked as ready even during execution of the currently processed task. This look-ahead technique provides for reducing the start time of the received tasks after the currently processed task is finished.
  • a task queue may store at least one of the ready-structures of tasks and/or the task verified as ready.
  • several processors may verify the readiness of at least one next task. The results of this verification can be a plurality of tasks in the ready stage.
  • This at least one ready task can be stored in the task queues.
  • the task queues do provide information about tasks in the ready state which are currently not being executed by a processor. This way, tasks may be distributed between different cores. The distribution of task queues allows for storing information about ready tasks within a scalable architecture.
  • the ready-structure may comprise at least one of a function pointer and/or an argument list.
  • the function pointer may point to the first instruction of the task being verified as ready.
  • the argument list may comprise information about arguments for the task to be executed.
  • the argument list may be used for a data prefetching.
  • the arguments for the task to be executed next may already be fetched while the currently processed task is being processed, allowing the next task to start immediately after the currently processed task is finished.
  • a partially-ready-structure for at least one task which is not verified as ready is provided.
  • the partially-ready-structure allows for providing information about task dependencies of tasks which are not ready in the next processing sequence.
  • the partially-ready-structure may comprise information about task dependencies being not met. Thus, if dependencies have not been satisfied, the dependencies may be stored in the partially-ready-structure. It may be possible that after the started regular task ends, the unsatisfied dependencies being stored in the partially-ready-structure are checked. This way dependencies already satisfied during the execution of the current tasks will not delay next task creation.
  • the verification of the partially-ready-structure may be possible with a reduced software overhead. According to embodiments, verifying readiness of at least one task within a partially-ready-structure after a currently processed task is finished is possible.
  • a processor may comprise, according to embodiments, a dedicated storage area that may hold necessary information about candidate tasks, i.e. tasks with a partially-ready-structure. Each processor may directly access the information about the tasks to be executed.
  • the dedicated storage may also hold information about ready tasks, i.e. with a ready-structure. It may also be possible.
  • the task information may comprise at least one of a task pointer, a look-ahead pointer, a dependency pointer, an argument pointer, or a flag.
  • the task pointer may hold information about the instruction address of the first instruction of the task.
  • the argument pointer may hold the address to where arguments for the tasks are stored.
  • the look-ahead pointer may comprise information about a look-ahead function to be executed if the task will be executed by the core. This function may allow for calculating and determining, which dependencies are resolved, when the currently processed task is executed.
  • a dependency pointer may hold the address to a memory location that stores the number of dependencies that still have to be resolved before the task can be executed.
  • a flag may be used for synchronizing the processor with a task management unit.
  • the information about the task stored in the processor allows for speeding up the turn around time between tasks being executed.
  • the flag may allow for calculating and determining, which dependencies are resolved, when the currently processed task is executed.
  • the flag may be one bit used for synchronizing between the task management unit and the processor.
  • the flag may also comprise several bits, indicating, for example, the state of a task, the time of processing, i.e. while it is executed. If a task is ready for execution, then the task pointer and argument pointer will be read and the processor can start the execution of the new task.
  • the task management unit can then, in parallel with the execution of the task, decrement the value given by the dependency pointer for all the tasks not being executed.
  • in case there is no ready task when the processor finishes a currently processed task, it can wait until task dependencies are updated and a task becomes ready for execution.
  • the speed-up of verifying a ready status may be achieved in that only the dependencies of candidate tasks not found ready for execution by the look-ahead function need to be updated.
  • the look-ahead function may check which tasks may be necessary in the future. If these tasks are dependent on the currently processed task, their dependency can be updated. If tasks are ready, no update is necessary. Therefore, the look-ahead function reduces the number of dependency checks.
  • dependency information for tasks from the current task may be obtained from the task information.
  • a task management unit comprising input means for receiving tasks for execution on at least one processor, verifying means arranged for verifying readiness of at least one next task, assuming the currently processed task is finished, parallel to processing the at least one task, preparation means arranged for preparing a ready-structure for the at least one task verified as ready, and output means for putting out the ready-structure after the currently processed task is finished for starting the at least one task verified as ready.
  • a microprocessor comprising a storage for storing task information, where the storage comprises a memory area for storing a task pointer, a storage area for storing an argument pointer, and a storage area for storing a dependency pointer.
  • access means may be provided for providing access to the storage for storing task information using a task management unit as previously described.
  • Another aspect is a system with a task management unit and a microprocessor as previously described.
  • a further aspect is a computer program comprising instructions operable to cause the task management unit to receive tasks for execution on at least one processor, provide the task for processing to at least one processor, parallel to processing the at least one task verify readiness of at least one next task assuming the currently processed task is finished, prepare a ready-structure for the at least one task verified as ready, and start the at least one task verified as ready, using the ready-structure after the currently processed task is finished within the processor.
  • Fig. 1 illustrates task execution for a conventional architecture
  • Fig. 2 an illustration of dependencies between macro-blocks within a video compression standard
  • FIG. 3 an illustration of a task dependency graph
  • Fig. 4 an illustration of execution of tasks according to embodiments
  • Fig. 5a a ready structure for a task
  • Fig. 5b a partially-ready structure for a task
  • Fig. 6 an illustration of an architecture with several processors and several task management units
  • Fig. 7 an illustration of task information
  • Fig. 8 a schematic illustration of a task management unit.
  • H.264 video decoding is described hereinafter by way of example.
  • H.264 video decoding in super HD requires a multi-core architecture, to reach the performance necessary for decoding 30 to frames per second.
  • each frame being decoded is first entropy decoded, consisting of either context-adaptive binary arithmetic coding or a context-adaptive variable length coding, which both are sequential by their natures.
  • a frame is then passed on to a picture prediction stage, where each frame is divided into macro blocks, for example 16 times 16 pixels. For each macro block, inter-picture prediction and motion vector estimation are calculated.
  • the frame is then filtered through a deblocking filter to reduce artifacts from the picture prediction stage at block boundaries.
  • the resulting frame has then been decoded and can be passed onto the display.
  • the picture prediction and deblocking filter are suitable for parallelization, where the execution of the macro-block can be treated as a task. Such execution is illustrated in Fig. 2.
  • As can be seen in Fig. 2, there are several macro blocks 42 at boundaries to a macro block 44. In order to process picture prediction and deblocking of macro block 44, it is necessary that macro blocks 42 are executed before macro block 44 is filtered. Thus, macro-block 44 cannot be executed before macro-blocks 42 have been executed. This introduces task dependencies, as the tasks for filtering macro block 44 require the prior execution of filtering of macro blocks 42.
  • Such a task dependency can be illustrated in a graph, for example as illustrated in Fig. 3.
  • the graph of Fig. 3 illustrates several tasks 0/0-4/4, which can be dependent on certain other tasks.
  • a first task 0/0 is independent.
  • the second task 1/0 can only start, when the first task 0/0 has been executed.
  • Each of the new tasks can potentially start the execution of one or two other tasks, for example, after task 1/0, both tasks 2/0 and 0/1 can start.
  • These task dependencies as illustrated in a graph of Fig. 3, can be tracked by storing the number of tasks that each task depends on. For each finished task, this value of task dependencies can be updated. The task can execute, once its value of dependencies becomes zero.
  • a look-ahead task management unit capable of execution of task-dependency checks in parallel with the execution of the tasks.
  • Each task management unit can offload dependency checks and dependency updates from a number of conventional processors and can try to schedule dependent tasks onto these processors.
  • the distribution of tasks between various task management units can be done through a task queue.
  • Such a parallel task dependency check is illustrated in Fig. 4.
  • In Fig. 4, there are illustrated tasks 2, 4, 6, 8, a readiness verifying stage 20, and a task dependency update 46.
  • the twelve tasks 2a-2c, 4a-4c, 6a-6c, 8a-8c are being executed on four different cores 10-16.
  • a look-ahead code is being executed for verifying, whether these tasks provide for readiness of a consecutive task.
  • a candidate task 2b was found with its dependencies fulfilled.
  • This second task 2b can be started immediately, once processor 10 finishes the current execution of task 2a.
  • Task dependency update 46 updates dependencies of tasks, and after a task dependency update was executed, the tasks 4b, 6b, 8b could be executed. However, the task dependency update 46 is much faster than the verifying stage 20, thus allowing tasks 4b, 6b, 8b to be executed a lot closer in time to the finalization of a previous task.
  • the second verifying stage 20 determines that task 4c is ready right after task 4b has been finished. Thus, on the second processor 12, task 4c is started immediately after task 4b is finalized.
  • Task ready structures 24 are created.
  • Task ready structures 24 may comprise a function pointer 24a and an argument list 24b.
  • the function pointer and the argument list can be read, and the processor can execute the new task immediately.
  • the task ready structure 24 may, though not illustrated, comprise also a look-ahead function pointer. Also, an argument pointer may also be comprised.
  • a partially-ready-structure 28 as illustrated in Fig. 5b can be created.
  • the partially-ready-structure 28 may comprise a task pointer 28a, as well as information 28b about task dependencies being not met. This information 28b can be updated in step 46, as illustrated in Fig. 4, upon which a partially-ready-structure may indicate a task being executable.
  • the verification step 20 and the update step 46 can be processed within a task management unit, as illustrated in Fig. 6.
  • the purpose of the task management unit 32 may be to offload the management of tasks from processors 10, 12, 14, 16 in a multi-core architecture as illustrated in Fig. 6. While the tasks are being executed on the processors 10-16, the task management units 32 try to find tasks that are ready to be executed and have them prepared, so that a processor 10-16 can directly start executing a new task when it finishes its current task execution. For each task being executed, the task management unit 32 executes a function that looks ahead in time, in order to try to find tasks that will be ready for execution. When doing so, the task management units 32 assume the currently processed tasks on processors 10-16 being finished. As is illustrated in Fig.
  • a scalable architecture that connects several task management units 32 with a defined number of processors 10-16 allows for processing more look-ahead functions than with a single task management unit 32.
  • Each task management unit 32 offloads the look-ahead control from the processors.
  • tasks that are found to be ready can be stored. This way, the task management units 32 may obtain information about tasks being ready within a task-ready structure 24 from task queue 26. This information allows for the processors 10-16 to execute tasks being found as ready using the task-ready structure.
  • each processor 10-16 may have a dedicated task information 30 list as illustrated in Fig. 7 storing candidate tasks and the information for executing these tasks.
  • This information can be a task pointer 30d, an argument pointer 30e, a look-ahead pointer 30b, a dependency pointer 30c, and a flag 30a. If there is a task ready for execution, the task pointer 30d and the argument pointer 30e can be read by the processor and execution can start. The task management unit 32 can then, in parallel with the execution of the task, decrement the value given by the dependency pointer for all the tasks not being executed. Only the dependencies of candidate tasks not found ready for execution by the look-ahead function of the task management unit 32 need to be updated, thus reducing the number of atomic accesses for updating the information 30.
  • the task management unit 32 may check the state of the task queue, the flag 30a of the information 30 for each core 10-16, and for incoming tasks and messages. If there is an idle processor 10-16 and a task being found ready in the task queue 26, the task can be fetched from the task queue 26, information 30 with a processor 10-16 can be updated, telling the processor 10-16 that the task is ready for execution.
  • a routine may first check for tasks that are ready for execution within information 30. If these tasks are not executed by the processor 10-16 itself, these tasks can be stored in the task queue 26 for execution at a later time. Then, dependency values for tasks not ready to be executed can be decremented. Eventually, a look-ahead pointer 30b and an argument pointer 30e can be read from the task currently being executed by the core and the look-ahead function can be executed by the task management unit 32.
  • a task management unit 32 may comprise, as illustrated in Fig. 8, input means 34 for receiving tasks for execution on at least one processor. Further, there may be provided verifying means 36 for verifying readiness of at least one next task, assuming the currently processed task is finished, parallel to processing the at least one task. The verifying means 36 may have access to information 30 and may read the flags 30a and may update the dependency pointers 30c.
  • preparation means 38 for preparing the task ready structure as illustrated in Fig. 5a.
  • output means 40 for putting out the ready-structure either to the task queue 26 or to the processors 10-16 into information 30.
  • the cores may offload dependency checks to a task management unit. This enhances, for example, video processing.
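
The dependency counting described for the graph of Fig. 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the neighbour rule (each task x/y waits for its left neighbour and its upper-right neighbour) is an assumption chosen so that, as in the text, task 1/0 follows 0/0 and tasks 2/0 and 0/1 follow 1/0. All names are hypothetical.

```python
W = H = 5   # tasks 0/0 .. 4/4, written here as (x, y)

def prerequisites(x, y):
    """Assumed rule: a task waits for (x-1, y) and (x+1, y-1) if they exist."""
    candidates = [(x - 1, y), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < W and 0 <= cy < H]

# Number of tasks that each task still depends on.
pending = {(x, y): len(prerequisites(x, y)) for x in range(W) for y in range(H)}

def finish(task):
    """Update the counters for a finished task; return tasks that reach zero."""
    ready = []
    for t in pending:
        if task in prerequisites(*t):
            pending[t] -= 1
            if pending[t] == 0:
                ready.append(t)
    return sorted(ready)

initially_ready = [t for t, n in pending.items() if n == 0]
after_00 = finish((0, 0))
after_10 = finish((1, 0))
print(initially_ready, after_00, after_10)
```

A task becomes executable exactly when its counter reaches zero, which is the rule stated for Fig. 3.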

Abstract

A method comprising receiving tasks for execution on at least one processor, and processing at least one task within one processor. To decrease the turn-around time of task processing, the method comprises, parallel to processing the at least one task, verifying readiness of at least one next task assuming the currently processed task is finished, preparing a ready-structure for the at least one task verified as ready, and starting the at least one task verified as ready using the ready-structure after the currently processed task is finished.

Description

Look-ahead task management
FIELD OF THE INVENTION
The present application relates to a method comprising receiving tasks for execution on at least one processor, and processing at least one of the tasks within one processor. The application further relates to a task management unit comprising input means for receiving tasks for execution on at least one processor, a microprocessor comprising a storage for storing task information, a system with a task management unit and a microprocessor, as well as a computer program comprising instructions operable to cause a task management unit to receive tasks for execution on at least one processor.
BACKGROUND OF THE INVENTION
The current trend in computer architecture is to use more and more microprocessors, a.k.a. cores, within one chip for processing tasks in parallel to increase application performance. In particular in embedded domain systems, where multi-core solutions are common, the application performance is increased. In order to utilize the increased processing power of multi-core solutions, it is necessary to partition the programs into tasks that can be run in parallel on separate cores.
It is apparent that the more tasks are processed in parallel, the more the overall performance is accelerated. As the number of cores increases in multi-core solutions, it becomes necessary to partition applications into more and more smaller tasks, in order to keep all the cores busy and to accelerate application performance. The creation and distribution of tasks, a.k.a. task scheduling, has commonly been handled by software. However, as tasks become smaller and increase in number, task scheduling performed by software introduces overheads in view of data transfer and processing of the scheduling. This will decrease the efficiency of parallel task processing.
In particular the code for managing task scheduling might become a bottleneck for a huge number of small tasks. The code for managing tasks is generally simple, consisting of arithmetic operations such as addition, subtraction, comparing, branching, and atomic loads and stores. The parallel processing requires checking dependencies of tasks, e.g., whether one task can be started or not depending on other tasks that might be necessary to be executed beforehand. Therefore, dependencies of tasks need to be updated for each finished task, such that other tasks can become ready to be executed. If the dependency check is executed after a task has finished and the dependencies have been updated, the current dependency state is known. This allows for verifying which tasks can be executed. However, the dependency check can introduce delays, since the check is performed before the next task can be executed.
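The conventional scheme described above, in which dependencies are updated and checked only after a task has finished, can be sketched as follows. This is a minimal sketch under assumptions: the four-task graph and the names `dependents`, `pending` and `finish` are hypothetical and do not come from the application.

```python
# Hypothetical dependency table: task name -> tasks that depend on it.
dependents = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Number of unfinished prerequisites per task (A has none).
pending = {"A": 0, "B": 1, "C": 1, "D": 2}

def finish(task):
    """Update dependencies after `task` finishes; return newly ready tasks."""
    ready = []
    for dep in dependents[task]:
        pending[dep] -= 1          # one prerequisite fewer
        if pending[dep] == 0:      # the dependency check
            ready.append(dep)
    return ready

after_a = finish("A")
after_b = finish("B")
after_c = finish("C")
print(after_a, after_b, after_c)
```

In this arrangement the core that finished the task is idle for as long as `finish` runs, which is the delay the paragraph refers to.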
In particular for a plurality of tasks, architectures with task queues are known. In this type of architectures, the execution of a task is followed by a piece of code for updating dependencies and checking for a task ready status or not.
Fig. 1 illustrates a commonly known dependency check with twelve different tasks 2, 4, 6, 8. On a first core 10, tasks 2a-2c are executed. On a second core 12, tasks 4a-4c are executed. On a third core 14, the tasks 6a-6c are executed. And on a fourth core 16, the tasks 8a-8c are executed. Thus, twelve different tasks 2, 4, 6, 8 are executed on four separate cores 10-16. After completion of each task 2-8, a task dependency check 18 is executed.
In Fig. 1, for reasons of simplicity, it is assumed that each task is identical in execution time. As can be seen, the dependency check operation 18 consumes time, during which the cores 10-16 are not operative, i.e. do not process a particular task. For example, for a video decoder under the H.264 standard, it has been found that the dependency check operation 18 increases the overall task execution time by 9% on average. In the embodiment according to Fig. 1, this results in a requirement of one complete core for managing the dependency check for every eleven other cores in the architecture.
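One way to read the "one core for every eleven" figure is the following back-of-the-envelope calculation. It assumes, as a simplification not stated in the text, that every core loses the same 9% to dependency checks; the variable names are illustrative.

```python
overhead_per_task = 0.09   # from the text: checks add 9% on average
worker_cores = 11          # from the text: eleven other cores

# Assumption: each worker core spends an extra 9% of its task time on checks.
lost_core_equivalents = worker_cores * overhead_per_task
print(round(lost_core_equivalents, 2))
```

Eleven cores at 9% overhead each lose about 0.99 core-equivalents in total, i.e. roughly the capacity of one complete core.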
For the reasons set forth above, it is an object of the present application to increase performance of processing of applications that have task dependencies, e.g. in multi-core architectures. It is another object to increase image and video decoding speed by parallel task processing. A further object is to reduce die size by reducing dependency check overhead. Another object is to increase energy efficiency by reducing the number of required processors for parallel processing.
SUMMARY OF THE INVENTION
These and other objects are solved by a method comprising receiving tasks for execution on at least one processor, processing at least one of the tasks within one processor, parallel to processing the at least one task, verifying readiness of at least one next task assuming the currently processed task is finished, preparing a ready-structure for the at least one task verified as ready, and starting the at least one task verified as ready using the ready-structure after the currently processed task is finished. Verifying the readiness of at least one next task, assuming the currently processed task is finished, parallel to processing at least one task allows for immediately starting the execution of the next task upon finishing a currently processed task. While a task is being executed, it may be possible to find out what dependencies will be resolved by the currently executed task by assuming that the currently executed task is finished. This allows for verifying whether a next task is ready or not prior to finishing the processing of the currently processed task. If there are tasks that only depend on the currently executed task, they will be ready for execution once the currently executed task is finished. In order to provide for immediately starting the ready tasks, these could be prepared for execution by a task management unit, such that once the current processor (core) finishes the current execution, the next task can start. Dependencies can be updated in parallel with the execution of the task, thus decreasing task execution time.
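The look-ahead idea above can be sketched as follows. This is a minimal, single-threaded sketch under assumptions: in the application the look-ahead runs in parallel with the task, whereas here the two steps are simply called in sequence, and the graph and function names are hypothetical.

```python
dependents = {"A": ["B", "C"], "B": [], "C": []}
pending = {"A": 0, "B": 1, "C": 2}   # C also waits for some other task

def look_ahead(running):
    """Runs while `running` executes: which dependents would be ready
    if `running` is assumed to be finished already?"""
    return [t for t in dependents[running] if pending[t] == 1]

def on_finish(running, prepared):
    """After `running` ends: prepared tasks start at once, and only the
    counters of dependents that were not found ready are updated."""
    for t in dependents[running]:
        if t not in prepared:
            pending[t] -= 1
    return prepared

prepared = look_ahead("A")           # B depends only on A
started = on_finish("A", prepared)   # B starts without a further check
print(prepared, started, pending["C"])
```

The point of the split is that the expensive search happens during task execution, leaving only a short counter update for the moment a task ends.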
During the execution of the task, it may be possible to find all tasks that depend on the currently executed task. All found tasks may then be marked as candidate tasks to be executed by the processor. According to embodiments, verifying the readiness of at least one next task may comprise checking task dependencies between the at least one received task, and the currently processed task. This allows for checking, as a look ahead technique, whether at least one of the received tasks may be ready for execution, once the currently processed task is finished, in parallel with the actual execution of the task. If the at least one received task, which is not executed yet, only depends on the currently processed task, it can be marked as ready even during execution of the currently processed task. This look-ahead technique provides for reducing the start time of the received tasks after the currently processed task is finished.
According to embodiments, it may be possible to store within a task queue at least one of the ready-structures of tasks and/or the task verified as ready. For example, in architectures that have more than one core, in particular in architectures that are scalable to more than a few cores, several processors may verify the readiness of at least one next task. The results of this verification can be a plurality of tasks in the ready state. This at least one ready task can be stored in the task queues. The task queues provide information about tasks in the ready state which are currently not being executed by a processor. This way, tasks may be distributed between different cores. The distribution of task queues allows for storing information about ready tasks within a scalable architecture.
According to embodiments, the ready-structure may comprise at least one of a function pointer and/or an argument list. The function pointer may point to the first instruction of the task being verified as ready. The argument list may comprise information about arguments for the task to be executed.
According to embodiments, the argument list may be used for data prefetching. By performing data prefetching, the arguments for the task to be executed next may already be fetched while the currently processed task is being processed, allowing the next task to start immediately after the currently processed task is finished.
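A ready-structure with a function pointer and an argument list, together with the prefetching step, might look like the sketch below. The class and function names and the dictionary used as a data store are assumptions made for illustration; a Python callable stands in for the function pointer.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class ReadyStructure:
    function: Callable[..., object]   # stands in for the function pointer
    arguments: Tuple[str, ...]        # stands in for the argument list

def prefetch(ready, memory):
    """Resolve argument names to values while the current task still runs."""
    return tuple(memory[name] for name in ready.arguments)

def start(ready, fetched):
    """Start the prepared task with the already fetched arguments."""
    return ready.function(*fetched)

memory = {"x": 3, "y": 4}                      # hypothetical data store
ready = ReadyStructure(function=lambda a, b: a + b, arguments=("x", "y"))
fetched = prefetch(ready, memory)              # would overlap with the current task
result = start(ready, fetched)                 # runs right after it finishes
print(fetched, result)
```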
It may also be possible that some tasks are not ready, even if the currently processed task is finished. This may be because of further dependencies, e.g. the task is dependent on other tasks than the currently processed task. In order to account for such tasks, a partially-ready-structure for at least one task which is not verified as ready is provided. The partially-ready-structure allows for providing information about task dependencies of tasks which are not ready in the next processing sequence.
According to embodiments, the partially-ready-structure may comprise information about task dependencies being not met. Thus, if dependencies have not been satisfied, the dependencies may be stored in the partially-ready-structure. It may be possible that after the started regular task ends, the unsatisfied dependencies being stored in the partially-ready-structure are checked. This way dependencies already satisfied during the execution of the current tasks will not delay next task creation. The verification of the partially-ready-structure may be possible with a reduced software overhead. According to embodiments, verifying readiness of at least one task within a partially-ready-structure after a currently processed task is finished is possible.
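A partially-ready-structure that stores the dependencies not yet met could be sketched as below. Representing the unmet dependencies as a set of task names is an assumption for illustration; the application only states that information about unmet dependencies is stored and checked after a task ends.

```python
from dataclasses import dataclass, field

@dataclass
class PartiallyReady:
    task: str                                  # stands in for the task pointer
    unmet: set = field(default_factory=set)    # dependencies not yet satisfied

def on_task_finished(finished, candidates):
    """Check the stored unmet dependencies after `finished` ends."""
    ready = []
    for c in candidates:
        c.unmet.discard(finished)   # this dependency is now satisfied
        if not c.unmet:
            ready.append(c.task)
    return ready

candidates = [PartiallyReady("D", {"B", "C"}), PartiallyReady("E", {"C"})]
first = on_task_finished("B", candidates)
second = on_task_finished("C", candidates)
print(first, second)
```

Only the dependencies still listed in `unmet` are examined, which mirrors the reduced overhead mentioned above.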
To keep track of candidate tasks and speed up the turn-around time of executed tasks, a processor may comprise, according to embodiments, a dedicated storage area that holds the necessary information about candidate tasks, i.e. tasks with a partially-ready-structure. Each processor may directly access the information about the tasks to be executed. The dedicated storage may also hold information about ready tasks, i.e. tasks with a ready-structure. It may also be possible.
According to embodiments, the task information may comprise at least one of a task pointer, a look-ahead pointer, a dependency pointer, an argument pointer, or a flag. The task pointer may hold the instruction address of the first instruction of the task. The argument pointer may hold the address where the arguments for the task are stored. The look-ahead pointer may comprise information about a look-ahead function to be executed when the task is executed by the core. This function may allow for calculating and determining which dependencies are resolved when the currently processed task is executed. The dependency pointer may hold the address of a memory location that stores the number of dependencies that still have to be resolved before the task can be executed. The flag may be used for synchronizing the processor with a task management unit. The information about the task stored in the processor allows for speeding up the turn-around time between tasks being executed. The flag may allow for calculating and determining which dependencies are resolved when the currently processed task is executed. The flag may be one bit used for synchronizing between the task management unit and the processor. The flag may also comprise several bits, indicating, for example, the state of a task or the time of processing, i.e. while it is executed. If a task is ready for execution, then the task pointer and argument pointer will be read and the processor can start the execution of the new task. The task management unit can then, in parallel with the execution of the task, decrement the value given by the dependency pointer for all the tasks not being executed. In case there is no ready task when the processor finishes a currently processed task, it can wait until task dependencies are updated and a task becomes ready for execution.
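The task information described above may be pictured with the following sketch. It is an illustration under stated assumptions rather than the layout used by the embodiments: pointers are modelled as Python references, the dependency pointer is modelled as an index into a shared list `dependency_memory`, and the flag is assumed to be a single bit meaning "the task may start". All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskInfo:
    flag: bool                            # assumed meaning: True once the task may start
    look_ahead: Callable[[], List[str]]   # "look-ahead pointer": function run while the task executes
    dependency_index: int                 # "dependency pointer": index into dependency_memory
    task: Callable[..., int]              # "task pointer": entry point of the task
    arguments: Tuple[int, ...]            # "argument pointer": where the arguments are stored


# Shared memory holding the number of unresolved dependencies per task.
dependency_memory: List[int] = [2]


def resolve_one_dependency(info: TaskInfo, memory: List[int]) -> None:
    """Task management unit side: decrement the counter and update the flag."""
    memory[info.dependency_index] -= 1
    info.flag = memory[info.dependency_index] == 0


def try_start(info: TaskInfo) -> Optional[int]:
    """Processor side: read task and argument pointers only if the flag is set."""
    if not info.flag:
        return None
    return info.task(*info.arguments)


info = TaskInfo(flag=False, look_ahead=lambda: [], dependency_index=0,
                task=lambda a, b: a * b, arguments=(6, 7))

first_attempt = try_start(info)                    # two dependencies outstanding
resolve_one_dependency(info, dependency_memory)
second_attempt = try_start(info)                   # one dependency outstanding
resolve_one_dependency(info, dependency_memory)
third_attempt = try_start(info)                    # counter reached zero, task runs
```

Keeping the counter behind an index rather than inside the record reflects the statement that the dependency pointer holds an address, so that the task management unit can update the counter without touching the rest of the entry.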
The speed-up of verifying a ready status may be achieved in that only the dependencies of candidate tasks not found ready for execution by the look-ahead function need to be updated. The look-ahead function may check which tasks may be necessary in the future. If these tasks are dependent on the currently processed task, their dependency can be updated. If tasks are ready, no update is necessary. Therefore, the look-ahead function reduces the number of dependency checks. According to embodiments, dependency information for tasks from the current task may be obtained from the task information.
Another aspect is a task management unit comprising input means for receiving tasks for execution on at least one processor, verifying means arranged for verifying readiness of at least one next task, assuming the currently processed task is finished, parallel to processing the at least one task, preparation means arranged for preparing a ready-structure for the at least one task verified as ready, and output means for putting out the ready-structure after the currently processed task is finished for starting the at least one task verified as ready. A further aspect is a microprocessor comprising a storage for storing task information, where the storage comprises a memory area for storing a task pointer, a storage area for storing an argument pointer, and a storage area for storing a dependency pointer. According to embodiments, access means may be provided for providing access to the storage for storing task information using a task management unit as previously described.
Another aspect is a system with a task management unit and a microprocessor as previously described.
A further aspect is a computer program comprising instructions operable to cause the task management unit to receive tasks for execution on at least one processor, provide the task for processing to at least one processor, parallel to processing the at least one task verify readiness of at least one next task assuming the currently processed task is finished, prepare a ready-structure for the at least one task verified as ready, and start the at least one task verified as ready, using the ready-structure after the currently processed task is finished within the processor.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 illustrates task execution for a conventional architecture;
Fig. 2 an illustration of dependencies between macro-blocks within a video compression standard;
Fig. 3 an illustration of a task dependency graph;
Fig. 4 an illustration of execution of tasks according to embodiments;
Fig. 5a a ready structure for a task;
Fig. 5b a partially-ready structure for a task;
Fig. 6 an illustration of an architecture with several processors and several task management units;
Fig. 7 an illustration of task information;
Fig. 8 a schematic illustration of a task management unit.
DETAILED DESCRIPTION OF THE DRAWINGS
As has been mentioned above in connection with the description of Fig. 1, in multi-processor, a.k.a. multi-core, solutions, a plurality of tasks need to be processed in parallel, which might lead to processor contention and ineffective task processing. In particular in the multimedia domain, the partitioning of an application will commonly introduce dependencies between tasks. The dependencies between tasks force the tasks to be executed in a certain order to meet these dependencies. Such dependencies can be found, for example, in a video decoder such as an H.264 video decoder. In such a video decoder, a large number of tasks with many dependencies needs to be processed, which poses a task management problem. Task dependencies need to be monitored and need to be checked when a task is ready to be executed. The algorithms for dependency checking are often not complex, but they can introduce large overhead. For example, in a super HD H.264 decoder, 9% of the execution time is consumed by checking task dependencies and task management.
When processing tasks in parallel, a distinction needs to be made between tasks that are dependent and tasks that are not dependent. For example, for parallel video decoding with macro-blocks and spatial-temporal motion prediction, parallel tasks introduce dependencies. This kind of application differs from other parallel work loads, such as server work loads with multiple incoming requests, desktop work loads consisting of multiple programs, and scientific work loads, where the tasks are commonly independent of each other and can be executed in any order. However, for applications with inter-task dependencies, the execution order is crucial for correct application behavior. The execution order cannot always be fully determined statically at compile time, because of variations in computational load, task execution time and load balancing. Hence, dynamic task management at run time is necessary, as is introduced by the present embodiments.
One example of task parallelism is video decoding, such as H.264 video decoding. Such decoding is described below by way of example. H.264 video decoding in super HD requires a multi-core architecture to reach the performance necessary for decoding 30 to frames per second. For video decoding, each frame being decoded is first entropy decoded, using either context-adaptive binary arithmetic coding or context-adaptive variable length coding, both of which are sequential by nature. A frame is then passed on to a picture prediction stage, where each frame is divided into macro-blocks, for example of 16 times 16 pixels. For each macro-block, inter-picture prediction and motion vector estimation are calculated. The frame is then filtered through a deblocking filter to reduce artifacts from the picture prediction stage at block boundaries. The resulting frame has then been decoded and can be passed on to the display. The picture prediction and the deblocking filter are suitable for parallelization, where the execution of a macro-block can be treated as a task. Such execution is illustrated in Fig. 2. As can be seen, there are several macro-blocks 42 at the boundaries of a macro-block 44. In order to process picture prediction and deblocking of macro-block 44, it is necessary that macro-blocks 42 are executed before macro-block 44 is filtered. Consequently, macro-block 44 cannot be executed before macro-blocks 42 have been executed. This introduces task dependencies, as the task for filtering macro-block 44 requires the prior execution of the filtering of macro-blocks 42.
Such a task dependency can be illustrated in a graph, for example as illustrated in Fig. 3. The graph of Fig. 3 illustrates several tasks 0/0-4/4, which can be dependent on certain other tasks. As can be seen in Fig. 3, a first task 0/0 is independent. However, the second task 1/0 can only start when the first task 0/0 has been executed. Each of the new tasks can potentially start the execution of one or two other tasks; for example, after task 1/0, both tasks 2/0 and 0/1 can start. These task dependencies, as illustrated in the graph of Fig. 3, can be tracked by storing the number of tasks that each task depends on. For each finished task, this value of task dependencies can be updated. A task can execute once its value of dependencies becomes zero.
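The counting scheme can be made concrete with a short sketch. The exact edges of the graph of Fig. 3 are not reproduced in the text, so the sketch assumes a rule that is consistent with the examples given: a task x/y depends on its left neighbour (x-1)/y and on the task (x+1)/(y-1). That rule, the grid size constant `SIZE` and all function names are assumptions of the example.

```python
from typing import Dict, List, Tuple

Task = Tuple[int, int]   # (x, y), written x/y in the text
SIZE = 5                 # tasks 0/0 to 4/4


def in_grid(task: Task) -> bool:
    x, y = task
    return 0 <= x < SIZE and 0 <= y < SIZE


def predecessors(task: Task) -> List[Task]:
    """Assumed rule: x/y waits for (x-1)/y and (x+1)/(y-1)."""
    x, y = task
    return [t for t in [(x - 1, y), (x + 1, y - 1)] if in_grid(t)]


def successors(task: Task) -> List[Task]:
    """Inverse of the assumed rule: x/y can release (x+1)/y and (x-1)/(y+1)."""
    x, y = task
    return [t for t in [(x + 1, y), (x - 1, y + 1)] if in_grid(t)]


# Number of tasks that each task still depends on.
counts: Dict[Task, int] = {
    (x, y): len(predecessors((x, y))) for x in range(SIZE) for y in range(SIZE)
}


def finish(task: Task, counts: Dict[Task, int]) -> List[Task]:
    """Update the counters for a finished task and return the tasks that became ready."""
    newly_ready = []
    for nxt in successors(task):
        counts[nxt] -= 1
        if counts[nxt] == 0:
            newly_ready.append(nxt)
    return newly_ready


initially_ready = [task for task, count in counts.items() if count == 0]
after_0_0 = finish((0, 0), counts)   # releases task 1/0
after_1_0 = finish((1, 0), counts)   # releases tasks 2/0 and 0/1
```

Under the assumed rule, only task 0/0 starts with a counter of zero, finishing it releases task 1/0, and finishing task 1/0 releases tasks 2/0 and 0/1, as in the example above.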
In order to provide parallelism, there is provided a look-ahead task management unit, capable of executing task-dependency checks in parallel with the execution of the tasks. Each task management unit can offload dependency checks and dependency updates from a number of conventional processors and can try to schedule dependent tasks onto these processors. The distribution of tasks between various task management units can be done through a task queue. By executing the task-dependency checks in parallel with the conventional processing of the tasks, a total execution time speed-up of 4.5% for a multi-processor architecture for video decoding can be achieved.
Such a parallel task dependency check is illustrated in Fig. 4. In Fig. 4, there are illustrated tasks 2, 4, 6, 8, a readiness verifying stage 20, and a task dependency update 46. The twelve tasks 2a-2c, 4a-4c, 6a-6c, 8a-8c are being executed on four different cores 10-16. For each task 2, 4, 6, 8, within the verifying stage 20, in parallel to processing the tasks, a look-ahead code is being executed for verifying whether these tasks provide for readiness of a consecutive task. In the illustrated example, in the verifying stage 20, for the first task 2a, executed on processor 10, a candidate task 2b was found with its dependencies fulfilled. This second task 2b can be started immediately once processor 10 finishes the current execution of task 2a. Task dependency update 46 updates dependencies of tasks, and after a task dependency update was executed, the tasks 4b, 6b, 8b could be executed. However, the task dependency update 46 is much faster than the verifying stage 20, thus allowing tasks 4b, 6b, 8b to be executed a lot closer in time to the finalization of a previous task.
Further, the second verifying stage 20 determines that task 4c is ready right after task 4b has been finished. Thus, on the second processor 12, task 4c is started immediately after task 4b is finalized.
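The verifying stage 20 can be approximated in software as a check that runs concurrently with the task and asks which candidates would be ready if the current task were already finished. The following sketch uses a thread pool to stand in for a core and a task management unit working in parallel. The dependency counts, the `dependents` table and the function names are assumptions made for the example; only the task names 2a, 2b and 2c come from Fig. 4.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Assumed state: unresolved dependencies per candidate task.
remaining: Dict[str, int] = {"2b": 1, "2c": 2}
# Assumed graph: tasks that depend on a given task.
dependents: Dict[str, List[str]] = {"2a": ["2b", "2c"]}


def look_ahead(current: str, remaining: Dict[str, int],
               dependents: Dict[str, List[str]]) -> List[str]:
    """Return the tasks that become ready if `current` is assumed to be finished.

    The counters are only read, not modified, because the task is still running.
    """
    return [task for task in dependents.get(current, []) if remaining[task] == 1]


def run_task(name: str) -> str:
    """Stand-in for the actual work done by a core."""
    return f"{name} done"


with ThreadPoolExecutor(max_workers=2) as pool:
    running = pool.submit(run_task, "2a")                                # the core
    verifying = pool.submit(look_ahead, "2a", remaining, dependents)     # the look-ahead check
    outcome = running.result()
    ready_next = verifying.result()
```

Because the check finishes while the task is still in flight, the candidate found ready can be handed to the core as soon as the task ends, without a separate dependency check on the critical path.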
In the verifying stage 20, task ready structures 24, as illustrated in Fig. 5a, are created. Task ready structures 24 may comprise a function pointer 24a and an argument list 24b. The function pointer and the argument list can be read, and the processor can execute the new task immediately. The task ready structure 24 may, though not illustrated, also comprise a look-ahead function pointer. An argument pointer may also be comprised.
During the verifying stage 20, tasks may also be found to be partially ready. For these tasks, a partially-ready-structure 28, as illustrated in Fig. 5b, can be created. The partially-ready-structure 28 may comprise a task pointer 28a, as well as information 28b about task dependencies not being met. This information 28b can be updated in step 46, as illustrated in Fig. 4, upon which a partially-ready-structure may indicate that a task is executable.
The verification step 20 and the update step 46 can be processed within a task management unit, as illustrated in Fig. 6. The purpose of the task management unit 32 may be to offload the management of tasks from processors 10, 12, 14, 16 in a multi-core architecture as illustrated in Fig. 6. While the tasks are being executed on the processors 10-16, the task management units 32 try to find tasks that are ready to be executed and have them prepared, so that a processor 10-16 can directly start executing a new task when it finishes its current task execution. For each task being executed, the task management unit 32 executes a function that looks ahead in time, in order to try to find tasks that will be ready for execution. When doing so, the task management units 32 assume that the currently processed tasks on processors 10-16 are finished. As is illustrated in Fig. 6, a scalable architecture that connects several task management units 32 with a defined number of processors 10-16 allows for processing more look-ahead functions than with a single task management unit 32. Each task management unit 32 offloads the look-ahead control from the processors. Within a task queue 26, tasks that are found to be ready can be stored. This way, the task management units 32 may obtain information about tasks being ready within a task-ready structure 24 from task queue 26. This information allows the processors 10-16 to execute tasks found to be ready using the task-ready structure. In order to decrease the turn-around time between executed tasks, each processor 10-16 may have a dedicated task information 30 list, as illustrated in Fig. 7, storing candidate tasks and the information for executing these tasks. This information can be a task pointer 30d, an argument pointer 30e, a look-ahead pointer 30b, a dependency pointer 30c, and a flag 30a.
If there is a task ready for execution, the task pointer 30d and the argument pointer 30e can be read by the processor and execution can start. The task management unit 32 can then, in parallel with the execution of the task, decrement the value given by the dependency pointer for all the tasks not being executed. Only the dependencies of candidate tasks not found ready for execution by the look-ahead function of the task management unit 32 need to be updated, thus reducing the number of atomic accesses for updating the information 30. The task management unit 32 may check the state of the task queue, the flag 30a of the information 30 for each core 10-16, and incoming tasks and messages. If there is an idle processor 10-16 and a task found ready in the task queue 26, the task can be fetched from the task queue 26 and the information 30 of that processor 10-16 can be updated, telling the processor 10-16 that the task is ready for execution. When a processor 10-16 finishes the execution of the task, a routine may first check the information 30 for tasks that are ready for execution. If these tasks are not executed by the processor 10-16 itself, they can be stored in the task queue 26 for execution at a later time. Then, dependency values for tasks not ready to be executed can be decremented. Finally, a look-ahead pointer 30b and an argument pointer 30e can be read from the task currently being executed by the core and the look-ahead function can be executed by the task management unit 32.
In order to perform the look-ahead function, a task management unit 32 may comprise, as illustrated in Fig. 8, input means 34 for receiving tasks for execution on at least one processor. Further, there may be provided verifying means 36 for verifying readiness of at least one next task, assuming the currently processed task is finished, parallel to processing the at least one task. The verifying means 36 may have access to information 30, may read the flags 30a and may update the values referenced by the dependency pointers 30c.
Further, there may be provided preparation means 38 for preparing the task ready structure as illustrated in Fig. 5a. Finally, there may be provided output means 40 for putting out the ready-structure either to the task queue 26 or to the processors 10-16 into information 30.
By providing the parallel dependency checks, the execution time of parallel tasks may be significantly decreased. The cores may offload dependency checks to a task management unit. This enhances, for example, video processing.
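The routine that runs when a processor finishes a task may be sketched as below. This is a simplified reading of the sequence described above, not the embodiment itself: it assumes that every candidate entry not yet marked ready depends on the task that has just finished, and the names `Entry` and `on_task_finished` as well as the task names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Entry:
    """Hypothetical candidate entry in the per-processor task information."""
    name: str
    remaining: int        # value behind the dependency pointer
    ready: bool = False   # set when look-ahead has already verified the task


def on_task_finished(entries: List[Entry], queue: Deque[str]) -> Optional[str]:
    """Pick the next task for this processor and queue any surplus ready tasks."""
    ready = [entry for entry in entries if entry.ready]
    pending = [entry for entry in entries if not entry.ready]

    # Step 1: tasks already verified by look-ahead can start at once;
    # those not run by this processor go to the task queue.
    next_task = ready[0].name if ready else None
    queue.extend(entry.name for entry in ready[1:])

    # Step 2: only entries not found ready need their dependency value decremented.
    for entry in pending:
        entry.remaining -= 1
        if entry.remaining == 0:
            entry.ready = True
            if next_task is None:
                next_task = entry.name
            else:
                queue.append(entry.name)
    return next_task


task_queue: Deque[str] = deque()
entries = [Entry("t1", 0, ready=True), Entry("t2", 0, ready=True),
           Entry("t3", 1), Entry("t4", 2)]
chosen = on_task_finished(entries, task_queue)
```

In the example, `t1` is started directly, `t2` and the newly released `t3` are placed in the queue for other processors, and `t4` keeps one unresolved dependency.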

Claims

1. Method comprising: receiving tasks (2, 4, 6, 8) for execution on at least one processor (10, 12, 14, 16), processing at least one of the tasks (2, 4, 6, 8) within one processor (10, 12, 14, 16), parallel to processing the at least one task (2, 4, 6, 8), verifying (20) readiness of at least one next task assuming the currently processed task (2a, 4a, 6a, 8a) is finished, preparing a ready-structure (24) for the at least one task (2b) verified as ready, and starting the at least one task (2b) verified as ready using the ready-structure (24) after the currently processed task (2a) is finished.
2. The method of claim 1, wherein verifying the readiness (20) of the at least one next task (2b) comprises checking task dependencies between the at least one received task (2b, 4b, 6b, 8b) and the currently processed task (2a).
3. The method of claim 1, further comprising storing within a task queue (26) at least one of
A) the ready-structures of tasks (24),
B) the tasks (2b) verified as ready.
4. The method of claim 1, wherein the ready-structure (24) comprises at least one of:
A) a function pointer (24a);
B) an argument list (24b).
5. The method of claim 4, wherein the ready-structure (24) comprises at least the argument list (24b) for data prefetching.
6. The method of claim 1, further comprising preparing a partially-ready-structure (28) for at least one task (2c) which is not verified as ready.
7. The method of claim 6, wherein the partially-ready-structure (28) comprises information about task dependencies being not met.
8. The method of claim 6, further comprising verifying readiness of at least one task within the partially-ready-structure (28) after a currently processed task is finished.
9. The method of claim 1, wherein verifying readiness of at least one task within a partially-ready-structure (28) comprises checking task dependencies being marked within the partially-ready-structure (28).
10. The method of claim 1, further comprising storing within at least one processor (10, 12, 14, 16) task information (30) about tasks (2, 4, 6, 8) to be executed.
11. The method of claim 10, wherein the task information (30) comprises at least one of
A) a task pointer (30a),
B) a look-ahead pointer (30b),
C) a dependency pointer (30c),
D) an argument pointer (30d),
E) a flag (30f).
12. The method of claim 10, further comprising obtaining dependency information for tasks from the current task (2a) from the task information (30).
13. Task management unit (32) comprising: input means (34) for receiving tasks for execution on at least one processor, verifying means (36) arranged for verifying readiness of at least one next task assuming the currently processed task is finished parallel to processing the at least one task, preparation means (38) arranged for preparing a ready-structure for the at least one task verified as ready, and output means (40) for putting out the ready-structure after the currently processed task is finished for starting the at least one task verified as ready.
14. A microprocessor (10) comprising: a storage (30) for storing task information (30a), wherein the storage comprises: a memory area for storing a task pointer (30c), a memory area for storing an argument pointer, and a memory area for storing a dependency pointer (30c).
15. The microprocessor of claim 14, further comprising access means for providing access to the storage for storing task information using a task management unit of claim 13.
16. A system with a task management unit of claim 13 and a microprocessor of claim 14.
17. A computer program comprising instructions operable to cause a task management unit to: receive tasks for execution on at least one processor, provide the task for processing to at least one processor, parallel to processing the at least one task, verify readiness of at least one next task assuming the currently processed task is finished, prepare a ready-structure for the at least one task verified as ready, and start the at least one task verified as ready using the ready-structure after the currently processed task is finished within the processor.
PCT/IB2009/051035 2008-03-12 2009-03-12 Look-ahead task management WO2009113034A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/921,573 US20110004881A1 (en) 2008-03-12 2009-03-12 Look-ahead task management

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP08102525 2008-03-12
EP08102525.6 2008-03-12

Publications (1)

Publication Number Publication Date
WO2009113034A1 true WO2009113034A1 (en) 2009-09-17

Family

ID=40673894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2009/051035 WO2009113034A1 (en) 2008-03-12 2009-03-12 Look-ahead task management

Country Status (2)

Country Link
US (1) US20110004881A1 (en)
WO (1) WO2009113034A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101689736B1 (en) * 2010-08-18 2016-12-27 삼성전자주식회사 Work processing unit having a function of work scheduling, control unit for scheduling activation and work scheduling method over the symetric multi-processing environment
US8949853B2 (en) * 2011-08-04 2015-02-03 Microsoft Corporation Using stages to handle dependencies in parallel tasks
US9158610B2 (en) 2011-08-04 2015-10-13 Microsoft Technology Licensing, Llc. Fault tolerance for tasks using stages to manage dependencies
WO2013147878A1 (en) * 2012-03-30 2013-10-03 Intel Corporation Prediction-based thread selection in a multithreading processor
US9952898B2 (en) * 2013-03-15 2018-04-24 Tact.Ai Technologies, Inc. Dynamic construction and management of task pipelines
CA2950180C (en) 2014-05-28 2020-04-21 Arris Enterprises Llc Acceleration of context adaptive binary arithmetic coding (cabac) in video codecs
WO2015184067A1 (en) 2014-05-28 2015-12-03 Arris Enterprises, Inc. Content aware scheduling in a hevc decoder operating on a multi-core processor platform
US9678790B2 (en) 2015-01-07 2017-06-13 Qualcomm Incorporated Devices and methods implementing operations for selective enforcement of task dependencies
US11182207B2 (en) * 2019-06-24 2021-11-23 Nvidia Corporation Pre-fetching task descriptors of dependent tasks
JP7326969B2 (en) * 2019-07-30 2023-08-16 富士通株式会社 Information processing device, storage system and scheduling program
CN111243587A (en) 2020-01-08 2020-06-05 北京松果电子有限公司 Voice interaction method, device, equipment and storage medium
US11416176B2 (en) * 2020-06-23 2022-08-16 Western Digital Technologies, Inc. Function processing using storage controllers for load sharing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0274339A2 (en) * 1986-10-29 1988-07-13 United Technologies Corporation Event driven executive
US5809325A (en) * 1994-08-19 1998-09-15 Intel Corporation Circuit and method for scheduling instructions by predicting future availability of resources required for execution

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5774739A (en) * 1996-09-20 1998-06-30 Bay Networks, Inc. Using a lockup processor to search a table of keys whose entries contain instruction pointer values of code to execute if key is found
US7376693B2 (en) * 2002-02-08 2008-05-20 Jp Morgan Chase & Company System architecture for distributed computing and method of using the system
US7600221B1 (en) * 2003-10-06 2009-10-06 Sun Microsystems, Inc. Methods and apparatus of an architecture supporting execution of instructions in parallel
EP1731998A1 (en) * 2004-03-29 2006-12-13 Kyoto University Data processing device, data processing program, and recording medium containing the data processing program
US7898679B2 (en) * 2005-05-27 2011-03-01 Computer Associates Think, Inc. Method and system for scheduling jobs in a computer system
JP4923240B2 (en) * 2006-01-17 2012-04-25 国立大学法人東京工業大学 Program processing apparatus, parallel processing program, program processing method, parallel processing compiler, recording medium storing parallel processing compiler, and multiprocessor system
US7380104B2 (en) * 2006-04-25 2008-05-27 International Business Machines Corporation Method and apparatus for back to back issue of dependent instructions in an out of order issue queue
JP5101128B2 (en) * 2007-02-21 2012-12-19 株式会社東芝 Memory management system
US7948500B2 (en) * 2007-06-07 2011-05-24 Nvidia Corporation Extrapolation of nonresident mipmap data using resident mipmap data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NOGUERA J ET AL: "Multitasking on reconfigurable architectures: microarchitecture support and dynamic scheduling", ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, ACM, NEW YORK, NY, US, vol. 3, no. 2, 1 May 2004 (2004-05-01), pages 385 - 406, XP002398662, ISSN: 1539-9087 *
STAVROU K ET AL: "Chip multiprocessor based on data-driven multithreading model", INTERNATIONAL JOURNAL OF HIGH PERFORMANCE SYSTEMS ARCHITECTURE INDERSCIENCE ENTERPRISES LTD. SWITZERLAND, vol. 1, no. 1, 2007, pages 24 - 43, XP002531675, ISSN: 1751-6528 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103069797A (en) * 2010-08-26 2013-04-24 飞思卡尔半导体公司 Video processing system and method for parallel processing of video data
WO2015173533A1 (en) * 2014-05-11 2015-11-19 Safetty Systems Ltd. A monitoring unit as well as method for predicting abnormal operation of time-triggered computer systems
US9830211B2 (en) 2014-05-11 2017-11-28 Safetty Systems Ltd Framework as well as method for developing time-triggered computer systems with multiple system modes

Also Published As

Publication number Publication date
US20110004881A1 (en) 2011-01-06

Similar Documents

Publication Publication Date Title
US20110004881A1 (en) Look-ahead task management
US8407454B2 (en) Processing long-latency instructions in a pipelined processor
EP0813145B1 (en) Pipelined instruction dispatch unit in a superscalar processor
US8458707B2 (en) Task switching based on a shared memory condition associated with a data request and detecting lock line reservation lost events
KR102616212B1 (en) Data drive scheduler on multiple computing cores
US10268519B2 (en) Scheduling method and processing device for thread groups execution in a computing system
US20140143524A1 (en) Information processing apparatus, information processing apparatus control method, and a computer-readable storage medium storing a control program for controlling an information processing apparatus
JP2012108576A (en) Multi-core processor, process execution method, and program
US20170262291A1 (en) Heterogeneous computing system with a shared computing unit and separate memory controls
US20200201678A1 (en) Multi-Phased and Multi-Threaded Program Execution Based on SIMD Ratio
JP4756553B2 (en) Distributed processing method, operating system, and multiprocessor system
US20240086359A1 (en) Dynamic allocation of arithmetic logic units for vectorized operations
Själander et al. A look-ahead task management unit for embedded multi-core architectures
CN114153500A (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
CN114168202B (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
US20210049014A1 (en) Multi-thread processing
US20210042123A1 (en) Reducing Operations of Sum-Of-Multiply-Accumulate (SOMAC) Instructions
US20230214325A1 (en) Register File Prefetch
KR101420592B1 (en) Computer system
KR100837400B1 (en) Method and apparatus for processing according to multi-threading/out-of-order merged scheme
KR20230124598A (en) Compressed Command Packets for High Throughput and Low Overhead Kernel Initiation
US9015720B2 (en) Efficient state transition among multiple programs on multi-threaded processors by executing cache priming program
US11809219B2 (en) System implementing multi-threaded applications
US20230205680A1 (en) Emulating performance of prior generation platforms
US11416261B2 (en) Group load register of a graph streaming processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09719715

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 12921573

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09719715

Country of ref document: EP

Kind code of ref document: A1