WO2022261867A1 - A task scheduling method and device - Google Patents

A task scheduling method and device

Info

Publication number
WO2022261867A1
WO2022261867A1 (application PCT/CN2021/100415)
Authority
WO
WIPO (PCT)
Prior art keywords
task
barrier
graph
information table
trigger condition
Prior art date
Application number
PCT/CN2021/100415
Other languages
English (en)
French (fr)
Inventor
张森
赵庆贺
杨意
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to CN202180097744.8A (CN117222980A)
Priority to PCT/CN2021/100415 (WO2022261867A1)
Publication of WO2022261867A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a task scheduling method and device.
  • computing devices generally adopt a multi-core heterogeneous system architecture, which includes a multi-core central processing unit (CPU) and an accelerator.
  • the multi-core CPU is used to perform general computing tasks.
  • the accelerator is used to perform specialized computing tasks.
  • the scheduling software can determine the dependencies between multiple tasks according to the input and output, and schedule the ready tasks to be executed on the CPU cores or accelerators that can execute the tasks.
  • each time the task graph is executed, the CPU must re-initialize the dependency relationships corresponding to the task graph into the scheduling device, after which the scheduling device maintains the dependencies among the multiple tasks in the task graph to ensure that the computation proceeds properly. As a result, the scheduling device must re-analyze the dependencies of the task graph before scheduling it, the initialization time is long, and the interaction between the CPU and the scheduling device is redundant, which reduces computing efficiency.
  • Embodiments of the present application provide a task scheduling method and device, which can save time for loading dependencies into the task scheduling device and improve computing efficiency.
  • in a first aspect, a task scheduling device is provided. The task scheduling device includes one or more task graph templates, and each task graph template indicates the dependency relationships among the multiple tasks included in the template and the processing method of each task. The task scheduling device is configured to: obtain task information of a first task graph, where the task information includes the input data of the first task graph and the identifier of the task graph template corresponding to the first task graph; determine, based on that identifier, the task graph template corresponding to the first task graph among the one or more task graph templates included in the device; and schedule the first task graph based on the input data of the first task graph and the task graph template corresponding to it.
  • multiple task graphs with the same processing methods and the same dependency relationships share the same task graph template.
  • task graph 1 is used to calculate (1+2)*(4-3)
  • task graph 2 is used to calculate (5+6)*(8-7)
  • the dependencies among the multiple tasks in task graph 1 are the same as the dependencies among the multiple tasks in task graph 2, and the calculation methods of the tasks in the two graphs are also the same. Therefore, task graph 1 and task graph 2 share the same task graph template, (a+b)*(c-d).
  • the difference between task graph 1 and task graph 2 is that the input data of the two task graphs are different.
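To make the template/input split concrete, the (a+b)*(c-d) example above can be sketched as follows. The code is an illustrative analogy only, not the patent's implementation; the function names are invented for the sketch.

```python
# Illustrative sketch: the static part of a task graph template (task
# structure and dependencies) is created once; only the input data
# changes between task graph 1 and task graph 2.

def make_template():
    def run(a, b, c, d):
        t1 = a + b      # task 1: add
        t2 = c - d      # task 2: subtract (independent of task 1)
        return t1 * t2  # task 3: multiply, depends on tasks 1 and 2
    return run

template = make_template()           # created once
print(template(1, 2, 4, 3))          # task graph 1: (1+2)*(4-3) = 3
print(template(5, 6, 8, 7))          # task graph 2: (5+6)*(8-7) = 11
```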
  • multiple tasks included in the task graph template can be executed serially or in parallel.
  • some tasks may be executed serially during execution, and some tasks may be executed in parallel during execution.
  • because the task scheduling device supports static task graph templates built into the device, when multiple task graphs with the same dependencies and processing methods as a template are subsequently executed, there is no need to initialize those dependencies and processing methods into the device again; only the dynamic input data and the identifier of the template to be used need to be obtained, which saves the time of loading dependencies into the task scheduling device. That is to say, by creating a task graph template once, the task scheduling device provided by the embodiments of this application can repeatedly execute task graphs whose processing methods and dependencies match the template, without reloading the static processing methods and dependencies into the device each time, thereby saving loading time and improving computing efficiency.
  • in a possible design, the above task scheduling device is further configured to acquire one or more task graph templates, where each task graph template includes a task information table, a first synchronization information table, and a second synchronization information table. The task information table includes multiple task identifiers and the processing method corresponding to each task identifier. The first synchronization information table includes multiple events and the one or more barrier identifiers corresponding to each event; the events correspond one-to-one with the tasks, and each event indicates the completion of its corresponding task. The second synchronization information table includes multiple barriers, the one or more trigger conditions corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding trigger condition.
  • the task graph template acquired by the task scheduling device may be sent by the CPU to the task scheduling device, or may be preset in the task scheduling device, which is not limited in this embodiment of the present application.
  • each task graph template can thus be described by three tables: the task information table, the first synchronization information table, and the second synchronization information table.
  • the task scheduling device obtains a task graph template by obtaining the task information table, first synchronization information table, and second synchronization information table corresponding to that template, so that when it subsequently schedules a task graph it can do so based on these three tables, ensuring that the computation of the multiple tasks in the template proceeds normally.
  • when the task scheduling device subsequently schedules task graphs whose processing methods and dependencies match a task graph template, it can schedule directly based on the template without loading the processing methods and dependencies into the device again, saving that loading time and improving computing efficiency.
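As a concrete, purely hypothetical illustration, the three tables for the (a+b)*(c-d) template might be encoded like this; all identifiers and the layout are invented for the sketch, not the patent's actual format.

```python
# Hypothetical encoding of the three tables for (a + b) * (c - d).

task_info = {        # task ID -> processing method
    1: "add",        # t1 = a + b
    2: "sub",        # t2 = c - d
    3: "mul",        # t3 = t1 * t2
}

first_sync = {       # event (one per task) -> barrier IDs it updates
    "e1": ["b1"],    # task 1 finished -> update barrier b1
    "e2": ["b1"],    # task 2 finished -> update barrier b1
    "e3": [],        # task 3 finished -> whole graph done
}

second_sync = {      # barrier -> list of (trigger value, ready task IDs)
    "b1": [(2, [3])],  # when b1's counter reaches 2, task 3 becomes ready
}
```

Note that tasks 1 and 2 both feed the same barrier b1, which is exactly the barrier-reuse situation discussed later in the text.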
  • in a possible design, the above task scheduling device includes a coupled first interface, task graph control circuit, task state machine, and second interface. The task graph control circuit is used to acquire the task graph templates and the task information of the first task graph. The task state machine is configured to, when determining based on the second synchronization information table that the value of a first barrier satisfies its corresponding first trigger condition, obtain from the task graph control circuit the first task corresponding to a first task identifier, according to that identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and send the first task to the computing unit through the second interface. The first task identifier is the identifier of the task to be executed when the value of the first barrier satisfies the first trigger condition.
  • for example, the task graph control circuit sends a first task trigger signal to the task state machine; the first task trigger signal instructs the task state machine to query the second synchronization information table to determine whether the value of b1 satisfies the trigger condition corresponding to the first task.
  • if the task state machine determines that the value of b1 satisfies the trigger condition corresponding to the first task, the task state machine obtains the task content of the first task from the task graph control circuit according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sends the task content of the first task to the computing unit through the second interface.
  • alternatively, the task graph control circuit may send a first task execution signal to the task state machine, and the task state machine then obtains the task content of the first task from the task graph control circuit according to that signal and sends it to the computing unit through the second interface.
  • the task state machine may query the second synchronization information table when the value of the barrier is updated to determine whether the value of the barrier meets its corresponding trigger condition.
  • in short, the task state machine is used to determine, based on the second synchronization information table, whether the value of each barrier satisfies its corresponding trigger condition; when it does, the task state machine determines the identifier of the task to be executed, obtains the task content of that task from the task graph control circuit, and sends the task to the computing unit.
  • based on this design, the task scheduling device can schedule each task in a task graph according to the corresponding task graph template, obtaining the processing method from the template already in the device; there is no need to reload the task processing methods into the device, so the time of loading static processing methods is saved and computing efficiency is improved.
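The state machine's trigger check described above can be sketched as a small function. Representing a trigger condition as a required counter value paired with the IDs of the tasks it releases is an assumption for illustration only.

```python
# Hypothetical sketch of the task state machine's readiness check: a
# barrier's trigger conditions are modeled as (required counter value,
# task IDs) pairs taken from the second synchronization information table.

def ready_tasks(barrier_value, triggers):
    """Return the IDs of tasks whose trigger condition is now met."""
    ready = []
    for required, task_ids in triggers:
        if barrier_value == required:
            ready.extend(task_ids)
    return ready

# Barrier b1 must reach 2 (both parents done) before task 3 may run.
print(ready_tasks(1, [(2, [3])]))  # [] - not ready yet
print(ready_tasks(2, [(2, [3])]))  # [3] - dispatch task 3
```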
  • in a possible design, when there are multiple first tasks, the computing unit executes the multiple first tasks in parallel.
  • the above-mentioned task scheduling device further includes a coupled event parsing circuit and a synchronous counting circuit
  • the synchronous counting circuit includes a plurality of counters, and each counter corresponds to a barrier
  • the event parsing circuit is configured to receive, through the first interface, a first event when the execution of the first task is completed, determine the identifier of the second barrier corresponding to the first event based on the first synchronization information table, and notify the synchronous counting circuit to modify the value of the counter corresponding to the second barrier;
  • the first event is used to indicate that the execution of the first task is completed;
  • the synchronous counting circuit is used to modify the value of the counter corresponding to the second barrier.
  • for example, the value of the counter corresponding to the second barrier may be increased by one, decreased by one, or increased or decreased by some other value.
  • whether to increase (for example, add one) or decrease (for example, subtract one) the value of the counter is related to the initial value of the counter.
  • the first interface receives the first event and forwards it to the event parsing circuit.
  • the event parsing circuit receives the first event, queries the first synchronization information table, determines the identifier of the second barrier corresponding to the first event, and notifies the synchronous counting circuit to modify the value of the counter corresponding to the second barrier. After the synchronous counting circuit modifies the value of the counter corresponding to the second barrier, it notifies the task state machine of the identity of the second barrier.
  • the task state machine judges, based on the second synchronization information table, whether the value of the second barrier satisfies its corresponding trigger condition; when it does, the task state machine obtains the next task to be executed from the task graph control circuit and sends that task to the computing unit. This continues until all tasks in the first task graph have been executed.
  • based on this solution, the first synchronization information table and the second synchronization information table allow the dependencies among the multiple tasks in a task graph to be maintained correctly, ensuring normal execution of each task.
  • moreover, this solution does not need to load the dependencies into the task scheduling device again; it can rely directly on the first and second synchronization information tables already in the device, saving the time of loading static dependencies and improving computing efficiency.
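Putting the pieces together, the event-to-dispatch path (event parsing circuit, synchronous counting circuit, task state machine) might behave as in this hedged sketch. Counting upward toward a threshold is just one of the counting directions the text allows, and all names are illustrative.

```python
# End-to-end sketch (hypothetical): a completion event bumps the barrier
# counter via the first synchronization table, then the second table is
# consulted to see whether a new task just became ready.

counters = {"b1": 0}                       # synchronous counting circuit
first_sync = {"e1": ["b1"], "e2": ["b1"]}  # event -> barriers to update
second_sync = {"b1": [(2, [3])]}           # barrier -> (threshold, tasks)

def on_event(event):
    newly_ready = []
    for barrier in first_sync[event]:      # event parsing circuit
        counters[barrier] += 1             # counter update
        for required, tasks in second_sync[barrier]:  # task state machine
            if counters[barrier] == required:
                newly_ready.extend(tasks)
    return newly_ready

print(on_event("e1"))  # [] - only task 1 done, b1 == 1
print(on_event("e2"))  # [3] - task 2 done, b1 == 2, task 3 is ready
```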
  • in a possible design, the above task graph control circuit is further configured to modify or delete a task graph template.
  • based on this design, the task scheduling device can modify, delete, and add to the multiple task graph templates stored in it, making the templates more flexible and applicable to more scenarios.
  • the task graph template includes a first task and a second task, and the first task and the second task reuse the same barrier.
  • that multiple tasks reuse the same barrier means that the triggering of those tasks can depend on the same barrier; that is, in the second synchronization information table, when multiple tasks reuse the same barrier and the value of the barrier satisfies one or more trigger conditions, the corresponding tasks to be executed are those multiple tasks.
  • based on this design, multiple tasks in a task graph template can reuse the same barrier. Since the value of a barrier can be maintained by a counter, when multiple tasks reuse the same barrier the number of counters can be reduced, thereby reducing the chip area.
  • in a possible design, the first task and the second task meet at least one of the following conditions: neither the first task nor the second task has a parent node; or the first task and the second task have the same parent node; or the first task is the only parent node of the second task; or the root nodes of the first task and the second task share the same barrier and the first task is the only parent node of the second task.
  • when multiple tasks in a task graph template meet any one of the above four conditions, those tasks can reuse the same barrier.
  • the situations in which multiple tasks may reuse the same barrier are not limited to the above four; in general, whether multiple tasks can reuse the same barrier can be determined from the dependencies among the tasks in the task graph template.
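The first three reuse conditions can be checked mechanically from a parent map. This sketch is illustrative only; the fourth condition, which involves the root nodes' shared barrier, needs barrier assignments as input and is omitted here.

```python
# Hypothetical check of barrier-reuse conditions for two tasks, given a
# parent map (task -> list of its parent tasks).

def may_share_barrier(t1, t2, parents):
    no_parents  = not parents[t1] and not parents[t2]  # both are roots
    same_parent = bool(parents[t1]) and parents[t1] == parents[t2]
    only_parent = parents[t2] == [t1]  # t1 is t2's sole parent
    return no_parents or same_parent or only_parent

parents = {"A": [], "B": [], "C": ["A"], "D": ["A"]}
print(may_share_barrier("A", "B", parents))  # True - both roots
print(may_share_barrier("C", "D", parents))  # True - same parent A
print(may_share_barrier("A", "C", parents))  # True - A is C's only parent
```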
  • in a possible design, one barrier corresponds to multiple trigger conditions, which include a first trigger condition and other trigger conditions, and the first trigger condition is triggered earlier than the other trigger conditions.
  • the second synchronization information table includes a first sub-information table and a second sub-information table.
  • the first sub-information table includes multiple barriers, the first trigger condition corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding first trigger condition.
  • the second sub-information table includes multiple barriers, the other trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding other trigger conditions.
  • in a possible design, the above first sub-information table is stored in a cache of the task scheduling device, and the second sub-information table is stored in a memory.
  • because only the first sub-information table needs to reside in the cache, the chip area of the task scheduling device can be reduced.
  • in a possible design, when a barrier has multiple other trigger conditions, those other trigger conditions are arranged in the second sub-information table in triggering order. The task graph control circuit is further configured to, when the value of the barrier satisfies its corresponding first trigger condition, read the next other trigger condition from the memory according to the triggering order of the barrier's other trigger conditions in the second sub-information table, and replace the first trigger condition corresponding to the barrier with that other trigger condition.
  • the next other trigger condition is the one immediately after the first trigger condition in triggering order; that is, it is the next condition that the first barrier will trigger after its value satisfies the first trigger condition, for example a second trigger condition.
  • in that case, the task graph control circuit replaces the first trigger condition stored in the cache with the second trigger condition corresponding to the barrier. Likewise, when the value of the barrier satisfies the second trigger condition in the cache, the task graph control circuit reads a third trigger condition from the memory according to the triggering order in the second sub-information table and replaces the second trigger condition in the cache with the third.
  • in this way, the trigger conditions can be loaded into the cache in sequence.
  • the chip area of the task scheduling device can be reduced, and the scalability of the chip can be improved.
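The two-level storage scheme above can be sketched as follows: only the earliest pending trigger condition per barrier lives in the on-chip cache, and later ones are streamed in from memory, in trigger order, as each condition fires. All data structures and thresholds are invented for the illustration.

```python
# Hypothetical sketch of the cache/memory split for trigger conditions.

memory = {"b1": [(2, [3]), (4, [5]), (6, [7])]}  # ordered trigger conditions
cache = {"b1": memory["b1"][0]}                   # first condition only
next_idx = {"b1": 1}                              # next entry to stream in

def on_barrier_value(barrier, value):
    required, tasks = cache[barrier]
    if value != required:
        return []
    i = next_idx[barrier]                 # condition fired: load the next
    if i < len(memory[barrier]):          # one from memory into the cache
        cache[barrier] = memory[barrier][i]
        next_idx[barrier] = i + 1
    return tasks

print(on_barrier_value("b1", 2))  # [3] - first condition fires
print(on_barrier_value("b1", 4))  # [5] - second condition, now in cache
print(on_barrier_value("b1", 6))  # [7] - third condition
```

Keeping only one pending condition per barrier on chip is what allows the cache, and hence the chip area, to stay small while the full condition list scales in memory.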
  • a second aspect of the embodiments of this application provides a task scheduling method, applied to a task scheduling device that includes one or more task graph templates, each of which indicates the dependency relationships among the multiple tasks included in the template and the processing method of each task. The method includes: the task scheduling device obtains task information of a first task graph, where the task information includes the input data of the first task graph and the identifier of the task graph template corresponding to the first task graph.
  • the task scheduling device determines, based on the task graph template identifier corresponding to the first task graph, the task graph template corresponding to the first task graph from among the one or more task graph templates included in the device.
  • the task scheduling device schedules the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph.
  • in a possible design, the above method further includes: the task scheduling device acquires one or more task graph templates, where each task graph template includes a task information table, a first synchronization information table, and a second synchronization information table. The task information table includes multiple task identifiers and the processing method corresponding to each task identifier. The first synchronization information table includes multiple events and the one or more barrier identifiers corresponding to each event; the events correspond one-to-one with the tasks, and each event indicates the completion of its corresponding task. The second synchronization information table includes multiple barriers, the one or more trigger conditions corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding trigger condition.
  • in a possible design, the above task scheduling device includes a coupled first interface, task graph control circuit, task state machine, and second interface. Obtaining the task information of the first task graph includes: the task graph control circuit obtains, through the first interface, the task graph templates and the task information of the first task graph. Scheduling the first task graph based on its input data and the corresponding task graph template includes: the task state machine, when determining based on the second synchronization information table that the value of a first barrier satisfies its corresponding first trigger condition, obtains from the task graph control circuit the first task corresponding to a first task identifier, according to that identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sends the first task to the computing unit through the second interface; the first task identifier is the identifier of the task to be executed when the value of the first barrier satisfies the first trigger condition.
  • in a possible design, when there are multiple first tasks, the computing unit executes the multiple first tasks in parallel.
  • in a possible design, the task scheduling device further includes a coupled event parsing circuit and a synchronous counting circuit, where the synchronous counting circuit includes multiple counters and each counter corresponds to a barrier. Scheduling the first task graph based on its input data and the corresponding task graph template further includes: when the execution of the first task is completed, the event parsing circuit receives a first event through the first interface, determines the identifier of the second barrier corresponding to the first event based on the first synchronization information table, and notifies the synchronous counting circuit to modify the value of the counter corresponding to the second barrier; the first event indicates that the execution of the first task is completed; and the synchronous counting circuit modifies the value of the counter corresponding to the second barrier.
  • in a possible design, the above method further includes: the task graph control circuit modifies or deletes a task graph template.
  • the task graph template includes a first task and a second task, and the first task and the second task reuse the same barrier.
  • in a possible design, the first task and the second task meet at least one of the following conditions: neither the first task nor the second task has a parent node; or the first task and the second task have the same parent node; or the first task is the only parent node of the second task; or the root nodes of the first task and the second task share the same barrier and the first task is the only parent node of the second task.
  • in a possible design, one barrier corresponds to multiple trigger conditions, which include a first trigger condition and other trigger conditions, and the first trigger condition is triggered earlier than the other trigger conditions.
  • the second synchronization information table includes a first sub-information table and a second sub-information table.
  • the first sub-information table includes multiple barriers, the first trigger condition corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding first trigger condition.
  • the second sub-information table includes multiple barriers, the other trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding other trigger conditions.
  • the first sub-information table is stored in a cache of the task scheduling device, and the second sub-information table is stored in a memory.
  • in a possible design, the multiple other trigger conditions corresponding to a barrier are arranged in the second sub-information table in triggering order. The above method further includes: when the value of the barrier satisfies its corresponding first trigger condition, the task graph control circuit reads the next other trigger condition from the memory according to the triggering order of the barrier's other trigger conditions in the second sub-information table, and replaces the first trigger condition corresponding to the barrier with that other trigger condition.
  • a third aspect of the embodiments of this application provides a computing device that includes a central processing unit (CPU) and the task scheduling device according to the first aspect; the CPU is configured to send the task graph templates to the task scheduling device.
  • in a possible design, the computing device further includes an enhanced short message service (EMS) and a computing unit.
  • the EMS is configured to receive tasks to be executed from the task scheduling device and allocate them to the computing unit; the computing unit is configured to execute the tasks to be executed.
  • FIG. 1 is a schematic structural diagram of a scheduling device provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of another scheduling device provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a task graph template provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of another task graph template provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of another task graph template provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a task graph template in which multiple tasks reuse a barrier, provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of non-reused and reused barriers in a task graph template provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a task scheduling device provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 10 is a schematic flowchart of a task scheduling method provided by an embodiment of the present application;
  • FIG. 11 is a schematic flowchart of another task scheduling method provided by an embodiment of the present application.
  • At least one item (piece) of a, b or c can represent: a, b, c, a and b, a and c, b and c, or, a and b and c, wherein a, b and c can be single or multiple.
  • words such as "first" and "second" are used to distinguish identical or similar items with basically the same function and effect; those skilled in the art will understand that such words do not limit quantity or execution order.
  • "first" in the first sub-information table and "second" in the second sub-information table in the embodiments of this application are used only to distinguish different sub-information tables.
  • the descriptions "first", "second", and so on appearing in the embodiments of this application are used only for illustration and to distinguish the objects being described, and do not imply any limitation on the embodiments of this application.
  • the computing architecture of computing devices can usually be a heterogeneous computing hardware architecture, which includes a central processing unit ( central processing unit, CPU), and one or more accelerators.
  • the CPU is used to perform general computing tasks
  • the accelerator is used to perform specialized computing tasks.
  • the specialized computing tasks may include artificial intelligence (AI) processing, such as artificial neural network operations, machine learning (ML) training, ML optimization/learning, inference, classification, visual data processing, network data processing, object detection, rule analysis, content processing, and so on.
  • the accelerator may be a neural-network processing unit (NPU), and may include one or more of a graphics processing unit (GPU), a digital signal processor (DSP), a system on chip (SoC), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), and the like.
  • the scheduling software can determine the dependencies between multiple tasks according to the input and output, and schedule the ready tasks to be executed on the CPU cores or accelerators that can execute the tasks.
  • a scheduling device as shown in Figure 1, the scheduling device can be referred to as a task master (Task Maestro), and the method for scheduling tasks of the scheduling device can include the following steps:
  • the main processor core (Master Core) adds a task descriptor (Task Description) to the Task Maestro;
  • the Task Maestro stores the task descriptor in the task pool (Task Pool);
  • the checking module (Check Deps) checks the dependencies between tasks, dispatches ready tasks through the scheduling module to the computing units (Worker Cores) for execution, and modifies the dependence table (Dependence Table);
  • the checking module continues to check whether any tasks are ready and keeps dispatching ready tasks to the computing units until all tasks are executed.
  • when the scheduling device shown in Figure 1 schedules tasks, for multiple task graphs with the same processing methods and dependencies, each time a task graph is executed the main processor core (Master Core) must re-initialize the corresponding task descriptors (which include the task processing methods and task dependencies) into the scheduling device, after which the scheduling device maintains the dependencies among the tasks in the graph to ensure normal computation. Therefore, the scheduling device shown in Figure 1 does not support building static dependencies and processing methods into the scheduling device, which leads to a long initialization time when the scheduling device resolves the dependencies of a task graph and to redundant interaction between the Master Core and the scheduling device, reducing computing efficiency.
  • a task scheduling graph (task scheduling graph, TSG) device as shown in FIG. 2, the TSG device includes a task library, an event counter and a refresh command module.
  • the method for scheduling tasks of the TSG device may include the following steps:
  • the main processor core (Master Core) initializes the task descriptor (Task Description), dependency table, and event counter Counter;
  • the TSG device dispatches ready tasks to the computing unit (Worker Core);
  • after all tasks are executed, the TSG device notifies the main processor core (Master Core), and the main processor core (Master Core) re-initializes new tasks.
  • when the scheduling device shown in Figure 2 schedules tasks, for multiple task graphs with the same processing methods and dependencies, every time a task graph is executed, the Master Core must re-initialize the task descriptors corresponding to the task graph into the TSG device, after which the TSG device maintains the dependencies among the multiple tasks in the task graph through event counters and refresh commands to ensure correct computation. Therefore, the scheduling device shown in Figure 2 does not support building static dependencies and processing methods into the scheduling device, and each task graph must be re-initialized into the TSG device, which causes a long initialization time and affects computing efficiency. Moreover, in this solution each task is configured with its own counter, so the number of counters is large, resulting in a large chip area for the TSG device.
  • the embodiment of the present application provides a task scheduling device.
  • by creating a task graph template once, the task scheduling device can subsequently execute multiple task graphs with the same processing methods and dependencies as the task graph template, without reloading the static processing methods and dependencies into the task scheduling device. That is, the task scheduling device provided by the embodiment of the present application supports built-in static task graph templates, thereby saving the time of loading static processing methods and dependencies into the task scheduling device and improving computing efficiency.
  • An embodiment of the present application provides a task scheduling device, and the task scheduling device can be applied to fields such as communication processors, HPC, and AI computing.
  • the task scheduling device may be a chip in a communication device or a computing device.
  • the task scheduling device is used to obtain one or more task graph templates.
  • the task graph template acquired by the task scheduling device may be sent by the CPU to the task scheduling device, or may be preset in the task scheduling device, which is not limited in this embodiment of the present application.
  • Each task graph template in the task scheduling device is used to indicate the dependency among multiple tasks included in the task graph template, and the processing method of each task.
  • the processing method of each task may include the computing method, data copying method, and data moving method of each task.
  • the task graph template includes task 1, task 2, and task 3. Taking a task graph template used to calculate (a+b)*(c-d) as an example, the processing method of task 1 is a+b, the processing method of task 2 is c-d, and the processing method of task 3 is multiplying the calculation result e of task 1 by the calculation result f of task 2.
  • task 3 can only be executed after task 1 and task 2 are executed. That is, the execution of task 3 depends on the calculation results of task 1 and task 2.
  • This task graph template is used to indicate the calculation methods of task 1, task 2, and task 3, and the dependencies among task 1, task 2, and task 3.
  • task graph 1 is used to calculate (1+2)*(4-3)
  • task graph 2 is used to calculate (5+6)*(8-7)
  • the dependencies among the multiple tasks in task graph 1 are the same as the dependencies among the multiple tasks in task graph 2, and the calculation methods of the multiple tasks in task graph 1 are also the same as those of the multiple tasks in task graph 2. Therefore, task graph 1 and task graph 2 share the same task graph template, (a+b)*(c-d).
  • the difference between task graph 1 and task graph 2 is that the input data of the two task graphs are different.
  • the task scheduling device in the embodiment of the present application supports building a static task graph template into the task scheduling device, so that when multiple task graphs corresponding to the task graph template are subsequently executed, there is no need to re-initialize the dependencies and processing methods into the task scheduling device; only the dynamic input data and the identifier of the task graph template to be used need to be obtained when those task graphs are executed, which saves the time of loading dependencies into the task scheduling device.
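As an illustration of the template-reuse idea described above, the following is a minimal Python sketch (not the patent's implementation; all names are hypothetical). The template fixes the processing methods and dependencies once; each task graph run supplies only its dynamic input data:

```python
# A minimal sketch (not the patent's implementation) of a static task graph
# template reused across task graphs. The template fixes processing methods
# and dependencies once; each run supplies only dynamic input data.

def make_template():
    """Template for (a+b)*(c-d): three tasks with static dependencies."""
    return {
        "tasks": {
            "task1": lambda inp, res: inp["a"] + inp["b"],          # e = a+b
            "task2": lambda inp, res: inp["c"] - inp["d"],          # f = c-d
            "task3": lambda inp, res: res["task1"] * res["task2"],  # e*f
        },
        # task3 can only run after task1 and task2 have finished
        "deps": {"task1": [], "task2": [], "task3": ["task1", "task2"]},
    }

def run_graph(template, input_data):
    """Execute the tasks in dependency order; only input_data varies."""
    results, pending = {}, list(template["tasks"])
    while pending:
        for t in list(pending):
            if all(d in results for d in template["deps"][t]):
                results[t] = template["tasks"][t](input_data, results)
                pending.remove(t)
    return results["task3"]

template = make_template()                                   # created once
g1 = run_graph(template, {"a": 1, "b": 2, "c": 4, "d": 3})   # (1+2)*(4-3)
g2 = run_graph(template, {"a": 5, "b": 6, "c": 8, "d": 7})   # (5+6)*(8-7)
```

Here task graph 1 and task graph 2 share one template and differ only in input data, mirroring the (1+2)*(4-3) and (5+6)*(8-7) example above.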
  • multiple tasks included in the task graph template can be executed serially or in parallel.
  • some of the tasks can be executed serially, and some of the tasks can be executed in parallel.
  • the embodiment of the present application does not limit the specific execution manner of the multiple tasks in the task graph template.
  • the task graph template in the task scheduling device in the embodiment of the present application can be used in any multi-task computing scenario.
  • Figure 3 is a task diagram template for L2 scheduling in wireless communication.
  • the scheduling process can be abstracted into the task graph template shown in Figure 3.
  • T0 is a transmission time interval (transport time interval, TTI) interrupt timing trigger task
  • T1 is a cell-level scheduling task
  • T2 is an air domain scheduling task
  • T3 is a frequency domain scheduling task
  • T4 is a user-level post-processing task.
  • the main processor core (Master Core) can load the task graph template shown in Figure 3 into the task scheduling device of the present application, and the task scheduling device of the present application can complete the dependency analysis and scheduling of the multiple tasks, ensuring the degree of parallelism of task execution.
  • Fig. 4 is a task diagram template for matrix calculation in HPC.
  • T1 is the multiplication operation.
  • the main processor core (Master Core) abstracts the task graph template in Figure 4 into two T0 tasks and one T1 task and loads it into the task scheduling device of the present application, and the task scheduling device of the present application completes the dependency analysis and scheduling of the multiple tasks, guaranteeing the degree of parallelism of task execution.
  • matrix calculations in HPC can also include matrix LU decomposition and other matrix calculations.
  • Figure 4 only takes a task graph template including addition and multiplication as an example for illustration.
  • FIG. 5 is a task graph template of a convolutional neural network (CNN) in an AI computing scenario; the calculation steps of the CNN shown in (a) in Figure 5 can be abstracted into the task graph template shown in (b) in Figure 5, where:
  • T1 is pre-processing
  • T2, T3, and T4 are direct memory access (DMA) tasks
  • T5 is VADD
  • T6, T7 are convolution CONV
  • T8 is pooling Pool
  • T9 is DMA.
  • the main processor core Master Core loads the task graph template shown in Figure 5 into the task scheduling device of the present application, and the task scheduling device of the present application completes the dependency analysis and scheduling of multiple tasks to ensure the parallelism of task execution.
  • the task graph templates in the task scheduling device are not limited to the task graph templates of the above three scenarios, and any multi-task computing scenario can abstract the dependencies and processing methods among multiple tasks into a task graph template.
  • each task graph template may include a task information table, a first synchronization information table, and a second synchronization information table.
  • the data structure of each task graph template in the embodiment of the present application can be described by using three tables: task information table, first synchronization information table and second synchronization information table. The three tables are described below.
  • the task information table includes a plurality of task identifiers, and a processing method corresponding to each task identifier.
  • the processing method corresponding to each task identifier may be a specific calculation method for each task.
  • the task information table may be as shown in Table 1.
  • the task scheduling device can obtain the processing mode of each task.
  • the task information table shown in Table 1 is used to indicate the static processing mode of the task graph template.
  • TaskType0 to TaskTypeN in Table 1 represent task identifiers. Each task identifier in the task information table corresponds to a function (function), and TaskInfo represents the pointer position of the specific variables used in that function; based on TaskInfo and the function, the task scheduling device can obtain the specific calculation method corresponding to each task identifier.
  • TaskInfo0 is used to indicate the pointer positions of variable a and variable b.
  • looking up Table 1 with the task identifier TaskType0, the task corresponding to TaskType0 is a+b.
  • the processing method of each task stored in the above task information table is static data; therefore, multiple task graphs can use the same task information table to obtain their task processing methods, while the values of the dynamic data differ between different task graphs.
  • the task information table may also include information such as task priority, queue number, task calculation amount, and affinity tag (TAG; tasks with the same affinity TAG can be sent to the same computing unit for execution).
  • after the load balancing module receives the tasks dispatched by the task scheduling device, it can perform operations such as load balancing, priority scheduling, and affinity scheduling based on information such as task priority, queue number, task calculation amount, and affinity TAG.
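The lookup path from a task identifier through TaskInfo to a concrete calculation can be sketched as follows. This is an illustrative model of Table 1, with hypothetical function and variable names, not the patent's actual data layout:

```python
# Sketch of a task information table (cf. Table 1), with hypothetical names.
# Each task identifier maps to a static function and a TaskInfo entry that
# points at the variables the function reads.

def add(vals): return vals[0] + vals[1]
def sub(vals): return vals[0] - vals[1]

memory = {"a": 1, "b": 2, "c": 4, "d": 3}    # dynamic data, varies per graph

task_info_table = {
    # TaskType: (function, TaskInfo = variable "pointers")
    "TaskType0": (add, ("a", "b")),
    "TaskType1": (sub, ("c", "d")),
}

def resolve(task_type):
    """Look up a task identifier and compute its result from TaskInfo."""
    func, task_info = task_info_table[task_type]
    return func([memory[name] for name in task_info])
```

With the dynamic values above, `resolve("TaskType0")` evaluates a+b while the static table itself never changes between task graphs.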
  • the first synchronization information table includes multiple events and one or more barrier identifiers corresponding to each event, the multiple events correspond to multiple tasks one by one, and each event is used to indicate the execution completion of its corresponding task.
  • the barrier is used to coordinate multiple tasks working in parallel, and the next task can only be executed when the barrier meets its trigger condition.
  • Each barrier can correspond to a counter, and the value of the barrier is the value of its corresponding counter.
  • each task may correspond to an event (Event), each event is used to indicate the execution completion of its corresponding task, and the first synchronization information table may be as shown in Table 2.
  • the task scheduling device can obtain the identifier of the barrier corresponding to each event based on the first synchronization information table, and modify the value of the counter corresponding to the barrier based on the identifier of the barrier corresponding to the event.
  • when the task scheduling device modifies the value of the counter corresponding to the barrier, it may add one to the value of that counter, or may decrement it by one.
  • the task scheduling device modifies the value of the counter corresponding to the barrier, whether to add or subtract one to the value of the counter is related to the initial value of the counter.
  • the initial values of the barriers are all 0, and the task scheduling apparatus modifies the value of the counter corresponding to the barrier by adding one as an example for illustration.
  • an event can correspond to one or more barriers. As shown in Table 2, when an event corresponds to multiple barriers, after the task corresponding to the event is executed, the values of the multiple barriers corresponding to the event are modified.
  • multiple events can also correspond to the same barrier; that is, the value of the barrier is modified after each of the multiple tasks corresponding to those events is executed. For example, taking Event0, Event1, and Event2 respectively representing the execution completion of Task0, Task1, and Task2, as shown in Table 2, since the barrier identifiers corresponding to Event0, Event1, and Event2 all include 0x1, the task scheduling device adds one to the value of barrier 0x1 after Task0 is executed, adds one again after Task1 is executed, and adds one again after Task2 is executed.
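The event-to-barrier bookkeeping described above can be sketched as follows (a simplified model of Table 2, assuming counters start at 0 and are incremented by one per event; the table contents are illustrative):

```python
# Sketch of the first synchronization information table (cf. Table 2):
# each event maps to one or more barrier identifiers; when a task finishes,
# the counter of every barrier its event names is incremented.

first_sync_table = {
    "Event0": [0x1],         # Task0 done -> bump barrier 0x1
    "Event1": [0x1],         # Task1 done -> bump barrier 0x1
    "Event2": [0x1, 0x2],    # Task2 done -> bump barriers 0x1 and 0x2
}

barrier_counters = {0x1: 0, 0x2: 0}

def on_event(event):
    """Modify the counters of all barriers corresponding to the event."""
    for barrier_id in first_sync_table[event]:
        barrier_counters[barrier_id] += 1

for e in ("Event0", "Event1", "Event2"):
    on_event(e)
# barrier 0x1 has now been incremented once per completed task
```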
  • the second synchronization information table includes a plurality of barriers, a trigger condition corresponding to each barrier, and a task identifier to be executed when each barrier meets its corresponding trigger condition.
  • each barrier may correspond to one or more trigger conditions, and there may be one or more tasks to be executed when each barrier meets one of its corresponding trigger conditions.
  • the second synchronization information table may further include a valid bit of each barrier, and the valid bit of each barrier is used to indicate whether the barrier is valid.
  • the second synchronization information table may also include execution times corresponding to each to-be-executed task identifier.
  • the second synchronization information may be as shown in Table 3.
  • the task scheduling device queries the second synchronization information table based on the identifier of the barrier to determine whether the value of the barrier meets its corresponding trigger condition.
  • the task scheduling device determines the task identifier to be executed based on the second synchronization information table, and obtains the task content corresponding to the task identifier based on the task information table (Table 1), and then sends The computing unit sends the task.
  • a barrier can correspond to multiple trigger conditions, and when the value of the barrier satisfies different trigger conditions, the tasks to be executed are different.
  • the multiple trigger conditions in the second synchronization information table may be arranged in order of triggering. For example, the trigger sequence of trigger_condition0 in Table 3 is earlier than the trigger sequence of trigger_condition1.
  • multiple tasks can reuse the same barrier.
  • Multiple tasks reuse the same barrier means that the triggering of multiple tasks can depend on the same barrier. That is, in the second synchronization information table, when multiple tasks reuse the same barrier and the value of the barrier satisfies one or more trigger conditions, the corresponding tasks to be executed are the multiple tasks. Since the value of a barrier can be maintained by a counter, when multiple tasks reuse the same barrier, the number of counters can be reduced, thereby reducing the chip area.
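A simplified model of the second synchronization information table (cf. Table 3), in which one barrier carries multiple trigger conditions and is thus reused by several tasks; the trigger values and task names are hypothetical:

```python
# Sketch of the second synchronization information table (cf. Table 3):
# each barrier carries one or more trigger conditions, each with the task
# identifiers to dispatch when the barrier's counter reaches that value.
# Two tasks reusing one barrier appear as two conditions on the same row.

second_sync_table = {
    # barrier_id: list of (trigger_value, tasks_to_execute)
    0x1: [(1, ["TaskA"]), (3, ["TaskB", "TaskC"])],  # barrier reused twice
}

def check_barrier(barrier_id, value):
    """Return the tasks ready to run when the barrier reaches `value`."""
    ready = []
    for trigger_value, tasks in second_sync_table.get(barrier_id, []):
        if value == trigger_value:
            ready.extend(tasks)
    return ready
```

Because both trigger conditions share barrier 0x1, only one counter is needed where an unshared design would use two.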
  • the first task and the second task may reuse the same barrier.
  • the first task may be the task to be executed when barrier0 meets the first trigger condition, and the second task may be the task to be executed when barrier0 meets the second trigger condition; the first trigger condition and the second trigger condition may be the same or different. That is to say, the trigger conditions of multiple tasks that reuse the same barrier can be the same or different.
  • when the first trigger condition and the second trigger condition are the same, the first task and the second task are two tasks executed in parallel.
  • when the first task and the second task meet at least one of the following four conditions, the first task and the second task can reuse the same barrier.
  • Case 1: the first task and the second task are both root nodes, that is, tasks without parent nodes; in this case, the first task and the second task can reuse the same barrier.
  • for example, take task T1, task T2, and task T3 being the parent nodes of task T4 as an example: task T1, task T2, and task T3 are root nodes, that is, task T1, task T2, and task T3 have no parent nodes, so task T1, task T2, and task T3 can reuse the same barrier.
  • Case 2: the first task and the second task have the same parent nodes. For example, take task T4 and task T5 being executable in parallel as an example: since the parent nodes of task T4 and task T5 are both tasks T1 to T3, the parent nodes of task T4 and task T5 are the same, so task T4 and task T5 can reuse the same barrier.
  • Case 3: the first task is the only parent node of the second task.
  • for example, take task T1, task T2, and task T3 being the parent nodes of task T4, and task T4 being the parent node of task T5 as an example: since task T4 is the only parent node of task T5, task T4 and task T5 can reuse the same barrier.
  • for another example, still taking task T1, task T2, and task T3 as the parent nodes of task T4 and task T4 as the parent node of task T5: since the root nodes of task T4 and task T5 are task T1, task T2, and task T3, task T1, task T2, and task T3 can reuse the same barrier, and since task T4 is the only parent node of task T5, task T4 and task T5 can reuse the same barrier. In this case, the trigger conditions of task T4 and task T5 are different.
  • the multiple tasks in a task graph template meet any one or more of the above four conditions, the multiple tasks can reuse the same barrier.
  • the situation that multiple tasks reuse the same barrier is not limited to the above four situations. Specifically, it can be determined whether multiple tasks can reuse the same barrier according to the dependencies of multiple tasks in the task graph template.
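The reuse conditions above (excluding combined cases) can be sketched as a predicate over a parent map. This is an illustrative reading of cases 1 to 3 with hypothetical task names, not the patent's exact rule set:

```python
# Sketch of the barrier-reuse conditions described above, over a parent map
# (task -> set of parent tasks); all names are hypothetical.

def can_reuse_barrier(parents, t1, t2):
    """Two tasks may share a barrier if both are root nodes, their parent
    sets are identical, or one is the only parent of the other."""
    p1, p2 = parents.get(t1, set()), parents.get(t2, set())
    if not p1 and not p2:           # Case 1: both root nodes
        return True
    if p1 == p2:                    # Case 2: same parent nodes
        return True
    if p2 == {t1} or p1 == {t2}:    # Case 3: sole-parent relationship
        return True
    return False

# T1, T2, T3 are roots; T4 and T5 both depend on T1..T3; T6 depends on T4
parents = {"T1": set(), "T2": set(), "T3": set(),
           "T4": {"T1", "T2", "T3"}, "T5": {"T1", "T2", "T3"},
           "T6": {"T4"}}
```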
  • for example, take a task graph template including 8 tasks, task T1 to task T8, as an example. In the task graph template shown in (a) in Figure 7, the triggering of tasks T1 to T8 depends on barrier1 to barrier7; that is, the triggering of tasks T1 to T8 depends on different barriers.
  • the second information table corresponding to the task graph template shown in (a) in FIG. 7 is shown in Table 4 below.
  • for another example, take the task graph template including 8 tasks, task T1 to task T8, shown in (b) in Figure 7 as an example: since task T1 is the only parent node of task T2, the only parent node of task T3 is task T1, and task T2 is the only parent node of task T4, task T1, task T2, task T3, and task T4 can all reuse the same barrier, which is recorded as b1 shown in (b) in FIG. 7.
  • task T7 is the only parent node of task T8, task T7 and task T8 can reuse the same barrier, which is recorded as b4 shown in (b) in Figure 7.
  • the second information table corresponding to the task graph template shown in (b) in FIG. 7 is shown in Table 5 below.
  • without barrier reuse, the triggering of tasks T1 to T8 depends on 7 barriers, barrier1 to barrier7, while with barrier (or counter) reuse, the triggering of tasks T1 to T8 depends on only 4 barriers, barrier1 to barrier4. Since the value of each barrier is maintained by a counter, for the same task graph template, reusing barriers can greatly reduce the number of counters and reduce the chip area compared with not reusing barriers.
  • when the value of a barrier satisfies a trigger condition and the corresponding tasks to be executed are multiple, the multiple tasks to be executed can be executed in parallel. For example, as shown in Table 5, when the value of b1 is 1, the tasks to be executed are task T2 and task T3, and the computing unit can execute task T2 and task T3 in parallel.
  • the task information table, the first synchronization information table and the second synchronization information table stored in the task scheduling device may be sent to the task scheduling device by the CPU (for example, Master core), or may be pre-configured in the task scheduling device , which is not limited in this embodiment of the present application.
  • the data structure of the task map template in the embodiment of the present application can be described by three tables: the task information table (Table 1), the first synchronization information table (Table 2) and the second synchronization information table (Table 3) .
  • the task scheduling device can schedule multiple tasks based on these three tables.
  • the task scheduling device can modify and delete multiple task map templates stored in it, and can also add new task map templates.
  • the task scheduling process of the task scheduling device includes: obtaining task information of the first task graph; determining, based on the task graph template identifier corresponding to the first task graph, the task graph template corresponding to the first task graph among the one or more task graph templates stored in the task scheduling device; and scheduling the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph.
  • the task information of the first task graph includes the input data of the first task graph and the identifier of the task graph template corresponding to the first task graph.
  • the task scheduling device may receive input data of the first task graph from the CPU, Master core or accelerator and a task graph template identifier corresponding to the first task graph.
  • the data structure of each task map template can be passed through the above task information table (Table 1), the first synchronization information table (Table 2) and the second synchronization information Table (Table 3) to describe.
  • according to the identifier of the task graph template corresponding to the first task graph, the task scheduling device can determine, among the multiple task graph templates stored in it, the task information table (Table 1), the first synchronization information table (Table 2), and the second synchronization information table (Table 3) corresponding to the first task graph.
  • the task scheduling apparatus may schedule the first task graph based on the task information table, the first synchronization information table, and the second synchronization information table corresponding to the first task graph.
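The three steps above (receive task information, select the stored template by identifier, schedule with the input data) can be sketched as follows; the class and method names are hypothetical:

```python
# Sketch of the scheduling entry point: the device stores templates once,
# keyed by template identifier; each task graph submission then carries only
# the template identifier and the dynamic input data.

class TaskScheduler:
    def __init__(self):
        self.templates = {}                 # template_id -> template

    def load_template(self, template_id, template):
        """Built in once; later graphs reference it by identifier only."""
        self.templates[template_id] = template

    def schedule(self, template_id, input_data):
        """Task information = (template identifier, input data)."""
        template = self.templates[template_id]   # select matching template
        return template(input_data)              # run with dynamic data

sched = TaskScheduler()
sched.load_template("sum_times_diff",
                    lambda d: (d["a"] + d["b"]) * (d["c"] - d["d"]))
```

After the one-time `load_template` call, every subsequent `schedule` call avoids re-sending the static processing methods and dependencies.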
  • the task scheduling process of the task scheduling device will be described in detail in combination with each circuit module.
  • the task scheduling apparatus may include a coupled first interface 801 , a task graph control circuit 802 , a task state machine 803 , and a second interface 804 .
  • the task map control circuit 802 is configured to acquire a task map template and task information of the first task map through the first interface 801 .
  • the first interface 801 is responsible for receiving and identifying commands from upstream modules, and routing different commands to different modules. For example, after receiving the task map template sent by the CPU, the first interface 801 routes the task map template to the task map control circuit 802 . For another example, the first interface 801 parses the event after receiving the event sent by the computing unit indicating that the execution of the task is completed, and routes the event to the event parsing circuit.
  • the task map control circuit 802 may receive a task map template from the CPU through the first interface 801 .
  • the task graph template is created in the task scheduling device only once, and can be used by subsequent task graphs executed multiple times. It can be understood that the dependencies and processing methods in the task graph template in the embodiment of the present application are all static information, and only the dynamic data of a task graph and the identifier of the task graph template to be used need to be obtained when the task graph is executed multiple times.
  • in this way, the task scheduling device creates the task graph template once, and when the first task graph and the second task graph are subsequently executed, there is no need to load the dependencies and processing methods into the task scheduling device again; only the dynamic data of the first task graph and the second task graph, and the identifier of the task graph template to be used, are loaded, so the initialization time of the task graphs can be saved.
  • the task state machine 803 is configured to, when determining based on the second synchronization information table that the value of the first barrier satisfies its corresponding first trigger condition, obtain the first task corresponding to the first task identifier from the task graph control circuit 802 according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and send the first task to the computing unit through the second interface 804.
  • the first task identifier is an identifier of a task to be executed when the value of the first barrier satisfies the first trigger condition.
  • the second interface 804 is responsible for interacting with downstream modules, and is used to send ready tasks to a load balancing unit or a computing unit.
  • the first interface 801 and the second interface 804 may be two different physical interfaces, or may be the same physical interface.
  • the physical interface can receive commands or data, and can also send commands or data.
  • FIG. 8 is only illustrated by taking the first interface 801 and the second interface 804 as different physical interfaces as an example.
  • the timing for the task state machine 803 to determine whether the value of the barrier satisfies its corresponding trigger condition may include the following two situations.
  • the task graph control circuit 802 sends the first task trigger signal to the task state machine 803, and the first task trigger signal is used to instruct the task state machine 803 to query the second synchronization information table to determine whether the value of b1 satisfies the trigger condition corresponding to the first task.
  • when the task state machine 803 determines that the value of b1 satisfies the trigger condition corresponding to the first task, the task state machine 803 acquires the task content of the first task from the task graph control circuit 802 according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sends the task content of the first task to the computing unit through the second interface 804.
  • the task state machine 803 can query the second synchronization information table when the value of the barrier is updated to determine whether the value of the barrier meets its corresponding trigger condition.
  • the task graph control circuit 802 may also send a first task execution signal to the task state machine 803; the task state machine 803 obtains the task content of the first task from the task graph control circuit 802 according to the first task execution signal, and sends the task content of the first task to the computing unit through the second interface 804.
  • when there are multiple first tasks, the computing unit may execute the multiple first tasks in parallel.
  • the task state machine 803 may send the multiple first tasks to multiple computing units, so that the multiple computing units execute the multiple first tasks in parallel.
  • the task scheduling device may further include an event parsing circuit 805 and a synchronous counting circuit 806. The synchronous counting circuit 806 includes a plurality of counters; each barrier corresponds to a counter, and the value of each barrier is the value of its corresponding counter.
  • the event parsing circuit 805 is configured to receive the first event through the first interface 801 when the execution of the first task is completed, determine the identifier of the second barrier corresponding to the first event based on the first synchronization information table, and notify the synchronous counting circuit 806 to modify the value of the counter corresponding to the second barrier.
  • the first event is used to indicate that the execution of the first task is completed.
  • the synchronous counting circuit 806 is configured to modify the value of the counter corresponding to the second barrier.
  • the above-mentioned second barrier and the first barrier may be the same barrier or different barriers.
  • when the synchronous counting circuit 806 modifies the value of the counter corresponding to the second barrier, it may add one to the value of that counter, decrement it by one, or add or subtract another value.
  • the synchronous counting circuit 806 modifies the value of the counter corresponding to the barrier, whether to increase (for example, add one) or decrease (for example, subtract one) the value of the counter is related to the initial value of the counter.
  • in a possible implementation manner, the synchronous counting circuit 806 may add one to the value of the counter corresponding to the second barrier when modifying it. In this implementation manner, when the value of the barrier increases to the value specified by its corresponding trigger condition, the barrier satisfies that trigger condition.
  • the synchronous counting circuit 806 may decrement the value of the counter corresponding to the second barrier by one when modifying the value of the counter corresponding to the second barrier. In this implementation manner, when the value of the barrier decreases to 0, the barrier satisfies its corresponding trigger condition.
  • the embodiment of the present application does not limit the specific method for the synchronous counting circuit 806 to modify the value of the counter corresponding to the barrier.
  • in the following description, the initial value of each barrier is 0, and the synchronous counting circuit 806 modifies the value of the counter corresponding to a barrier by adding one, as an example for illustration.
  • the computing unit may send a first event indicating the execution completion of the first task to the task scheduling device; the first interface 801 parses the first event and sends it to the event parsing circuit 805.
  • the event parsing circuit 805 receives the first event, queries the first synchronization information table, determines the identifier of the second barrier corresponding to the first event, and notifies the synchronization counting circuit 806 to modify the value of the counter corresponding to the second barrier.
  • the synchronous counting circuit 806 modifies the value of the counter corresponding to the second barrier, it notifies the task state machine 803 of the identity of the second barrier.
  • the task state machine 803 judges, based on the second synchronization information table, whether the value of the second barrier satisfies its corresponding trigger condition, obtains the next task to execute, and sends that task to the computing unit, until all the tasks in the first task graph are executed.
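The loop just described (event parsing, counter update, trigger check, dispatch) can be sketched as follows, using Figure 7(b)-style values in which Event1 bumps barrier b2 whose trigger value 1 releases tasks T2 and T3; all table contents and names are illustrative:

```python
# Sketch of the event loop formed by the event parsing circuit, the
# synchronous counting circuit, and the task state machine: a completion
# event bumps a barrier counter, the state machine checks the trigger
# conditions, and newly ready tasks are dispatched.

event_to_barriers = {"Event1": ["b2"]}               # first sync table
triggers = {"b2": [(1, ["T2", "T3"])]}               # second sync table
counters = {"b2": 0}
dispatched = []

def handle_event(event):
    """Event parsing -> counter update -> trigger check -> dispatch."""
    for b in event_to_barriers.get(event, []):
        counters[b] += 1                             # synchronous counting
        for trigger_value, tasks in triggers.get(b, []):
            if counters[b] == trigger_value:         # task state machine
                dispatched.extend(tasks)             # send to worker cores

handle_event("Event1")                               # T1 finished
```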
  • the task scheduling device provided by the embodiment of the present application supports built-in static task graph templates, so that when the task scheduling device executes multiple task graphs with the same processing methods and dependencies, the CPU does not need to initialize the processing methods and dependencies corresponding to each task graph into the task scheduling device, which reduces the initialization time of the task graphs. That is to say, by creating a task graph template once, the task scheduling device provided by the embodiment of the present application can repeatedly execute multiple task graphs that have the same processing methods and dependencies as the task graph template.
  • multiple tasks in the task graph template provided by the embodiment of the present application can reuse barriers, which can reduce the number of counters in the synchronous counting circuit, thereby reducing the area of the task scheduling device and improving the scalability of the chip.
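The barrier-reuse idea can be illustrated with a toy software model of the first synchronization information table. This is a hypothetical sketch, not the hardware circuit: the event and barrier names are illustrative, and the point is only that one event can advance several barrier counters while one counter can be shared by several events.

```python
# Hypothetical software model of the event -> barrier mapping described above.
# First synchronization information table: event id -> barrier ids to increment.
event_to_barriers = {
    "EventA": ["b1", "b2"],   # one event can advance several barriers (fan-out)
    "EventB": ["b2"],         # several events can share one barrier (fan-in)
}

counters = {"b1": 0, "b2": 0}  # every barrier counter starts at 0

def on_event(event_id):
    """Model of the event parsing + synchronous counting circuits:
    look up the barriers for the event and add one to each counter."""
    for barrier in event_to_barriers[event_id]:
        counters[barrier] += 1

on_event("EventA")
on_event("EventB")
print(counters)  # b2 was incremented by both events
```

Because b2 is shared by both events, only two counters are needed instead of three, which is the saving the text attributes to barrier reuse.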
  • the first synchronization information table corresponding to the task map template shown in (a) in FIG. 7 is shown in Table 6 below.
  • the task graph control circuit 802 receives the task graph template from the CPU through the first interface 801 and stores it; the template is described by the tables above.
  • the task map control circuit 802 receives the task information of the first task map from the CPU through the first interface 801 , and the task map template corresponding to the first task map is shown in (a) of FIG. 7 .
  • after the computing unit executes the task T1, it sends an Event1 indicating that the execution of the task T1 is completed to the first interface 801.
  • the first interface 801 parses Event1 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event1, looks up Table 6 to determine that the identifier of the barrier corresponding to Event1 is b2, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b2.
  • the synchronous counting circuit 806 modifies the value of b2 to 1, and notifies the task state machine 803 of the identity of b2.
  • the task state machine 803 obtains task T2 and task T3 from the task graph control circuit 802 based on the identifiers of the tasks to be executed and Table 1, and sends task T2 and task T3 to computing unit 1 and computing unit 2.
  • the computing unit executes task T2 and task T3 in parallel; after it finishes them, it sends Event2 and Event3, indicating that execution of task T2 and task T3 is complete, to the first interface 801.
  • the first interface 801 parses Event2 and Event3 and sends them to the event parsing circuit 805; the event parsing circuit 805 receives Event2 and Event3, looks up Table 6 to determine that the barriers corresponding to Event2 are identified as b3 and b5 and the barriers corresponding to Event3 are identified as b4 and b5, and notifies the synchronous counting circuit 806 to modify the values of the counters corresponding to b3, b4 and b5.
  • the synchronous counting circuit 806 modifies the value of b3 to 1, the value of b4 to 1, and the value of b5 to 2, and notifies the task state machine 803 of the identifiers of b3, b4 and b5.
  • after the computing unit executes the task T4, it sends an Event4 indicating that the execution of the task T4 is completed to the first interface 801.
  • the first interface 801 parses Event4 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event4, looks up Table 6 to determine that the identifier of the barrier corresponding to Event4 is b4, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b4.
  • the synchronous counting circuit 806 modifies the value of b4 to 2, and notifies the task state machine 803 of the identity of b4.
  • the task state machine 803 obtains task T5 from the task graph control circuit 802 based on the identifier of the task to be executed and Table 1, and sends task T5 to the computing unit.
  • after the computing unit executes the task T6, it sends an Event6 indicating that the execution of the task T6 is completed to the first interface 801.
  • the first interface 801 parses Event6 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event6, looks up Table 6 to determine that the identifier of the barrier corresponding to Event6 is b6, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b6.
  • the computing unit may execute task T4 and task T6 in parallel.
  • after the computing unit executes the task T5, it sends an Event5 indicating that the execution of the task T5 is completed to the first interface 801.
  • the first interface 801 parses Event5 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event5, looks up Table 6 to determine that the identifier of the barrier corresponding to Event5 is b6, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b6.
  • the synchronous counting circuit 806 modifies the value of b6 to 2, and notifies the task state machine 803 of the identity of b6.
  • the task state machine 803 obtains task T7 from the task graph control circuit 802 based on the identifier of the task to be executed and Table 1, and sends task T7 to the computing unit.
  • after the computing unit executes the task T7, it sends an Event7 indicating that the execution of the task T7 is completed to the first interface 801.
  • the first interface 801 parses Event7 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event7, looks up Table 6 to determine that the identifier of the barrier corresponding to Event7 is b7, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b7.
  • the synchronous counting circuit 806 modifies the value of b7 to 1, and notifies the task state machine 803 of the identity of b7.
  • the task state machine 803 obtains task T8 from the task graph control circuit 802 based on the identifier of the task to be executed and Table 1, and sends task T8 to the computing unit. After task T8 is executed, all the tasks in the first task graph template have been executed.
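The walkthrough above can be condensed into a small event-driven loop. The sketch below is a hypothetical software model of the hardware flow, with the two synchronization tables reconstructed from the steps above for the task graph shown in (a) of FIG. 7; it assumes that each task's completion event is keyed by the task name and that ready tasks are dispatched and completed in FIFO order.

```python
from collections import deque

# First synchronization information table (reconstructed from the walkthrough):
# completion event of a task -> barriers whose counters are incremented.
event_to_barriers = {
    "T1": ["b2"], "T2": ["b3", "b5"], "T3": ["b4", "b5"],
    "T4": ["b4"], "T5": ["b6"], "T6": ["b6"], "T7": ["b7"],
}
# Second synchronization information table (reconstructed):
# barrier -> (trigger value, tasks to dispatch when the value is reached).
triggers = {
    "b2": (1, ["T2", "T3"]), "b3": (1, ["T4"]), "b5": (2, ["T6"]),
    "b4": (2, ["T5"]), "b6": (2, ["T7"]), "b7": (1, ["T8"]),
}

counters = {b: 0 for b in triggers}   # every barrier counter starts at 0
ready, done = deque(["T1"]), []       # T1 is the first task of the graph

while ready:
    task = ready.popleft()                       # computing unit "executes" the task
    done.append(task)
    for b in event_to_barriers.get(task, []):    # event parsing circuit
        counters[b] += 1                         # synchronous counting circuit
        value, tasks = triggers[b]
        if counters[b] == value:                 # task state machine checks the trigger
            ready.extend(tasks)

print(done)  # one valid schedule that respects all dependencies
```

Under the FIFO assumption the model produces the order T1, T2, T3, T4, T6, T5, T7, T8, which matches the ordering of events in the walkthrough above.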
  • the first synchronization information table corresponding to the task graph template shown in (b) in FIG. 7 is shown in Table 7 below.
  • the task graph control circuit 802 receives the task graph template from the CPU through the first interface 801 and stores it; the template is described by the tables above.
  • the task map control circuit 802 receives the task information of the first task map from the CPU through the first interface 801, and the task map template corresponding to the first task map is shown in (b) of FIG. 7 .
  • after the computing unit executes the task T1, it sends an Event1 indicating that the execution of the task T1 is completed to the first interface 801.
  • the first interface 801 parses Event1 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event1, looks up Table 7 to determine that the identifier of the barrier corresponding to Event1 is b1, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b1.
  • the synchronous counting circuit 806 modifies the value of b1 to 1, and notifies the task state machine 803 of the identity of b1.
  • the task state machine 803 obtains task T2 and task T3 from the task graph control circuit 802 based on the identifiers of the tasks to be executed and Table 1, and sends task T2 and task T3 to computing unit 1 and computing unit 2.
  • the computing unit executes task T2 and task T3 in parallel; after it finishes them, it sends Event2 and Event3, indicating that execution of task T2 and task T3 is complete, to the first interface 801.
  • the first interface 801 parses Event2 and Event3 and sends them to the event parsing circuit 805; the event parsing circuit 805 receives Event2 and Event3, looks up Table 7 to determine that the barriers corresponding to Event2 are identified as b1 and b2 and the barriers corresponding to Event3 are identified as b2 and b3, and notifies the synchronous counting circuit 806 to modify the values of the counters corresponding to b1, b2 and b3.
  • the synchronous counting circuit 806 modifies the value of b1 to 2, modifies the value of b2 to 2, modifies the value of b3 to 1, and notifies the task state machine 803 of the identities of b1, b2 and b3.
  • the task state machine 803 obtains task T4 and task T6 from the task graph control circuit 802 based on the identifiers of the tasks to be executed and Table 1, and sends task T4 and task T6 to the computing unit.
  • after the computing unit executes the task T4, it sends an Event4 indicating that the execution of the task T4 is completed to the first interface 801.
  • the first interface 801 parses Event4 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event4, looks up Table 7 to determine that the identifier of the barrier corresponding to Event4 is b3, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b3.
  • the synchronous counting circuit 806 modifies the value of b3 to 2, and notifies the task state machine 803 of the identity of b3.
  • after the computing unit executes the task T6, it sends an Event6 indicating that the execution of the task T6 is completed to the first interface 801.
  • the first interface 801 parses Event6 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event6, looks up Table 7 to determine that the identifier of the barrier corresponding to Event6 is b4, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b4.
  • the computing unit may execute task T4 and task T6 in parallel.
  • after the computing unit executes the task T5, it sends an Event5 indicating that the execution of the task T5 is completed to the first interface 801.
  • the first interface 801 parses Event5 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event5, looks up Table 7 to determine that the identifier of the barrier corresponding to Event5 is b4, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b4.
  • the synchronous counting circuit 806 modifies the value of b4 to 2, and notifies the task state machine 803 of the identity of b4.
  • the task state machine 803 obtains task T7 from the task graph control circuit 802 based on the identifier of the task to be executed and Table 1, and sends task T7 to the computing unit.
  • after the computing unit executes the task T7, it sends an Event7 indicating that the execution of the task T7 is completed to the first interface 801.
  • the first interface 801 parses Event7 and sends it to the event parsing circuit 805; the event parsing circuit 805 receives Event7, looks up Table 7 to determine that the identifier of the barrier corresponding to Event7 is b4, and notifies the synchronous counting circuit 806 to modify the value of the counter corresponding to b4.
  • the synchronous counting circuit 806 modifies the value of b4 to 3, and notifies the task state machine 803 of the identity of b4.
  • the task state machine 803 obtains task T8 from the task graph control circuit 802 based on the identifier of the task to be executed and Table 1, and sends task T8 to the computing unit. After task T8 is executed, all the tasks in the first task graph template have been executed.
  • the task scheduling device provided by the embodiment of the present application supports built-in static task graph templates, so that when the task scheduling device executes multiple task graphs with the same processing methods and dependency relationships, the CPU does not need to initialize the processing methods and dependencies of each task graph into the task scheduling device again, which reduces task graph initialization time. In other words, by creating a task graph template once, the task scheduling device can repeatedly execute multiple task graphs whose processing methods and dependency relationships are the same as those of the template.
  • multiple tasks in the task graph template provided by the embodiment of the present application can reuse barriers, which can reduce the number of counters in the synchronous counting circuit, thereby reducing the area of the task scheduling device and improving the scalability of the chip.
  • the second synchronization information table can be divided into a first sub-information table and a second sub-information table; the first sub-information table is stored in the task scheduling device, and the second sub-information table is stored in memory.
  • the first sub-information table includes multiple barriers, a first trigger condition corresponding to each barrier, and an identifier of a task to be executed when each barrier satisfies its corresponding first trigger condition.
  • the second sub-information table includes a plurality of barriers, other trigger conditions corresponding to each barrier, and identifiers of tasks to be executed when each barrier satisfies its corresponding other trigger conditions. For the same barrier, the trigger sequence of its corresponding first trigger condition is earlier than the trigger sequence of its corresponding other trigger conditions.
  • the first sub-information table and the second sub-information table are shown in Table 8 and Table 9 respectively.
  • the first sub-information table may be stored in the cache of the task scheduling device, and the second sub-information table may be stored in a memory (for example, double data rate (DDR) synchronous dynamic random access memory); this memory is not inside the task scheduling device but is external to it. It can be understood that in the embodiment of the present application, by storing part of the trigger conditions in the DDR, the chip area of the task scheduling device can be reduced.
  • a memory for example, double data rate (double data rate, DDR) synchronous dynamic random access memory
  • when the counter corresponding to a barrier is incremented, the value at which the barrier meets the first trigger condition can be smaller than the values at which it meets the other trigger conditions, so that the first trigger condition fires before the other trigger conditions.
  • conversely, when the counter corresponding to a barrier is decremented, the value at which the barrier meets the first trigger condition can be greater than the values at which it meets the other trigger conditions, again so that the first trigger condition fires before the other trigger conditions.
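Both conventions can be checked numerically. The sketch below is an illustrative model (not part of the patent) that walks a counter from its start value and records the order in which the trigger values are reached, for an incrementing and a decrementing counter:

```python
def firing_order(start, step, trigger_values):
    """Return the order in which the given trigger values are reached as a
    barrier counter moves from `start` in steps of `step`."""
    value, order = start, []
    remaining = set(trigger_values)
    while remaining:
        value += step
        if value in remaining:
            order.append(value)
            remaining.remove(value)
    return order

# Incrementing counter (starts at 0, counts up): the first trigger condition
# must use the smaller value in order to fire before the others.
print(firing_order(0, +1, {1, 3}))   # smaller value is reached first
# Decrementing counter (starts at 4, counts down): the first trigger condition
# must use the larger value in order to fire before the others.
print(firing_order(4, -1, {1, 3}))   # larger value is reached first
```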
  • the multiple trigger conditions in the above-mentioned second sub-information table may be arranged in sequence according to the trigger sequence.
  • the task graph control circuit 802 is also used to, when the value of a barrier satisfies its corresponding first trigger condition, read the next other trigger condition from the memory according to the trigger sequence of the multiple other trigger conditions corresponding to that barrier in the second sub-information table, and replace the first trigger condition corresponding to the barrier with that other trigger condition.
  • the next other trigger condition immediately follows the first trigger condition in the trigger sequence; that is, it is the next trigger condition of the first barrier that will be triggered after the value of the first barrier satisfies the first trigger condition, for example, the second trigger condition.
  • the task graph control circuit 802 is also used to, when the value of the barrier satisfies the second trigger condition in the cache, read the third trigger condition from the memory according to the trigger sequence of the multiple other trigger conditions corresponding to the barrier in the second sub-information table, and replace the second trigger condition in the cache with the third trigger condition, and so on, until all of the trigger conditions corresponding to the same barrier have been traversed.
  • the three trigger conditions are, in trigger sequence, the first trigger condition, the second trigger condition and the third trigger condition (the second and third trigger conditions being the other trigger conditions described above); the cache of the task scheduling device stores the first trigger condition, and the DDR stores the second trigger condition and the third trigger condition.
  • the task map control circuit 802 reads the second trigger condition from the DDR, and replaces the first trigger condition in the cache with the second trigger condition.
  • the task graph control circuit 802 reads the next other trigger condition (that is, the third trigger condition) from the DDR, and replaces the second trigger condition in the cache with the third trigger condition.
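The cache/DDR interplay above amounts to a one-slot cache sliding over an ordered list of trigger conditions. The following is a hypothetical software analogy of that procedure; the names `cache` and `ddr` are illustrative, not the patent's:

```python
# Ordered trigger conditions for one barrier; only the head lives in the
# scheduler's cache, the rest wait in external memory (the DDR in the text).
ddr = ["second trigger condition", "third trigger condition"]
cache = "first trigger condition"

fired = []
while cache is not None:
    fired.append(cache)                    # barrier value satisfied the cached condition
    cache = ddr.pop(0) if ddr else None    # control circuit loads the next condition

print(fired)  # every condition occupied the cache exactly once, in trigger order
```

Only one condition per barrier ever occupies on-chip storage, which is the area saving the text describes.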
  • this solution can reduce the chip area of the task scheduling device and improve the scalability of the chip.
  • the embodiment of the present application also provides a computing device.
  • the computing device includes a central processing unit CPU and a task scheduling device shown in FIG. 8 , and the CPU is configured to send a task graph template to the task scheduling device.
  • the computing device may also include an enhanced message service (EMS) module and a computing unit
  • the EMS is used to receive tasks to be executed from the task scheduling device, and distribute the tasks to be executed to the computing unit .
  • the computing unit is used to execute the task to be executed.
  • the computing unit can be an accelerator or a processor.
  • the EMS is a hardware queue management and load balancing module, which is used to evenly distribute tasks to be executed to computing units.
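The EMS itself is a hardware module, but its "even distribution" behavior can be pictured as round-robin dispatch. The toy software analogy below uses made-up unit and task names to illustrate the idea:

```python
from itertools import cycle

computing_units = ["unit0", "unit1", "unit2"]
dispatcher = cycle(computing_units)   # round-robin over the available units

# Assign each incoming ready task to the next unit in the cycle.
assignment = {task: next(dispatcher)
              for task in ["T1", "T2", "T3", "T4", "T5", "T6"]}

# Each unit ends up with the same number of tasks when the count divides evenly.
load = {u: sum(1 for v in assignment.values() if v == u) for u in computing_units}
print(load)
```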
  • an embodiment of the present application further provides a task scheduling method, as shown in FIG. 10 , the task scheduling method is applied to the task scheduling device shown in FIG. 8 , and the task scheduling method includes the following steps:
  • the task scheduling device acquires task information of a first task graph.
  • the task information of the first task map includes input data of the first task map and a task map template identifier corresponding to the first task map.
  • the task scheduling device includes one or more task graph templates.
  • the task graph templates in the task scheduling device may be task graph templates received from the CPU, or task graph templates preset in the task scheduling device.
  • the data structure of the task graph template may be described by using the above three tables, the task information table, the first synchronization information table, and the second synchronization information table.
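One hypothetical in-memory layout for these three tables is sketched below; the field and identifier names are illustrative, not taken from the patent, and the point is only that the template holds all static information while each task graph carries only its dynamic data:

```python
# A task graph template as three tables (a hypothetical in-memory layout).
task_graph_template = {
    # Task information table: task identifier -> processing method
    "task_info": {"T1": "add", "T2": "sub"},
    # First synchronization information table: completion event -> barrier ids
    "event_to_barriers": {"Event1": ["b1"]},
    # Second synchronization information table:
    # barrier -> (trigger value, tasks to execute when the value is reached)
    "triggers": {"b1": (1, ["T2"])},
}

# Task information of a concrete task graph carries only dynamic data:
first_task_graph = {"template_id": "tmpl0", "inputs": {"T1": (1, 2), "T2": (4, 3)}}

# The static template is looked up by identifier and reused for many graphs.
templates = {"tmpl0": task_graph_template}
template = templates[first_task_graph["template_id"]]
print(sorted(template.keys()))
```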
  • step S1001 may be executed by the task graph control circuit 802 in the task scheduling device shown in FIG. 8, which obtains the task information of the first task graph through the first interface 801.
  • step S1002 may be executed by the task map control circuit 802 in the task scheduling device shown in FIG. 8 .
  • the task graph control circuit 802 may determine the task graph template corresponding to the first task graph from among the multiple task graph templates it stores.
  • step S1003 may include the following steps:
  • when the task state machine determines that the value of the first barrier satisfies its corresponding first trigger condition, it obtains the first task corresponding to the first task identifier from the task graph control circuit according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sends the first task to the computing unit through the second interface.
  • the first task identifier is an identifier of a task to be executed when the value of the first barrier satisfies the first trigger condition.
  • the task graph control circuit 802 may send a first-task trigger signal to the task state machine 803; the first-task trigger signal instructs the task state machine 803 to query the second synchronization information table and determine whether the initial value of the barrier (the barrier for which, in the second synchronization information table, the task to be executed when its value satisfies the trigger condition is the first task) satisfies the trigger condition corresponding to the first task.
  • when the task state machine 803 determines that the initial value of the barrier satisfies the trigger condition corresponding to the first task, the task state machine 803 acquires the task content of the first task from the task graph control circuit 802 according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sends the task content of the first task to the computing unit through the second interface 804.
  • the task state machine 803 may query the second synchronization information table when the value of the barrier is updated to determine whether the value of the barrier meets its corresponding trigger condition.
  • the computing unit executes the multiple first tasks in parallel.
  • multiple tasks in the task map template can reuse the same barrier.
  • Multiple tasks reuse the same barrier means that the triggering of multiple tasks can depend on the same barrier. That is, in the second synchronization information table, if multiple tasks reuse the same barrier, when the value of the barrier satisfies one or more trigger conditions, the corresponding tasks to be executed are the multiple tasks.
  • the relevant description about multiple tasks multiplexing the same barrier reference may be made to the foregoing embodiments, and details are not repeated here.
  • the event parsing circuit receives the first event through the first interface, determines the identifier of the second barrier corresponding to the first event based on the first synchronization information table, and notifies the synchronization counting circuit to modify the value of the counter corresponding to the second barrier.
  • the first event is used to indicate that the execution of the first task is completed.
  • there may be one second barrier or multiple second barriers corresponding to the first event.
  • the second barrier may be the same as or different from the first barrier.
  • the computing unit may send a first event indicating the completion of the first task to the first interface, and the first interface parses the first event and routes the first event to the event parsing circuit 805.
  • the event parsing circuit 805 queries the first synchronization information table based on the first event identifier, determines the identifiers of one or more second barriers corresponding to the first event, and notifies the synchronization counting circuit 806 to modify the value of the counter corresponding to the second barrier .
  • the synchronous counting circuit modifies the value of the counter corresponding to the second barrier; after this modification, the value of the second barrier is updated.
  • the synchronous counting circuit 806 modifies the value of the counter corresponding to the second barrier, it may notify the task state machine 803 of the identity of the second barrier.
  • the task state machine 803 judges whether the value of the second barrier meets its corresponding trigger condition based on the second synchronization information table, and continues to execute the above steps S10031-S10033 when the value of the second barrier meets its corresponding trigger condition, until all tasks in the first task graph are executed.
  • step S1003 can also include:
  • when the value of the first barrier satisfies its corresponding first trigger condition, the task graph control circuit reads the next other trigger condition from the memory according to the trigger sequence of the other trigger conditions corresponding to the first barrier in the second sub-information table, and replaces the trigger condition corresponding to the first barrier in the task scheduling device with that next other trigger condition.
  • the trigger sequence of the next other trigger condition is immediately after the first trigger condition. That is, the next other trigger condition is the next trigger condition that will be triggered by the first barrier after the value of the first barrier satisfies the first trigger condition.
  • the next other trigger condition is the second trigger condition corresponding to the first barrier.
  • the next other trigger condition is the third trigger condition corresponding to the first barrier.
  • the above step S10034 may include: when the value of the first barrier satisfies the first trigger condition corresponding to the first barrier in the cache, the task graph control circuit reads the next other trigger condition from the memory according to the second sub-information table, and replaces the first trigger condition corresponding to the first barrier in the cache with the next other trigger condition.
  • alternatively, the above step S10034 may include: when the value of the first barrier satisfies the first trigger condition corresponding to the first barrier in the cache, the task graph control circuit reads the next other trigger condition from the memory according to the trigger sequence of the other trigger conditions corresponding to the first barrier in the second sub-information table, and replaces the first trigger condition corresponding to the first barrier in the cache with the next other trigger condition.
  • the three trigger conditions are the first trigger condition, the second trigger condition and the third trigger condition (the second and third trigger conditions being the other trigger conditions described above); the first trigger condition is stored in the cache of the task scheduling device, and the second trigger condition and the third trigger condition are stored in the DDR.
  • the task graph control circuit 802 reads the second trigger condition corresponding to the first barrier from the DDR, and replaces the first trigger condition corresponding to the first barrier in the cache with the second trigger condition.
  • the task graph control circuit 802 reads the third trigger condition corresponding to the first barrier from the DDR, and replaces the second trigger condition corresponding to the first barrier in the cache with the third trigger condition.
  • this solution can reduce the chip area of the task scheduling device and improve the scalability of the chip.
  • step S10034 may be performed after step S10031, or may be performed simultaneously with step S10031, which is not limited in this embodiment of the present application.
  • by creating a task graph template once, the embodiment of the present application can repeatedly execute multiple task graphs with the same processing methods and dependency relationships as the task graph template; when those task graphs are subsequently executed, there is no need to load the static processing methods and dependency relationships into the task scheduling device again, which saves the time of loading them into the task scheduling device and improves computing efficiency.
  • multiple tasks in the task graph template provided by the embodiment of the present application can reuse barriers, which can reduce the number of counters in the synchronous counting circuit, thereby reducing the area of the task scheduling device and improving the scalability of the chip.
  • the embodiment of the present application further stores the first trigger condition in the cache, stores other trigger conditions in the DDR, and dynamically replaces the trigger conditions in the cache, so that the trigger conditions can be sequentially loaded into the cache. This solution can reduce the The chip area of the task scheduling device improves the scalability of the chip.
  • the steps of the methods or algorithms described in conjunction with the disclosure of this application can be implemented in the form of hardware, or can be implemented in the form of a processor executing software instructions.
  • the software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (random access memory, RAM), flash memory, erasable programmable read-only memory (erasable programmable ROM, EPROM), electrically erasable Programmable read-only memory (electrically EPROM, EEPROM), registers, hard disk, removable hard disk, CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may also be a component of the processor.
  • the processor and storage medium can be located in the ASIC.
  • the ASIC may be located in the core network interface device.
  • the processor and the storage medium may also exist in the core network interface device as discrete components.
  • the functions described in the present invention may be implemented by hardware, software, firmware or any combination thereof.
  • the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
  • Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a general purpose or special purpose computer.


Abstract

A task scheduling method and device, relating to the field of computer technologies, which alleviate the problem that each time the scheduling device executes a task graph, the CPU must re-initialize the dependency relationships of that task graph into the scheduling device, resulting in a long initialization time and reduced computing efficiency. The method is as follows: the task scheduling device includes one or more task graph templates, each task graph template indicating the dependency relationships among the multiple tasks it includes as well as the processing method of each task; the task scheduling device is configured to: obtain the input data of a first task graph and the task graph template identifier corresponding to the first task graph; determine, based on that identifier, the task graph template corresponding to the first task graph from among the one or more task graph templates; and schedule the first task graph based on the input data of the first task graph and the corresponding task graph template.

Description

A task scheduling method and device

Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular, to a task scheduling method and device.
Background
At present, in order to improve the computing capability of a computing device, the computing device can generally adopt a multi-core heterogeneous system architecture, which includes a multi-core central processing unit (CPU) and an accelerator; the multi-core CPU is used to perform general computing tasks, and the accelerator is used to perform specialized computing tasks.
In a multi-core heterogeneous system, for a relatively complex task graph there may be mutual dependencies among multiple tasks (for example, the execution of one task depends on the computation result of another). To reduce the difficulty of parallel programming, scheduling software can determine the dependency relationships among multiple tasks from their inputs and outputs, and schedule a ready task onto a CPU core or accelerator capable of executing it. However, in a multi-core heterogeneous system, for multiple task graphs with identical dependency relationships, each time a task graph is executed the CPU must re-initialize that task graph's dependency relationships into the scheduling device, and the scheduling device then maintains the dependencies among the multiple tasks of the graph to ensure that computation proceeds correctly. As a result, when the scheduling device schedules a task graph and parses its dependency relationships, the initialization time is long and the interaction between the CPU and the scheduling device is redundant, which reduces computing efficiency.
Summary
The embodiments of the present application provide a task scheduling method and device that can save the time required to load dependency relationships into a task scheduling device and improve computing efficiency.
To achieve the above objective, the embodiments of the present application adopt the following technical solutions:
In a first aspect of the embodiments of the present application, a task scheduling device is provided. The task scheduling device includes one or more task graph templates, each task graph template indicating the dependency relationships among the multiple tasks it includes and the processing method of each task. The task scheduling device is configured to: obtain task information of a first task graph, the task information including the input data of the first task graph and the task graph template identifier corresponding to the first task graph; determine, based on that identifier, the task graph template corresponding to the first task graph from among the one or more task graph templates included in the task scheduling device; and schedule the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph.
Optionally, multiple task graphs with identical processing methods and dependency relationships correspond to the same task graph template. For example, task graph 1 is used to compute (1+2)*(4-3) and task graph 2 is used to compute (5+6)*(8-7); the dependency relationships among the multiple tasks in task graph 1 are the same as those among the multiple tasks in task graph 2, and the computation methods of the tasks in the two graphs are also the same. Therefore task graph 1 and task graph 2 share the same task graph template, (a+b)*(c-d); the difference between them lies only in their input data.
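The (a+b)*(c-d) example can be sketched in a few lines: the template fixes the operations and their dependency order once, and each task graph supplies only fresh input data. This is a toy illustration of the idea, not the hardware flow:

```python
def template(a, b, c, d):
    """Static template (a+b)*(c-d): the processing methods and the
    dependency order are fixed; only the inputs vary per task graph."""
    t1 = a + b          # task: addition
    t2 = c - d          # task: subtraction
    return t1 * t2      # task: multiplication, depends on t1 and t2

graph1 = template(1, 2, 4, 3)   # task graph 1: (1+2)*(4-3)
graph2 = template(5, 6, 8, 7)   # task graph 2: (5+6)*(8-7)
print(graph1, graph2)
```

Defining `template` once and calling it with different inputs mirrors creating a task graph template once and scheduling many task graphs against it.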
可选的,任务图模板包括的多个任务可以串行执行,也可以并行执行。例如,任务图模板包括的多个任务中,一部分任务在执行时可以串行执行,一部分任务在执行时可以并行执行。
基于本方案,由于任务调度装置支持将静态的任务图模板内置在该任务调度装置中,从而在后续执行与该任务图模板的依赖关系和处理方式均相同的多个任务图时,无需再次将依赖关系和处理方式初始化到任务调度装置中,因此后续执行该多个任务图时只获取动态的输入数据以及要用的任务图模板的标识即可,能够节省依赖关系载入到任务调度装置的时间。也就是说,本申请实施例提供的任务调度装置,通过创建一次任务图模板,就可以重复多次执行与该任务图模板的处理方式和依赖关系均相同的任务图,而且在后续执行该多个任务图时,无需再次将静态的处理方式和依赖关系载入任务调度装置,能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。
在一种可能的实现方式中,上述任务调度装置,还用于获取一个或多个任务图模板;每个任务图模板包括任务信息表、第一同步信息表和第二同步信息表;其中,任务信息表包括多个任务标识,以及每个任务标识对应的处理方式;第一同步信息表包括多个事件,以及每个事件对应的一个或多个屏障barrier的标识,多个事件与多个任务一一对应,每个事件用于指示其对应的任务执行完成;第二同步信息表包括多个barrier、每个barrier对应的一个或多个触发条件、以及每个barrier满足其对应的触发条件时的待执行任务标识。
可选的,任务调度装置获取的任务图模板可以是CPU发送给任务调度装置的,也可以是预置在任务调度装置中的,本申请实施例对此并不限定。
基于本方案,每个任务图模板的数据结构可以采用任务信息表、第一同步信息表和第二同步信息表这三张表来描述,任务调度装置获取任务图模板即为获取该任务图模板对应的任务信息表、第一同步信息表和第二同步信息表,从而使得任务调度装置后续调度任务图时可以基于这三张表进行调度,以确保任务图模板中多个任务的计算能够正常进行。而且任务调度装置在后续调度处理方式和依赖关系均与任务图模板相同的多个任务图时,可以直接基于该任务图模板调度任务,无需再次将处理方式和依赖关系载入任务调度装置,能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。
在另一种可能的实现方式中,上述任务调度装置包括耦合连接的第一接口、任务图控制电路、任务状态机,以及第二接口;其中,任务图控制电路,用于通过第一接口获取任务图模板,以及第一任务图的任务信息;任务状态机,用于基于第二同步信息表,在确定第一barrier的值满足其对应的第一触发条件时,根据第一任务标识、第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路获取第一任务标识对应的第一任务,并通过第二接口向计算单元发送第一任务;该第一任务标识为第一barrier的值满足第一触发条件时待执行的任务的标识。
可选的,在第二同步信息表包括首任务的触发条件(比如,首个任务T1对应的触发条件b1=0)时,对于第一任务图模板中的首个任务,可以由任务图控制电路向任务状态机发送首任务触发信号,该首任务触发信号用于指示任务状态机查询第二同步信息表,确定b1的值是否满足首任务对应的触发条件。在任务状态机确定b1的值满足首任务对应的触发条件时,任务状态机根据首任务标识,第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路获取首任务的任务内容,并将该首任务的任务内容通过第二接口发送给计算单元。
可选的,在第二同步信息表不包括首任务的触发条件时,对于第一任务图模板中的首个任务,也可以由任务图控制电路向任务状态机发送首任务执行信号,任务状态机根据该首任务执行信号,从任务图控制电路获取首任务的任务内容,并将该首任务的任务内容通过第二接口发送给计算单元。
可选的,对于首任务之后的其他任务,任务状态机可以在barrier的值更新时,查询第二同步信息表,确定该barrier的值是否满足其对应的触发条件。
基于本方案,通过任务状态机基于第二同步信息表,确定每个barrier的值是否满足其对应的触发条件,并在barrier的值满足触发条件时,确定待执行任务标识,并从任务图控制电路获取该待执行任务的任务内容,并将该任务发给计算单元。从而能够确保任务调度装置调度任务图中的每个任务时都能按照其对应的任务图模板进行调度,而且任务调度装置在调度任务时,可以依据任务调度装置中的任务图模板获取处理方式,不需要将任务的处理方式重新载入任务调度装置,因此能够节省将静态的处理方式载入任务调度装置的时间,提升计算效率。
在又一种可能的实现方式中,上述第一任务为多个时,计算单元并行执行多个第一任务。
基于本方案,通过将触发条件相同的多个任务并行执行,能够维持好多个任务之间的依赖关系,提升计算效率。
在又一种可能的实现方式中,上述任务调度装置还包括耦合连接的事件解析电路和同步计数电路,该同步计数电路包括多个计数器,每个计数器对应一个barrier;事件解析电路,用于在第一任务执行完成的情况下,通过第一接口接收第一事件,并基于第一同步信息表确定第一事件对应的第二barrier的标识,通知同步计数电路修改第二barrier对应的计数器的值;其中,该第一事件用于指示第一任务执行完成;同步计数电路,用于修改第二barrier对应的计数器的值。
可选的,同步计数电路修改第二barrier对应的计数器的值时,可以将该第二barrier对应的计数器的值加一,也可以将该第二barrier对应的计数器的值减一,还可以将该第二barrier对应的计数器的值加上或减去其他数值。实际应用中,同步计数电路修改barrier对应的计数器的值时,是将计数器的值增大(例如,加一)还是减小(例如,减一),与该计数器的初始值有关。
示例性的,在第一任务执行完成后,通过向任务调度装置发送指示第一任务执行完成的第一事件,第一接口解析该第一事件,并向事件解析电路发送该第一事件。事件解析电路接收该第一事件,并查询第一同步信息表,确定第一事件对应的第二barrier的标识并通知同步计数电路修改该第二barrier对应的计数器的数值。同步计数电路修改第二barrier对应的计数器的数值后,向任务状态机通知该第二barrier的标识。任务状态机基于第二同步信息表判断该第二barrier的值是否满足其对应的触发条件,在该第二barrier的值满足其对应的触发条件的情况下,从任务图控制电路获取下一个待执行的任务,并向计算单元发送该任务。直至第一任务图中的所有任务执行完毕。
基于本方案,通过第一同步信息表和第二同步信息表,能够正确的维护好任务图中多个任务的依赖关系,以确保各个任务正常执行。该方案在维护任务图中多个任务的依赖关系时,无需再次将依赖关系载入任务调度装置中,直接依据任务调度装置中的第一同步信息表和第二同步信息表即可,因此能够节省将静态的依赖关系载入任务调度装置的时间,提升计算效率。
在又一种可能的实现方式中,上述任务图控制电路,还用于修改或删除任务图模板。
基于本方案,任务调度装置可以对其存储的多个任务图模板进行修改、删除和新增,从而使得任务调度装置中的任务图模板更灵活,能够适用于更多场景。
在又一种可能的实现方式中,上述任务图模板包括第一任务和第二任务,该第一任务和第二任务复用同一个barrier。
可选的,多个任务复用同一个barrier是指该多个任务的触发可以依赖于同一个barrier。即在第二同步信息表中,当多个任务复用同一个barrier时,该barrier的值满足一个或多个触发条件时,对应的待执行任务为该多个任务。
基于本方案,任务图模板中的多个任务可以复用同一个barrier。由于一个barrier的值可以通过一个计数器维护,当多个任务复用同一个barrier时,可以减少计数器的数量,从而减小芯片面积。
在又一种可能的实现方式中,第一任务和第二任务满足以下情况中的至少一种:第一任务和第二任务均没有父节点;或者,第一任务和第二任务具有相同的父节点;或者,第一任务为第二任务唯一的父节点;或者,第一任务和第二任务的根节点复用同一个barrier,且第一任务为第二任务唯一的父节点。
需要说明的是,一个任务图模板中的多个任务如果满足以上四种情况中的任一种或多种,该多个任务可以复用同一个barrier。实际应用中,多个任务复用同一个barrier的情况不限于上述四种情况,具体可以根据任务图模板中多个任务的依赖关系,确定多个任务是否能复用同一个barrier。
基于本方案,通过多个任务复用同一个barrier时,可以减少计数器的数量,从而减小芯片面积。
在又一种可能的实现方式中,一个barrier对应多个触发条件,该多个触发条件包括首个触发条件和其他触发条件,该首个触发条件的触发顺序早于其他触发条件的触发顺序;第二同步信息表包括第一子信息表和第二子信息表,第一子信息表包括多个barrier,每个barrier对应的首个触发条件,以及每个barrier满足其对应的首个触发条件时的待执行的任务的标识;第二子信息表包括多个barrier,每个barrier对应的其他触发条件,以及每个barrier满足其对应的其他触发条件时的待执行的任务的标识。
可选的,上述第一子信息表存储在任务调度装置的缓存cache中,第二子信息表存储在内存中。
基于本方案,通过将每个barrier对应的首个触发条件存储在任务调度装置的cache中,将每个barrier对应的其他触发条件存储在DDR中,能够减小任务调度装置的芯片面积。
在又一种可能的实现方式中,在barrier对应的其他触发条件为多个的情况下,第二子信息表中该barrier对应的多个其他触发条件按触发顺序先后依次排列;任务图控制电路,还用于在barrier的值满足其对应的首个触发条件时,按照第二子信息表中该barrier对应的多个其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将该barrier对应的首个触发条件替换为该其他触发条件。
可选的,barrier对应的多个触发条件中,该下一个其他触发条件的触发顺序紧接着第一触发条件。即该下一个其他触发条件为第一barrier的值满足第一触发条件之后,下一个会被第一barrier触发的触发条件。比如第二个触发条件。
可选的,任务图控制电路将cache中存储的barrier对应的首个触发条件替换为该barrier对应的第二个触发条件后,还用于在barrier的值满足cache中的第二个触发条件时,按照第二子信息表中该barrier对应的多个其他触发条件的触发顺序,从内存中读取第三个触发条件,并将cache中的第二个触发条件替换为该第三个触发条件。以此类推,直至同一个barrier对应的多个触发条件全部遍历完。
基于本方案,通过将首个触发条件存储在任务调度装置的cache中,将其他触发条件存储在DDR中,并通过动态的替换cache中的触发条件,可以依次将触发条件载入cache,该方案能够减小任务调度装置的芯片面积,提高芯片的可扩展性。
本申请实施例的第二方面,提供一种任务调度方法,应用于任务调度装置,该任务调度装置包括一个或多个任务图模板,每个任务图模板用于指示该任务图模板包括的多个任务之间的依赖关系,以及每个任务的处理方式;该方法包括:任务调度装置获取第一任务图的任务信息;第一任务图的任务信息包括第一任务图的输入数据和第一任务图对应的任务图模板标识。任务调度装置基于第一任务图对应的任务图模板标识,在任务调度装置包括的一个或多个任务图模板中确定第一任务图对应的任务图模板。任务调度装置基于第一任务图的输入数据和第一任务图对应的任务图模板,对第一任务图进行调度。
在一种可能的实现方式中,上述方法还包括:任务调度装置获取一个或多个任务图模板;每个任务图模板包括任务信息表、第一同步信息表和第二同步信息表;其中,任务信息表包括多个任务标识,以及每个任务标识对应的处理方式;第一同步信息表包括多个事件,以及每个事件对应的一个或多个屏障barrier的标识,多个事件与多个任务一一对应,每个事件用于指示其对应的任务执行完成;第二同步信息表包括多个barrier、每个barrier对应的一个或多个触发条件、以及每个barrier满足其对应的触发条件时的待执行任务标识。
在另一种可能的实现方式中,上述任务调度装置包括耦合连接的第一接口、任务图控制电路、任务状态机,以及第二接口;任务调度装置获取任务图和第一任务图的任务信息,包括:任务图控制电路通过第一接口获取任务图,以及第一任务图的任务信息;任务调度装置基于第一任务图的输入数据和第一任务图对应的任务图模板,对第一任务图进行调度,包括:任务状态机基于第二同步信息表,在确定第一barrier的值满足其对应的第一触发条件时,根据第一任务标识、第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路获取第一任务标识对应的第一任务,并通过第二接口向计算单元发送第一任务;该第一任务标识为第一barrier的值满足第一触发条件时待执行的任务的标识。
在又一种可能的实现方式中,上述第一任务为多个时,计算单元并行执行多个第一任务。
在又一种可能的实现方式中,任务调度装置还包括耦合连接的事件解析电路和同步计数电路,该同步计数电路包括多个计数器,每个计数器对应一个barrier;任务调度装置基于第一任务图的输入数据和第一任务图对应的任务图模板,对第一任务图进行调度,还包括:事件解析电路在第一任务执行完成的情况下,通过第一接口接收第一事件,并基于第一同步信息表确定第一事件对应的第二barrier的标识,通知同步计数电路修改第二barrier对应的计数器的值;其中,该第一事件用于指示第一任务执行完成;同步计数电路修改第二barrier对应的计数器的值。
在又一种可能的实现方式中,上述方法还包括:任务图控制电路修改或删除任务图模板。
在又一种可能的实现方式中,上述任务图模板包括第一任务和第二任务,第一任务和第二任务复用同一个barrier。
在又一种可能的实现方式中,第一任务和第二任务满足以下情况中的至少一种:第一任务和第二任务均没有父节点;或者,第一任务和第二任务具有相同的父节点;或者,第一任务为第二任务唯一的父节点;或者,第一任务和第二任务的根节点复用同一个barrier,且第一任务为第二任务唯一的父节点。
在又一种可能的实现方式中,一个barrier对应多个触发条件,多个触发条件包括首个触发条件和其他触发条件,首个触发条件的触发顺序早于其他触发条件的触发顺序;第二同步信息表包括第一子信息表和第二子信息表,第一子信息表包括多个barrier,每个barrier对应的首个触发条件,以及每个barrier满足其对应的首个触发条件时的待执行的任务的标识;第二子信息表包括多个barrier,每个barrier对应的其他触发条件,以及每个barrier满足其对应的其他触发条件时的待执行的任务的标识。
在又一种可能的实现方式中,上述第一子信息表存储在任务调度装置的缓存cache中,上述第二子信息表存储在内存中。
在又一种可能的实现方式中,在barrier对应的其他触发条件为多个的情况下,第二子信息表中该barrier对应的多个其他触发条件按触发顺序先后依次排列;上述方法还包括:任务图控制电路在barrier的值满足其对应的首个触发条件时,按照第二子信息表中该barrier对应的多个其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将该barrier对应的首个触发条件替换为该其他触发条件。
上述第二方面以及第二方面的各种实现方式的效果描述可以参考第一方面相应效果的描述,在此不再赘述。
本申请实施例的第三方面,提供一种计算设备,所述计算设备包括中央处理器CPU,以及如上述第一方面所述的任务调度装置,所述CPU用于向所述任务调度装置发送所述任务图模板。
在一种可能的实现方式中,所述计算设备还包括增强型短消息服务EMS和计算单元,所述EMS用于接收来自所述任务调度装置的待执行任务,并将所述待执行任务分配给所述计算单元,所述计算单元用于执行所述待执行任务。
附图说明
图1为本申请实施例提供的一种调度装置的结构示意图;
图2为本申请实施例提供的另一种调度装置的结构示意图;
图3为本申请实施例提供的一种任务图模板的结构示意图;
图4为本申请实施例提供的另一种任务图模板的结构示意图;
图5为本申请实施例提供的又一种任务图模板的结构示意图;
图6为本申请实施例提供的一种多任务复用barrier时的任务图模板的结构示意图;
图7为本申请实施例提供的一种任务图模板中不复用barrier与复用barrier的结构示意图;
图8为本申请实施例提供的一种任务调度装置的结构示意图;
图9为本申请实施例提供的一种计算设备的结构示意图;
图10为本申请实施例提供的一种任务调度方法的流程示意图;
图11为本申请实施例提供的另一种任务调度方法的流程示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。在本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,a和b,a和c,b和c,或,a和b和c,其中a、b和c可以是单个,也可以是多个。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分,本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定。比如,本申请实施例中的第一子信息表中的“第一”和第二子信息表中的“第二”仅用于区分不同的子信息表。本申请实施例中出现的第一、第二等描述,仅作示意与区分描述对象之用,没有次序之分,也不表示本申请实施例中对设备个数的特别限定,不能构成对本申请实施例的任何限制。
需要说明的是,本申请中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
目前,在人工智能、高性能计算(high performance computing,HPC)等场景中,为了提高计算设备的计算能力,通常计算设备的计算架构可以为异构计算的硬件架构,该架构包括中央处理器(central processing unit,CPU),以及一个或多个加速器。CPU用于执行通用的计算任务,加速器用于执行专用的计算任务。该专用的计算任务可以包括人工智能(artificial intelligence,AI)处理,例如人工神经网络,机器学习(machine learning,ML)训练,ML优化/学习,推断,分类等操作,可视数据处理,网络数据处理,对象检测,规则分析,内容处理操作等。该加速器可以为神经网络处理器(neural-network process unit,NPU),可包括图形处理器(graphics processing unit,GPU),数字信号处理器(digital signal processor,DSP),片上系统(system on chip,SOC),现场可编程门阵列(Field-Programmable Gate Array,FPGA)、专用集成电路 (application specific integrated circuit,ASIC)等中的一个或多个。该加速器还可以为本申请下述实施例中的任务调度装置。
在异构系统中,对于较为复杂的任务图,多个任务之间可能存在相互依赖关系(比如,一个任务的执行依赖于另一个任务的计算结果)。为了减轻并行编程的难度,调度软件可以根据输入和输出确定多个任务之间的依赖关系,并将准备就绪的任务调度到可执行任务的CPU核或者加速器上执行。
例如,如图1所示的一种调度装置,该调度装置可以称为任务大师(Task Maestro),该调度装置调度任务的方法可以包括以下几个步骤:
1.主处理器核(Master Core)添加任务描述符(Task Description)到任务大师(Task Maestro);
2.任务大师(Task Maestro)将任务描述符存储至任务池(Task Pool);
3.检查模块(Check Deps)检查任务之间的依赖关系,将准备就绪的任务通过调度模块调度给计算单元(Worker Core)执行,并修改依赖关系表(Dependence Table);
4.检查模块(Check Deps)继续检查是否有任务准备就绪,并将准备就绪的任务继续调度给计算单元(Worker Core)执行,直到所有任务执行完成。
但是,图1所示的调度装置在调度任务时,对于处理方式和依赖关系相同的多个任务图,每次执行该任务图时,主处理器核Master Core都要重新将该任务图对应的任务描述符(任务描述符包括任务的处理方式和任务的依赖关系)初始化到调度装置中,再由调度装置维护该任务图中的多个任务之间的依赖关系,以确保计算正常进行。因此,图1所示的调度装置不支持将静态的依赖关系和处理方式内置在调度装置中,这将造成调度装置解析任务图的依赖关系时,初始化的时间较长,Master Core和调度装置之间交互冗余,影响计算效率的问题。
再例如,如图2所示的一种任务调度图(task scheduling graph,TSG)装置,该TSG装置包括任务库、事件计数器和刷新命令模块。该TSG装置调度任务的方法可以包括以下几个步骤:
1.主处理器核(Master Core)初始化任务描述符(Task Description),依赖关系表,以及事件计数器Counter;
2.TSG装置将准备就绪的任务调度到计算单元(Worker Core);
3.计算单元(Worker Core)执行完该任务后产生事件,TSG装置通过刷新命令模块,修改事件计数器Counter,事件计数器满足触发条件后,产生准备就绪的任务,TSG装置将该准备就绪的任务继续调度给计算单元(Worker Core);
4.直到所有任务执行完后,TSG装置通知主处理器核(Master Core),主处理器核(Master Core)重新初始化新的任务。
但是,图2所示的调度装置在调度任务时,对于处理方式和依赖关系相同的多个任务图,每次执行该任务图时,Master Core都要重新将该任务图对应的任务描述符初始化到TSG装置中,再由TSG装置通过事件计数器和刷新命令维护该任务图中的多个任务之间的依赖关系,以确保计算正常进行。因此,图2所示的调度装置不支持将静态的依赖关系和处理方式内置在调度装置中,每个任务图都要重新初始化到TSG装置,这将造成初始化时间较长,影响计算效率的问题。而且该方案中每个任务都配置一个计数器,因此计数器的数量较多,导致TSG装置的芯片面积较大。
为了缓解调度装置在调度处理方式和依赖关系相同的多个任务图时,每次CPU都要重新将该任务图对应的处理方式和依赖关系初始化到调度装置中,造成初始化时间较长,影响计算效率的问题,本申请实施例提供一种任务调度装置,该任务调度装置通过创建一次任务图模板,从而在后续执行与该任务图模板的处理方式和依赖关系均相同的多个任务图时,无需将静态的处理方式和依赖关系再次载入任务调度装置。即,本申请实施例提供的任务调度装置支持将静态的任务图模板内置,从而能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。
本申请实施例提供一种任务调度装置,该任务调度装置可以应用于通信处理器、HPC和AI计算等领域。该任务调度装置可以为通信设备或计算设备中的芯片。
任务调度装置,用于获取一个或多个任务图模板。可选的,任务调度装置获取的任务图模板可以是CPU发送给任务调度装置的,也可以是预置在任务调度装置中的,本申请实施例对此并不限定。
任务调度装置中的每个任务图模板用于指示该任务图模板包括的多个任务之间的依赖关系,以及每个任务的处理方式。
示例性的,任务图模板包括的多个任务中,如果一个任务的执行依赖于另一个任务的计算结果,那么这两个任务之间存在依赖关系。每个任务的处理方式可以包括每个任务的计算方式、数据复制方式、数据搬移方式等。
例如,以任务图模板包括任务1、任务2和任务3,任务图模板用于计算(a+b)*(c-d)为例,任务1的处理方式为a+b,任务2的处理方式为c-d,任务3的处理方式为将任务1的计算结果e与任务2的计算结果f相乘。该任务图模板中任务1和任务2执行完成之后,任务3才可以开始执行。即,任务3的执行依赖于任务1和任务2的计算结果。该任务图模板用于指示任务1、任务2和任务3的计算方式,以及任务1、任务2和任务3之间的依赖关系。
处理方式和依赖关系均相同的多个任务图对应的任务图模板相同。例如,任务图1用于计算(1+2)*(4-3),任务图2用于计算(5+6)*(8-7),任务图1中多个任务之间的依赖关系与任务图2中多个任务之间的依赖关系相同,而且任务图1中多个任务的计算方式与任务图2中多个任务的计算方式也相同。因此,任务图1和任务图2的任务图模板相同,均为(a+b)*(c-d)。任务图1和任务图2的区别在于两个任务图的输入数据不同。即本申请实施例中的任务调度装置支持将静态的任务图模板内置在该任务调度装置中,从而在后续执行该任务图模板对应的多个任务图时,无需将依赖关系和处理方式再次初始化到任务调度装置中,因此,后续执行该多个任务图时只获取动态的输入数据以及要用的任务图模板的标识即可,能够节省依赖关系载入到任务调度装置的时间。
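示例性的,上述“模板只创建一次、输入数据动态提供”的思想可以用如下Python代码示意。其中的类名、任务标识与数据结构均为便于说明而假设的简化写法,并非本申请实施例限定的硬件实现:

```python
class TaskGraphTemplate:
    """静态部分:多个任务的处理方式与依赖关系(假设的简化数据结构)。"""
    def __init__(self):
        # 任务标识 -> 处理方式(对应任务信息表中的function)
        self.ops = {
            "T1": lambda d: d["a"] + d["b"],
            "T2": lambda d: d["c"] - d["d"],
            "T3": lambda d: d["T1"] * d["T2"],  # 依赖T1、T2的计算结果
        }
        # 任务标识 -> 其父任务(依赖关系)
        self.deps = {"T1": [], "T2": [], "T3": ["T1", "T2"]}

    def run(self, inputs):
        """动态部分:每次执行只需提供输入数据。"""
        data = dict(inputs)
        done = set()
        while len(done) < len(self.ops):        # 简化的拓扑调度
            for t, parents in self.deps.items():
                if t not in done and all(p in done for p in parents):
                    data[t] = self.ops[t](data)
                    done.add(t)
        return data["T3"]

template = TaskGraphTemplate()                       # 模板只创建一次
r1 = template.run({"a": 1, "b": 2, "c": 4, "d": 3})  # (1+2)*(4-3) = 3
r2 = template.run({"a": 5, "b": 6, "c": 8, "d": 7})  # (5+6)*(8-7) = 11
```

可以看出,两次执行复用同一份静态的依赖关系与处理方式,每次只载入动态的输入数据。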
可选的,任务图模板包括的多个任务可以串行执行,也可以并行执行。例如,任务图模板包括的多个任务中,一部分任务在执行时可以串行执行,一部分任务在执行时可以并行执行,本申请实施例对于任务图模板中多个任务的具体执行方式并不限定。
本申请实施例中任务调度装置中的任务图模板可以用于任意多任务计算的场景,下面简单介绍三种场景下的任务图模板。
图3为一种无线通信中L2调度的任务图模板,如图3所示,在无线通信系统中的5G商用下行共享信道(downlink shared channel,DL-SCH)中,可以将其调度过程抽象为图3所示的任务图模板。其中,T0为传输时间间隔(transport time interval,TTI)中断定时触发任务,T1为小区级调度任务,T2为空域调度任务,T3为频域调度任务,T4为用户级后处理任务。主处理器核Master Core可以将图3所示的任务图模板载入到本申请的任务调度装置中,由本申请的任务调度装置完成多个任务的依赖关系的解析和调度,保证任务执行的并行度。
图4为一种HPC中矩阵计算的任务图模板,HPC中的矩阵计算可以抽象为图4所示的任务图模板,以矩阵计算为A[i]=(a[i]+b[i])*(c[i]+d[i])为例,T0为加法运算,T1为乘法运算。主处理器核Master Core将图4中的任务图模板抽象为两个T0和一个T1载入到本申请的任务调度装置中,由本申请的任务调度装置完成多个任务的依赖关系解析和调度,保证任务执行的并行度。可以理解的,HPC中的矩阵计算除乘法和加法外,还可以包括矩阵LU分解等多种矩阵计算,图4仅以任务图模板包括加法和乘法为例进行示意。
图5为一种AI计算场景中卷积神经网络(convolutional neural networks,CNN)的任务图模板,图5中的(a)所示的CNN的各个计算步骤可以抽象为图5中的(b)所示的任务图模板。其中,T1是前处理,T2、T3、T4是直接存储器访问(direct memory access,DMA)任务,T5是VADD,T6、T7是卷积CONV,T8是池化Pool,T9是DMA。主处理器核Master Core将图5所示的任务图模板载入到本申请的任务调度装置中,由本申请的任务调度装置完成多个任务的依赖关系解析和调度,保证任务执行的并行度。
任务调度装置中的任务图模板不限于上述三种场景的任务图模板,任何多任务计算的场景都可以将多个任务之间的依赖关系和处理方式抽象为任务图模板。
可选的,每个任务图模板可以包括任务信息表、第一同步信息表和第二同步信息表。本申请实施例中每个任务图模板的数据结构可以采用任务信息表、第一同步信息表和第二同步信息表这三张表来描述。下面分别对这三张表进行介绍。
任务信息表包括多个任务标识,以及每个任务标识对应的处理方式。可选的,每个任务标识对应的处理方式可以为每个任务的具体计算方式。
示例性的,以任务图模板包括N个任务为例,任务信息表可以如表1所示。
表1
任务调度装置基于表1所示的任务信息表,可以获取每个任务的处理方式。表1所示的任务信息表用于指示任务图模板的静态处理方式。
示例性的,表1中的TaskType0至TaskTypeN表示任务标识,在任务信息表中每个任务标识对应一个函数关系function,TaskInfo表示函数中具体变量的指针位置,任务调度装置基于表1中的TaskInfo和function可以得到每个任务标识对应的具体计算方式。
例如,以TaskType0对应的任务为a+b为例,function0为加法,TaskInfo0用于指示变量a和变量b的指针位置,根据任务标识TaskType0,查表1可知,TaskType0对应的任务为a+b。可以理解的,变量a和变量b的具体数值为动态的数据,可以根据不同的任务图实时获取该动态数据。而上述任务信息表中存储的每个任务的处理方式为静态数据,因此,多个任务图都可以采用上述任务信息表获取任务的处理方式,只是对于不同任务图动态数据的数值不相同。
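示例性的,上述“根据任务标识查任务信息表得到function与TaskInfo,再结合动态数据得到具体计算”的过程可以用如下Python代码示意。表中的任务标识与数据均为假设的示例,TaskInfo在此用变量名列表代替指针位置:

```python
import operator

# 任务信息表(表1)的简化示意:任务标识 -> (function, TaskInfo)
task_info_table = {
    "TaskType0": (operator.add, ["a", "b"]),   # a+b
    "TaskType1": (operator.sub, ["c", "d"]),   # c-d
}

def resolve(task_type, dynamic_data):
    """查任务信息表,结合动态输入数据得到该任务的计算结果。"""
    func, var_names = task_info_table[task_type]
    args = [dynamic_data[name] for name in var_names]
    return func(*args)

data = {"a": 1, "b": 2, "c": 4, "d": 3}        # 动态数据,随任务图变化
result0 = resolve("TaskType0", data)           # 1+2 = 3
result1 = resolve("TaskType1", data)           # 4-3 = 1
```

表中的function为静态信息,只有字典data中的输入数据随任务图变化。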
可选的,任务信息表还可以包括任务优先级、队列编号、任务计算量、亲和性标签TAG(同一个亲和性标签TAG可以发给同一个计算单元执行)等信息。负载均衡模块接收任务调度装置调度的任务后,可以基于任务优先级、队列编号、任务计算量、亲和性标签TAG等信息进行负载均衡、优先级调度、亲和性调度等操作。
第一同步信息表包括多个事件,以及每个事件对应的一个或多个屏障barrier的标识,该多个事件与多个任务一一对应,每个事件用于指示其对应的任务执行完成。
其中,barrier用于协调多个任务并行工作,只有在barrier满足触发条件时,才能继续执行下一个任务。每个barrier可以对应一个计数器,该barrier的值即为其对应的计数器的数值。
示例性的,以任务图模板包括N个任务为例,每个任务都可以对应一个事件Event,每个事件用于指示其对应的任务执行完成,第一同步信息可以如表2所示。
表2
在每个任务执行完成时,任务调度装置基于第一同步信息表,可以获取每个事件对应的barrier的标识,并基于该事件对应的barrier的标识,修改该barrier对应的计数器的数值。可选的,任务调度装置修改barrier对应的计数器的数值时,可以将该barrier对应的计数器的数值加一,也可以将该barrier对应的计数器的数值减一。实际应用中,任务调度装置修改barrier对应的计数器的值时,是将计数器的值加一还是减一,与计数器的初始值有关。下述实施例以barrier的初始值均为0,任务调度装置修改一次barrier对应的计数器的值是将其加一为例进行说明。
可选的,一个事件可以对应一个或多个barrier。如表2所示,当一个事件对应多个barrier时,该事件对应的任务执行完成后,修改该事件对应的多个barrier的值。
可选的,多个事件也可以对应同一个barrier。即该多个事件对应的多个任务中每个任务执行完成后都要修改该barrier的值。例如,以Event0、Event1和Event2分别表示Task0、Task1、Task2执行完成为例,如表2所示,由于Event0、Event1和Event2对应的barrier的标识均包括0x1,故Task0执行完成后,任务调度装置将barrier0x1的值加一,Task1执行完成后,任务调度装置将barrier0x1的值再加一,Task2执行完成后,任务调度装置将barrier0x1的值又加一。
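示例性的,第一同步信息表的维护过程可以用如下Python代码示意。其中事件与barrier的对应关系为假设的示例数据,计数器初始值均为0,每次修改为加一:

```python
# 第一同步信息表(表2)的简化示意:事件 -> 其对应的barrier标识
event_to_barriers = {
    "Event0": ["0x1"],
    "Event1": ["0x1"],
    "Event2": ["0x1", "0x2"],
}
counters = {"0x1": 0, "0x2": 0}    # 每个barrier对应一个计数器

def on_event(event):
    """某任务执行完成产生事件后,修改该事件对应的全部barrier的值。"""
    for b in event_to_barriers[event]:
        counters[b] += 1

for e in ("Event0", "Event1", "Event2"):   # Task0、Task1、Task2依次完成
    on_event(e)
# 此时barrier 0x1被加了三次,0x2被加了一次
```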
第二同步信息表包括多个barrier、每个barrier对应的触发条件,以及每个barrier满足其对应的触发条件时的待执行任务标识。第二同步信息表中,每个barrier可以对应一个或多个触发条件,每个barrier满足其对应的一个触发条件时的待执行任务可以为一个或多个。
可选的,第二同步信息表还可以包括每个barrier的有效位,每个barrier的有效位用于指示该barrier是否有效。第二同步信息表还可以包括每个待执行任务标识对应的执行次数。
示例性的,以一个barrier对应两个触发条件为例,第二同步信息可以如表3所示。
表3
在barrier的值更新的情况下,任务调度装置基于该barrier的标识,查询第二同步信息表,确定该barrier的值是否满足其对应的触发条件。在该barrier的值满足其对应的触发条件时,任务调度装置基于第二同步信息表,确定待执行的任务标识,并基于任务信息表(表1)获取该任务标识对应的任务内容,然后向计算单元发送该任务。
如表3所示,一个barrier可以对应多个触发条件,barrier的值满足不同触发条件时,待执行的任务不同。当一个barrier对应多个触发条件时,第二同步信息表中该多个触发条件可以按照触发顺序依次排列。例如,表3中的触发条件trigger_condition0的触发顺序早于trigger_condition1的触发顺序。
根据上述表3可以确定每个barrier的值是否满足其对应的触发条件。在barrier的值满足其对应的触发条件的情况下,依据表3可以获取下一个待执行的任务的标识。
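示例性的,第二同步信息表的查询过程可以用如下Python代码示意。其中触发条件与待执行任务均为假设的示例数据:

```python
# 第二同步信息表(表3)的简化示意:(barrier, 触发条件值, 待执行任务)
trigger_table = [
    ("b1", 1, ["T2", "T3"]),   # b1的值为1时,触发T2和T3
    ("b1", 2, ["T4"]),         # b1的值为2时,触发T4
]

def ready_tasks(barrier, value):
    """barrier的值更新后查表,返回满足触发条件的待执行任务。"""
    out = []
    for b, cond, tasks in trigger_table:
        if b == barrier and value == cond:
            out.extend(tasks)
    return out

hit = ready_tasks("b1", 1)     # 满足第一个触发条件,返回["T2", "T3"]
miss = ready_tasks("b1", 3)    # 不满足任何触发条件,返回[]
```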
可选的,多个任务可以复用同一个barrier。多个任务复用同一个barrier是指该多个任务的触发可以依赖于同一个barrier。即在第二同步信息表中,当多个任务复用同一个barrier时,该barrier的值满足一个或多个触发条件时,对应的待执行任务为该多个任务。由于一个barrier的值可以通过一个计数器维护,当多个任务复用同一个barrier时,可以减少计数器的数量,从而减小芯片面积。
示例性的,以任务图模板包括第一任务和第二任务为例,该第一任务和第二任务可以复用同一个barrier。以第一任务和第二任务复用barrier0为例,在第二同步信息表中,第一任务可以为barrier0满足第一触发条件时对应的待执行任务,第二任务可以为barrier0满足第二触发条件时对应的待执行任务,该第一触发条件和第二触发条件可以相同,也可以不同。也就是说,复用同一个barrier的多个任务的触发条件可以相同,也可以不同。当第一触发条件和第二触发条件相同时,第一任务和第二任务为并行执行的两个任务。
在本申请实施例中,第一任务和第二任务满足以下四种情况中的至少一种时,第一任务和第二任务可以复用同一个barrier。
情况一、第一任务和第二任务均没有父节点。
可选的,在任务图模板中第一任务和第二任务为根节点时,第一任务和第二任务可以复用同一个barrier。
例如,以任务图模板包括任务T1至T4,任务T1、任务T2和任务T3为任务T4的父节点为例。如图6中的(a)所示,由于任务T1、任务T2和任务T3为根节点,即任务T1、任务T2和任务T3没有父节点,因此任务T1、任务T2和任务T3可以复用同一个barrier。如图6中的(a)所示,任务T1、任务T2和任务T3的触发条件可以相同,以任务T1、任务T2和任务T3复用barrier0,任务T1、任务T2和任务T3的触发条件为barrier0=0为例,当barrier0为0时,触发任务T1、任务T2和任务T3,计算单元并行执行任务T1、任务T2和任务T3。
情况二、第一任务和第二任务具有相同的父节点。
例如,以任务图模板包括任务T1至T5,任务T1、任务T2和任务T3为任务T4和任务T5的父节点,任务T4和任务T5可以并行执行为例。如图6中的(b)所示,由于任务T4和任务T5的父节点均为任务T1至T3,因此任务T4和任务T5的父节点相同,故任务T4和任务T5可以复用同一个barrier。如图6中的(b)所示,任务T4和任务T5的触发条件可以相同,以任务T4和任务T5复用barrier0,任务T4和任务T5的触发条件为barrier0=3为例,当barrier0为3时,触发任务T4和任务T5,计算单元并行执行任务T4和任务T5。
结合上述情况一和情况二可知,任务图模板中触发条件相同的多个任务可以复用同一个barrier。
情况三、第一任务为第二任务唯一的父节点。
例如,以任务图模板包括任务T1至T5,任务T1、任务T2和任务T3为任务T4的父节点,任务T4为任务T5的父节点为例。如图6中的(c)所示,由于任务T4为任务T5唯一的父节点,因此任务T4和任务T5可以复用同一个barrier。如图6中的(c)所示,任务T4和任务T5的触发条件不同,以任务T4的触发条件为barrier0=3,任务T5的触发条件为barrier0=4为例,当barrier0为3时触发任务T4,任务T4执行完成后,barrier0的值修改为4,满足任务T5的触发条件,触发任务T5,计算单元执行任务T5。
情况四、第一任务和第二任务的根节点复用同一个barrier,且第一任务为第二任务唯一的父节点。
例如,以任务图模板包括任务T1至T5,任务T1、任务T2和任务T3为任务T4的父节点,任务T4为任务T5的父节点为例。如图6中的(d)所示,由于任务T4和任务T5的根节点为任务T1、任务T2和任务T3,任务T1、任务T2和任务T3可以复用同一个barrier,而且任务T4为任务T5唯一的父节点,因此任务T4和任务T5可以复用同一个barrier。如图6中的(d)所示,任务T4和任务T5的触发条件不同,以任务T4的触发条件为barrier0=3,任务T5的触发条件为barrier0=4为例,当barrier0为3时触发任务T4,任务T4执行完成后,barrier0的值修改为4,满足任务T5的触发条件,触发任务T5,计算单元执行任务T5。
结合上述情况三和情况四可知,任务图模板中触发条件不同的多个任务也可以复用同一个barrier。
需要说明的是,一个任务图模板中的多个任务如果满足以上四种情况中的任一种或多种,该多个任务可以复用同一个barrier。实际应用中,多个任务复用同一个barrier的情况不限于上述四种情况,具体可以根据任务图模板中多个任务的依赖关系,确定多个任务是否能复用同一个barrier。
例如,以任务图模板包括8个任务,分别为任务T1至任务T8为例。如图7中的(a)所示的一种任务图模板的结构示意图,在不复用barrier(或计数器)的情况下,图7中的(a)所示的任务图模板中的任务T1至任务T8的触发依赖于barrier1至barrier7。即任务T1至任务T8的触发分别依赖于不同的barrier。在不复用barrier(或计数器)的情况下,图7中的(a)所示的任务图模板对应的第二信息表如下表4所示。
表4
| barrier标识 | 触发条件 | 待执行任务标识 |
| --- | --- | --- |
| b1 | b1=0 | T1 |
| b2 | b2=1 | T2、T3 |
| b3 | b3=1 | T4 |
| b4 | b4=2 | T5 |
| b5 | b5=2 | T6 |
| b6 | b6=2 | T7 |
| b7 | b7=1 | T8 |
再例如,以任务图模板包括8个任务,分别为任务T1至任务T8为例。如图7中的(b)所示的一种任务图模板的结构示意图,在复用barrier(或计数器)的情况下,图7中的(b)所示的任务图模板中,由于任务T1为任务T2唯一的父节点,任务T1为任务T3唯一的父节点,任务T2和任务T3的父节点相同均为任务T1,而且任务T2为任务T4唯一的父节点,因此任务T1、任务T2、任务T3和任务T4均可以复用同一个barrier,记为图7中的(b)所示的b1。由于任务T7是任务T8唯一的父节点,因此任务T7和任务T8可以复用同一个barrier,记为图7中的(b)所示的b4。在复用barrier(或计数器)的情况下,图7中的(b)所示的任务图模板对应的第二信息表如下表5所示。
表5
| barrier标识 | 触发条件 | 待执行任务标识 |
| --- | --- | --- |
| b1 | b1=0 | T1 |
| b1 | b1=1 | T2、T3 |
| b1 | b1=2 | T4 |
| b2 | b2=2 | T6 |
| b3 | b3=2 | T5 |
| b4 | b4=2 | T7 |
| b4 | b4=3 | T8 |
根据上述表4和表5可知,对于同一个任务图模板,在不复用barrier(或计数器)的情况下,任务T1至任务T8的触发依赖于barrier1至barrier7共7个barrier,而在复用barrier(或计数器)的情况下,任务T1至任务T8的触发依赖于barrier1至barrier4共4个barrier。由于一个barrier的值通过一个计数器维护,因此,对于同一个任务图模板,复用barrier相比于不复用barrier,能够大大减少计数器的数量,减小芯片面积。
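示例性的,上述两种配置下计数器数量的差异可以用如下Python代码直接验证。任务与其触发barrier的对应关系依据图7示例中“7个barrier”与“4个barrier”的描述整理,为便于说明的示意数据:

```python
# 图7中(a):不复用barrier时,8个任务的触发依赖b1至b7共7个barrier
no_reuse = {"T1": "b1", "T2": "b2", "T3": "b2", "T4": "b3",
            "T5": "b4", "T6": "b5", "T7": "b6", "T8": "b7"}
# 图7中(b):复用barrier后,只需b1至b4共4个barrier
reuse = {"T1": "b1", "T2": "b1", "T3": "b1", "T4": "b1",
         "T5": "b3", "T6": "b2", "T7": "b4", "T8": "b4"}

counters_no_reuse = len(set(no_reuse.values()))   # 需要7个计数器
counters_reuse = len(set(reuse.values()))         # 只需4个计数器
```

由于每个barrier的值由一个计数器维护,复用barrier直接减少了计数器数量。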
可选的,一个barrier满足其对应的一个触发条件时的待执行任务如果为多个,该多个待执行任务可以并行执行。例如,如表5所示,b1的值为1时,待执行的任务为任务T2和任务T3,计算单元可以并行执行该任务T2和任务T3。
可选的,任务调度装置中存储的任务信息表、第一同步信息表和第二同步信息表可以是CPU(例如,Master core)发送给任务调度装置的,也可以是预先配置在任务调度装置中的,本申请实施例对此并不限定。
可以理解的,本申请实施例中的任务图模板的数据结构可以用任务信息表(表1)、第一同步信息表(表2)和第二同步信息表(表3)这三张表描述。任务调度装置可以基于这三张表对多个任务进行调度。
可选的,任务调度装置可以对其存储的多个任务图模板进行修改和删除,也可以新增任务图模板。
任务调度装置调度任务的过程包括:获取第一任务图的任务信息;基于第一任务图对应的任务图模板标识,在该任务调度装置存储的一个或多个任务图模板中确定第一任务图对应的任务图模板;基于第一任务图的输入数据和第一任务图对应的任务图模板,对第一任务图进行调度。
其中,第一任务图的任务信息包括第一任务图的输入数据和第一任务图对应的任务图模板标识。示例性的,任务调度装置可以接收来自CPU、Master core或加速器的第一任务图的输入数据和第一任务图对应的任务图模板标识。
可选的,由于任务调度装置中存储了多个任务图模板,每个任务图模板的数据结构可以通过上述任务信息表(表1)、第一同步信息表(表2)和第二同步信息表(表3)来描述。任务调度装置可以根据第一任务图对应的任务图模板标识,在其存储的多个任务图模板中确定第一任务图对应的任务信息表(表1)、第一同步信息表(表2)和第二同步信息表(表3)。
可选的,任务调度装置可以基于第一任务图对应的任务信息表、第一同步信息表和第二同步信息表对第一任务图进行调度。
下面以任务调度装置包括多个电路模块为例,结合每个电路模块对任务调度装置调度任务的过程进行详细介绍。
示例性的,如图8所示,本申请实施例提供的任务调度装置可以包括耦合连接的第一接口801、任务图控制电路802、任务状态机803,以及第二接口804。
其中,任务图控制电路802,用于通过第一接口801获取任务图模板,以及第一任务图的任务信息。
可选的,该第一接口801负责接收和识别来自上游模块的命令,并将不同命令路由到不同模块。例如,第一接口801接收CPU发送的任务图模板后,将该任务图模板路由至任务图控制电路802。再例如,第一接口801接收计算单元发送的指示任务执行完成的事件后,解析该事件,并将该事件路由至事件解析电路。
示例性的,任务图控制电路802可以通过第一接口801接收来自CPU的任务图模板。该任务图模板只在任务调度装置创建一次,即可供后续任务图执行多次。可以理解的,本申请实施例中的任务图模板中的依赖关系和处理方式均为静态信息,在后续多次执行任务图时,只获取任务图的动态数据以及要用的任务图模板的标识即可。比如,以第一任务图和第二任务图对应的任务图模板相同为例,任务调度装置创建一次该任务图模板,后续执行第一任务图和第二任务图时,无需将依赖关系和处理方式再次载入任务调度装置,只载入第一任务图和第二任务图的动态数据,以及要用的任务图模板的标识即可,因此能够节省任务图的初始化时间。
任务状态机803,用于基于第二同步信息表,在确定第一barrier的值满足其对应的第一触发条件时,根据第一任务标识、第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路802获取第一任务标识对应的第一任务,并通过第二接口804向计算单元发送该第一任务。该第一任务标识为第一barrier的值满足第一触发条件时待执行的任务的标识。
可选的,该第二接口804负责和下游模块交互,用于将准备就绪的任务发送给负载均衡单元或者计算单元。该第一接口801和第二接口804可以为两个不同的物理接口,也可以为同一个物理接口。当第一接口801和第二接口804为同一个物理接口时,该物理接口既可以接收命令或数据,也可以发送命令或数据。图8仅以第一接口801和第二接口804为不同的物理接口为例进行示意。
示例性的,任务状态机803确定barrier的值是否满足其对应的触发条件的时机,可以包括以下两种情况。
第一种情况,在第二同步信息表包括首任务的触发条件(比如,表5中首个任务T1对应的触发条件b1=0)时,对于第一任务图模板中的首个任务,可以由任务图控制电路802向任务状态机803发送首任务触发信号,该首任务触发信号用于指示任务状态机803查询第二同步信息表,确定b1的值是否满足首任务对应的触发条件。在任务状态机803确定b1的值满足首任务对应的触发条件时,任务状态机803根据首任务标识,第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路802获取首任务的任务内容,并将该首任务的任务内容通过第二接口804发送给计算单元。
第二种情况,对于首任务之后的其他任务,任务状态机803可以在barrier的值更新时,查询第二同步信息表,确定该barrier的值是否满足其对应的触发条件。
可选的,在第二同步信息表不包括首任务的触发条件(比如,表5中不包括首个任务T1对应的触发条件b1=0)时,对于第一任务图模板中的首个任务,也可以由任务图控制电路802向任务状态机803发送首任务执行信号,任务状态机803根据该首任务执行信号,从任务图控制电路802获取首任务的任务内容,并将该首任务的任务内容通过第二接口804发送给计算单元。
可选的,在第一barrier的值满足其对应的第一触发条件时,如果待执行的第一任务为多个,计算单元并行执行该多个第一任务。例如,第一任务为多个时,任务状态机803可以向多个计算单元分别发送该多个第一任务,以使得该多个计算单元并行执行该多个第一任务。
如图8所示,任务调度装置还可以包括事件解析电路805和同步计数电路806,该同步计数电路806包括多个计数器,每个barrier对应一个计数器,每个barrier的值即为其对应的计数器的值。
事件解析电路805,用于在第一任务执行完成的情况下,通过第一接口801接收第一事件,并基于第一同步信息表确定第一事件对应的第二barrier的标识,通知同步计数电路806修改第二barrier对应的计数器的值。其中,第一事件用于指示第一任务执行完成。
同步计数电路806,用于修改第二barrier对应的计数器的值。
可选的,上述第二barrier与第一barrier可以为同一个barrier,也可以为不同的barrier。
可选的,同步计数电路806修改第二barrier对应的计数器的值时,可以将该第二barrier对应的计数器的值加一,也可以将该第二barrier对应的计数器的值减一,还可以将该第二barrier对应的计数器的值加上或减去其他数值。实际应用中,同步计数电路806修改barrier对应的计数器的值时,是将计数器的值增大(例如,加一)还是减小(例如,减一),与该计数器的初始值有关。
例如,当barrier对应的计数器的初始值为0时,同步计数电路806修改第二barrier对应的计数器的值时可以将该计数器的值加一。在该实现方式中,当barrier的值增加到一定数值时,该barrier满足其对应的触发条件。
再例如,当barrier对应的计数器的初始值为根据任务之间的依赖关系预设的非零数值时,同步计数电路806修改第二barrier对应的计数器的值时可以将该计数器的值减一。在该实现方式中,当barrier的值减到0时,该barrier满足其对应的触发条件。
本申请实施例对于同步计数电路806修改barrier对应的计数器的值的具体方法并不限定,下述实施例以barrier的初始值为0,同步计数电路806修改一次barrier对应的计数器的值是将其加一为例进行说明。
示例性的,计算单元执行完第一任务后,可以向任务调度装置发送指示第一任务执行完成的第一事件,第一接口801解析该第一事件,并向事件解析电路805发送该第一事件。事件解析电路805接收该第一事件,并查询第一同步信息表,确定第一事件对应的第二barrier的标识,并通知同步计数电路806修改该第二barrier对应的计数器的数值。同步计数电路806修改第二barrier对应的计数器的数值后,向任务状态机803通知该第二barrier的标识。任务状态机803基于第二同步信息表判断该第二barrier的值是否满足其对应的触发条件,在该第二barrier的值满足其对应的触发条件的情况下,从任务图控制电路802获取下一个待执行的任务,并向计算单元发送该任务。直至第一任务图模板中的所有任务执行完毕。
可以理解的,本申请实施例提供的任务调度装置支持将静态的任务图模板内置,从而任务调度装置在执行处理方式和依赖关系均相同的多个任务图时,不需要CPU每次将任务图对应的处理方式和依赖关系初始化到任务调度装置中,减少了任务图的初始化时间。也就是说,本申请实施例提供的任务调度装置,通过创建一次任务图模板,就可以重复多次执行与该任务图模板的处理方式和依赖关系均相同的任务图,而且在后续执行该多个任务图时,无需再次将静态的处理方式和依赖关系载入任务调度装置,能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。而且本申请实施例提供的任务图模板中多个任务可以复用barrier,能够减少同步计数电路中计数器的数量,从而减小任务调度装置的面积,提高芯片的可扩展性。
下面结合图7和图9对本申请实施例提供的任务调度装置调度任务的过程进行介绍。
示例性的,结合图7中的(a)所示的任务图模板,以任务图模板中多个任务不复用同一个barrier,b1至b7的初始值为0为例。图7中的(a)所示的任务图模板对应的第一同步信息表如下表6所示。
表6
| 事件 | barrier标识 |
| --- | --- |
| Event1 | b2 |
| Event2 | b3、b5 |
| Event3 | b4、b5 |
| Event4 | b4 |
| Event5 | b6 |
| Event6 | b6 |
| Event7 | b7 |
结合上述表1、表6和表4,对本申请实施例提供的任务调度装置调度任务的过程进行介绍。
如图9所示,任务图控制电路802通过第一接口801接收来自CPU的任务图模板,并存储该任务图模板,该任务图模板的数据结构可以采用表1、表4和表6这三张表描述。任务图控制电路802通过第一接口801接收来自CPU的第一任务图的任务信息,该第一任务图对应的任务图模板如图7中的(a)所示。任务图控制电路802向任务状态机803发送首任务触发信号,任务状态机803基于该首任务触发信号查询表4,确认b1的初始值0满足首任务对应的触发条件b1=0,任务状态机803根据首任务标识T1、第一任务图的输入数据,以及表1,从任务图控制电路802获取T1的任务内容,并通过第二接口804向计算单元发送任务T1。
计算单元执行完任务T1后,向第一接口801发送指示任务T1执行完成的Event1。第一接口801解析该Event1,并向事件解析电路805发送该Event1,事件解析电路805接收该Event1,并查询表6,确定Event1对应的barrier的标识为b2,并通知同步计数电路806修改b2对应的计数器的数值。同步计数电路806将b2的值修改为1,并向任务状态机803通知该b2的标识。任务状态机803基于表4,确定b2的值满足其对应的触发条件b2=1,待执行任务标识为T2和T3,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取任务T2和任务T3,并向计算单元1和计算单元2发送任务T2和任务T3。
计算单元并行执行任务T2和任务T3,计算单元执行完任务T2和任务T3后,向第一接口801发送指示任务T2和任务T3执行完成的Event2和Event3。第一接口801解析该Event2和Event3,并向事件解析电路805发送该Event2和Event3,事件解析电路805接收该Event2和Event3,并查询表6,确定Event2对应的barrier的标识为b3和b5,Event3对应的barrier的标识为b4和b5,并通知同步计数电路806修改b3、b4和b5对应的计数器的数值。同步计数电路806将b3的值修改为1,将b4的值修改为1,将b5的值修改为2,并向任务状态机803通知该b3、b4和b5的标识。任务状态机803基于表4,确定b3的值满足其对应的触发条件b3=1,b5的值满足其对应的触发条件b5=2,待执行任务标识为T4和T6,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T4和任务T6,并向计算单元发送任务T4和任务T6。
计算单元执行完任务T4后,向第一接口801发送指示任务T4执行完成的Event4。第一接口801解析该Event4,并向事件解析电路805发送该Event4,事件解析电路805接收该Event4,并查询表6,确定Event4对应的barrier的标识为b4,并通知同步计数电路806修改b4对应的计数器的数值。同步计数电路806将b4的值修改为2,并向任务状态机803通知该b4的标识。任务状态机803基于表4,确定b4的值满足其对应的触发条件b4=2,待执行任务标识为T5,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T5,并向计算单元发送任务T5。
计算单元执行完任务T6后,向第一接口801发送指示任务T6执行完成的Event6。第一接口801解析该Event6,并向事件解析电路805发送该Event6,事件解析电路805接收该Event6,并查询表6,确定Event6对应的barrier的标识为b6,并通知同步计数电路806修改b6对应的计数器的数值。同步计数电路806将b6的值修改为1,并向任务状态机803通知该b6的标识。任务状态机803基于表4,确定b6的值不满足其对应的触发条件b6=2。可选的,计算单元可以并行执行任务T4和任务T6。
计算单元执行完任务T5后,向第一接口801发送指示任务T5执行完成的Event5。第一接口801解析该Event5,并向事件解析电路805发送该Event5,事件解析电路805接收该Event5,并查询表6,确定Event5对应的barrier的标识为b6,并通知同步计数电路806修改b6对应的计数器的数值。同步计数电路806将b6的值修改为2,并向任务状态机803通知该b6的标识。任务状态机803基于表4,确定b6的值满足其对应的触发条件b6=2,待执行任务标识为T7,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T7,并向计算单元发送任务T7。
计算单元执行完任务T7后,向第一接口801发送指示任务T7执行完成的Event7。第一接口801解析该Event7,并向事件解析电路805发送该Event7,事件解析电路805接收该Event7,并查询表6,确定Event7对应的barrier的标识为b7,并通知同步计数电路806修改b7对应的计数器的数值。同步计数电路806将b7的值修改为1,并向任务状态机803通知该b7的标识。任务状态机803基于表4,确定b7的值满足其对应的触发条件b7=1,待执行任务标识为T8,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T8,并向计算单元发送任务T8。任务T8执行完成后第一任务图模板中的所有任务执行完毕。
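示例性的,上述针对图7中(a)的完整调度流程可以用如下Python代码串行模拟。事件与barrier的对应关系、触发条件均取自上文的调度过程;实际装置中任务T2与任务T3、任务T4与任务T6可并行执行,此处仅为单线程示意:

```python
import collections

# 事件(任务完成)-> 需要修改的barrier,与上文描述的表6内容一致
event_to_barriers = {"T1": ["b2"], "T2": ["b3", "b5"], "T3": ["b4", "b5"],
                     "T4": ["b4"], "T5": ["b6"], "T6": ["b6"],
                     "T7": ["b7"], "T8": []}
# (barrier, 触发条件值) -> 待执行任务,与上文描述的表4内容一致
triggers = {("b1", 0): ["T1"], ("b2", 1): ["T2", "T3"], ("b3", 1): ["T4"],
            ("b4", 2): ["T5"], ("b5", 2): ["T6"], ("b6", 2): ["T7"],
            ("b7", 1): ["T8"]}

counters = collections.defaultdict(int)        # 所有barrier初始值为0
order, queue = [], list(triggers[("b1", 0)])   # b1=0满足首任务T1的触发条件
while queue:
    task = queue.pop(0)
    order.append(task)                         # “计算单元”执行该任务
    for b in event_to_barriers[task]:          # 任务完成产生事件,修改barrier
        counters[b] += 1
        queue += triggers.get((b, counters[b]), [])
# order == ['T1', 'T2', 'T3', 'T4', 'T6', 'T5', 'T7', 'T8']
```

模拟得到的执行顺序与上文逐步描述的调度结果一致。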
示例性的,结合图7中的(b)所示的任务图模板,以任务图模板中多个任务复用同一个barrier,b1至b4的初始值为0为例。图7中的(b)所示的任务图模板对应的第一同步信息表如下表7所示。
表7
| 事件 | barrier标识 |
| --- | --- |
| Event1 | b1 |
| Event2 | b1、b2 |
| Event3 | b2、b3 |
| Event4 | b3 |
| Event5 | b4 |
| Event6 | b4 |
| Event7 | b4 |
结合上述表1、表7和表5,对本申请实施例提供的任务调度装置调度任务的过程进行介绍。
如图9所示,任务图控制电路802通过第一接口801接收来自CPU的任务图模板,并存储该任务图模板,该任务图模板的数据结构可以采用表1、表7和表5这三张表描述。任务图控制电路802通过第一接口801接收来自CPU的第一任务图的任务信息, 该第一任务图对应的任务图模板如图7中的(b)所示。任务图控制电路802向任务状态机803发送首任务触发信号,任务状态机803基于该首任务触发信号查询表5,确认b1的初始值0满足首任务对应的触发条件b1=0,任务状态机803根据首任务标识T1、第一任务图的输入数据,以及表1,从任务图控制电路802获取T1的任务内容,并通过第二接口804向计算单元发送任务T1。
计算单元执行完任务T1后,向第一接口801发送指示任务T1执行完成的Event1。第一接口801解析该Event1,并向事件解析电路805发送该Event1,事件解析电路805接收该Event1,并查询表7,确定Event1对应的barrier的标识为b1,并通知同步计数电路806修改b1对应的计数器的数值。同步计数电路806将b1的值修改为1,并向任务状态机803通知该b1的标识。任务状态机803基于表5,确定b1的值满足其对应的触发条件b1=1,待执行任务标识为T2和T3,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取任务T2和任务T3,并向计算单元1和计算单元2发送任务T2和任务T3。
计算单元并行执行任务T2和任务T3,计算单元执行完任务T2和任务T3后,向第一接口801发送指示任务T2和任务T3执行完成的Event2和Event3。第一接口801解析该Event2和Event3,并向事件解析电路805发送该Event2和Event3,事件解析电路805接收该Event2和Event3,并查询表7,确定Event2对应的barrier的标识为b1和b2,Event3对应的barrier的标识为b2和b3,并通知同步计数电路806修改b1、b2和b3对应的计数器的数值。同步计数电路806将b1的值修改为2,将b2的值修改为2,将b3的值修改为1,并向任务状态机803通知该b1、b2和b3的标识。任务状态机803基于表5,确定b1的值满足其对应的触发条件b1=2,b2的值满足其对应的触发条件b2=2,待执行任务标识为T4和T6,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T4和任务T6,并向计算单元发送任务T4和任务T6。
计算单元执行完任务T4后,向第一接口801发送指示任务T4执行完成的Event4。第一接口801解析该Event4,并向事件解析电路805发送该Event4,事件解析电路805接收该Event4,并查询表7,确定Event4对应的barrier的标识为b3,并通知同步计数电路806修改b3对应的计数器的数值。同步计数电路806将b3的值修改为2,并向任务状态机803通知该b3的标识。任务状态机803基于表5,确定b3的值满足其对应的触发条件b3=2,待执行任务标识为T5,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T5,并向计算单元发送任务T5。
计算单元执行完任务T6后,向第一接口801发送指示任务T6执行完成的Event6。第一接口801解析该Event6,并向事件解析电路805发送该Event6,事件解析电路805接收该Event6,并查询表7,确定Event6对应的barrier的标识为b4,并通知同步计数电路806修改b4对应的计数器的数值。同步计数电路806将b4的值修改为1,并向任务状态机803通知该b4的标识。任务状态机803基于表5,确定b4的值不满足其对应的触发条件b4=2。可选的,计算单元可以并行执行任务T4和任务T6。
计算单元执行完任务T5后,向第一接口801发送指示任务T5执行完成的Event5。第一接口801解析该Event5,并向事件解析电路805发送该Event5,事件解析电路805接收该Event5,并查询表7,确定Event5对应的barrier的标识为b4,并通知同步计数电路806修改b4对应的计数器的数值。同步计数电路806将b4的值修改为2,并向任务状态机803通知该b4的标识。任务状态机803基于表5,确定b4的值满足其对应的触发条件b4=2,待执行任务标识为T7,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T7,并向计算单元发送任务T7。
计算单元执行完任务T7后,向第一接口801发送指示任务T7执行完成的Event7。第一接口801解析该Event7,并向事件解析电路805发送该Event7,事件解析电路805接收该Event7,并查询表7,确定Event7对应的barrier的标识为b4,并通知同步计数电路806修改b4对应的计数器的数值。同步计数电路806将b4的值修改为3,并向任务状态机803通知该b4的标识。任务状态机803基于表5,确定b4的值满足其对应的触发条件b4=3,待执行任务标识为T8,任务状态机803基于待执行任务标识以及表1,从任务图控制电路802获取待执行的任务T8,并向计算单元发送任务T8。任务T8执行完成后第一任务图模板中的所有任务执行完毕。
可以理解的,本申请实施例提供的任务调度装置支持将静态的任务图模板内置,从而任务调度装置在执行处理方式和依赖关系均相同的多个任务图时,不需要CPU每次将任务图对应的处理方式和依赖关系初始化到任务调度装置中,减少了任务图的初始化时间。也就是说,本申请实施例提供的任务调度装置,通过创建一次任务图模板,就可以重复多次执行与该任务图模板的处理方式和依赖关系均相同的任务图,而且在后续执行该多个任务图时,无需再次将静态的处理方式和依赖关系载入任务调度装置,能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。而且本申请实施例提供的任务图模板中多个任务可以复用barrier,能够减少同步计数电路中计数器的数量,从而减小任务调度装置的面积,提高芯片的可扩展性。
可选的,在第二同步信息表中的一个barrier对应多个触发条件的情况下,由于该多个触发条件所占的存储资源大,需要较大的存储器件存储该多个触发条件。为了减小任务调度装置的芯片面积,可以将第二同步信息表划分为第一子信息表和第二子信息表,并将第一子信息表存储在任务调度装置中,将第二子信息表存储在内存中。
其中,第一子信息表包括多个barrier,每个barrier对应的首个触发条件,以及每个barrier满足其对应的首个触发条件时的待执行的任务的标识。第二子信息表包括多个barrier,每个barrier对应的其他触发条件,以及每个barrier满足其对应的其他触发条件时的待执行的任务的标识。对于同一个barrier,其对应的首个触发条件的触发顺序早于其对应的其他触发条件的触发顺序。
例如,以每个barrier对应的首个触发条件为trigger_condition0,每个barrier对应的其他触发条件为trigger_condition1和trigger_condition2为例,第一子信息表和第二子信息表分别如表8和表9所示。
表8
表9
可选的,第一子信息表可以存储在任务调度装置的缓存cache中,第二子信息表可以存储在内存(例如,双倍速率(double data rate,DDR)同步动态随机存储器)中,该内存不在任务调度装置中,为任务调度装置以外的存储器。可以理解的,本申请实施例通过将部分触发条件存储在DDR中,可以减小任务调度装置的芯片面积。
示例性的,在barrier对应的计数器的初始值为0的情况下,该barrier满足首个触发条件时的值可以小于barrier满足其他触发条件时的值,从而使得首个触发条件的触发顺序早于其他触发条件的触发顺序。在barrier对应的计数器的初始值为根据任务之间的依赖关系预设的非零数值的情况下,该barrier满足首个触发条件时的值可以大于barrier满足其他触发条件时的值,从而使得首个触发条件的触发顺序早于其他触发条件的触发顺序。
可选的,上述第二子信息表中的多个触发条件可以按照触发顺序依次排列。
任务图控制电路802,还用于在barrier的值满足其对应的首个触发条件时,按照第二子信息表中该barrier对应的多个其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将该barrier对应的首个触发条件替换为该其他触发条件。
barrier对应的多个触发条件中,该下一个其他触发条件的触发顺序紧接着第一触发条件。即该下一个其他触发条件为第一barrier的值满足第一触发条件之后,下一个会被第一barrier触发的触发条件。比如第二个触发条件。
可选的,任务图控制电路802将cache中存储的barrier对应的首个触发条件替换为该barrier对应的第二个触发条件后,还用于在barrier的值满足cache中的第二个触发条件时,按照第二子信息表中该barrier对应的多个其他触发条件的触发顺序,从内存中读取第三个触发条件,并将cache中的第二个触发条件替换为该第三个触发条件。以此类推,直至同一个barrier对应的多个触发条件全部遍历完。
示例性的,以barrier对应的触发条件为3个为例,该3个触发条件按照触发顺序依次为首个触发条件,第二个触发条件和第三个触发条件(该第二个触发条件和第三个触发条件即为上述其他触发条件),任务调度装置的缓存cache中存储首个触发条件,DDR中存储第二个触发条件和第三个触发条件。当barrier的值满足首个触发条件时,任务图控制电路802从DDR中读取第二个触发条件,并将cache中的首个触发条件替换为第二个触发条件。当barrier的值满足第二个触发条件时,任务图控制电路802从DDR中读取下一个其他触发条件(即第三个触发条件),并将cache中的第二个触发条件替换为第三个触发条件。
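示例性的,上述cache与DDR之间触发条件的动态替换过程可以用如下Python代码示意。这里用字典模拟cache、用列表模拟DDR中按触发顺序排列的其他触发条件,数据均为假设的示例:

```python
# DDR中按触发顺序存放barrier的“其他触发条件”(第二子信息表的简化示意)
ddr = {"b1": [(2, ["T4"]), (3, ["T5"])]}
# cache中每个barrier只保留当前待比较的一个触发条件(第一子信息表的简化示意)
cache = {"b1": (1, ["T2", "T3"])}              # 首个触发条件

def on_barrier_update(b, value):
    """barrier值更新后:若满足cache中的触发条件,返回待执行任务,
    并从DDR读取下一个触发条件替换cache中的条目。"""
    cond_val, tasks = cache[b]
    if value != cond_val:
        return []
    if ddr[b]:                                 # 还有后续触发条件
        cache[b] = ddr[b].pop(0)
    return tasks

first = on_barrier_update("b1", 1)    # 满足首个触发条件,触发T2、T3
second = on_barrier_update("b1", 2)   # cache已替换为第二个触发条件,触发T4
```

两次调用之后,cache中b1对应的条目已被依次替换为第二个、第三个触发条件,cache中始终只占用一个条目。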
可以理解的,本申请实施例通过将首个触发条件存储在任务调度装置的cache中,将其他触发条件存储在DDR中,并通过动态的替换cache中的触发条件,可以依次将触发条件载入cache,该方案能够减小任务调度装置的芯片面积,提高芯片的可扩展性。
本申请实施例还提供一种计算设备,如图9所示,该计算设备包括中央处理器CPU,以及图8所示的任务调度装置,CPU用于向任务调度装置发送任务图模板。
可选的,该计算设备还可以包括增强型短消息服务(enhanced message service,EMS)和计算单元,该EMS用于接收来自任务调度装置的待执行任务,并将该待执行任务分配给计算单元。该计算单元用于执行该待执行任务。该计算单元可以为加速器或处理器。该EMS为硬件队列管理和负载均衡模块,用于将待执行任务均衡地分配给计算单元。
示例性的,本申请实施例还提供一种任务调度方法,如图10所示,该任务调度方法应用于图8所示的任务调度装置,该任务调度方法包括以下步骤:
S1001、任务调度装置获取第一任务图的任务信息。
该第一任务图的任务信息包括第一任务图的输入数据和第一任务图对应的任务图模板标识。
该任务调度装置包括一个或多个任务图模板,任务调度装置中的任务图模板可以是接收来自CPU的任务图模板,也可以是预置在任务调度装置中的任务图模板。
可选的,任务图模板的数据结构可以采用上述任务信息表、第一同步信息表和第二同步信息表这三张表来描述。
可选的,上述步骤S1001可以由图8所示的任务调度装置中的任务图控制电路802执行,该任务图控制电路802可以通过第一接口801接收来自CPU或加速器的第一任务图的任务信息。
S1002、基于第一任务图对应的任务图模板标识,在一个或多个任务图模板中确定第一任务图对应的任务图模板。
可选的,上述步骤S1002可以由图8所示的任务调度装置中的任务图控制电路802执行,该任务图控制电路802根据第一任务图对应的任务图模板标识,可以在其存储的多个任务图模板中确定第一任务图对应的任务图模板。
S1003、基于第一任务图的输入数据和第一任务图对应的任务图模板,对第一任务图进行调度。
下面以任务调度装置为图8所示的任务调度装置为例,对步骤S1003的具体执行步骤进行说明。如图11所示,上述步骤S1003可以包括以下步骤:
S10031、任务状态机基于第二同步信息表,在确定第一barrier的值满足其对应的第一触发条件时,根据第一任务标识、第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路获取第一任务标识对应的第一任务,并通过第二接口向计算单元发送第一任务。
该第一任务标识为第一barrier的值满足第一触发条件时待执行的任务的标识。
可选的,对于第一任务图中的首个任务,在上述步骤S10031之前,任务图控制电路802可以向任务状态机803发送首任务触发信号,该首任务触发信号用于指示任务状态机803查询第二同步信息表,确定barrier(在第二同步信息表中该barrier的值满足触发条件对应的待执行任务为首任务)的初始值是否满足首任务对应的触发条件。在任务状态机803确定该barrier的初始值满足首任务对应的触发条件时,任务状态机803根据首任务标识,第一任务图的输入数据,以及第一任务图对应的任务信息表,从任务图控制电路802获取首任务的任务内容,并将该首任务的任务内容通过第二接口804发送给计算单元。
可选的,对于第一任务图中首任务之后的其他任务,任务状态机803可以在barrier的值更新时,查询第二同步信息表,确定该barrier的值是否满足其对应的触发条件。
可选的,如果第一barrier的值满足其对应的第一触发条件时待执行的第一任务为多个,那么计算单元并行执行该多个第一任务。
可选的,任务图模板中的多个任务可以复用同一个barrier。多个任务复用同一个barrier是指该多个任务的触发可以依赖于同一个barrier。即,在第二同步信息表中,如果多个任务复用同一个barrier,那么该barrier的值满足一个或多个触发条件时,对应的待执行任务就是该多个任务。关于多个任务复用同一个barrier的相关描述可以参考前述实施例,在此不再赘述。
S10032、事件解析电路在第一任务执行完成的情况下,通过第一接口接收第一事件,并基于第一同步信息表确定第一事件对应的第二barrier的标识,通知同步计数电路修改第二barrier对应的计数器的值。
其中,第一事件用于指示第一任务执行完成。
第一事件对应的第二barrier可以为一个,也可以为多个。第二barrier可以与第一barrier相同,也可以与第一barrier不同。
可选的,在计算单元执行完成第一任务后,可以向第一接口发送指示第一任务执行完成的第一事件,第一接口解析该第一事件,并将第一事件路由至事件解析电路805。事件解析电路805基于第一事件标识,查询第一同步信息表,确定该第一事件对应的一个或多个第二barrier的标识,并通知同步计数电路806修改该第二barrier对应的计数器的数值。
S10033、同步计数电路修改第二barrier对应的计数器的值。
示例性的,同步计数电路修改第二barrier对应的计数器的值后,该第二barrier的值更新。
可选的,同步计数电路806修改第二barrier对应的计数器的值后,可以向任务状态机803通知该第二barrier的标识。任务状态机803基于第二同步信息表判断该第二barrier的值是否满足其对应的触发条件,并在该第二barrier的值满足其对应的触发条件的情况下,继续执行上述步骤S10031-S10033,直至第一任务图中的所有任务执行完毕。
可选的,当第一barrier对应多个触发条件时,如果第一barrier对应的首个触发条件存储在任务调度装置的cache中,第一barrier对应的其他触发条件存储在DDR中,上述步骤S1003还可以包括:
S10034、在第一barrier的值满足其对应的第一触发条件时,任务图控制电路按照第二子信息表中第一barrier对应的其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将任务调度装置中的第一barrier对应的触发条件替换为该下一个其他触发条件。
第一barrier对应的多个触发条件中,该下一个其他触发条件的触发顺序紧接着第一触发条件。即该下一个其他触发条件为第一barrier的值满足第一触发条件之后,下一个会被第一barrier触发的触发条件。比如,该第一barrier对应的第一触发条件为首个触发条件时,该下一个其他触发条件即为第一barrier对应的第二个触发条件。再比如,该第一barrier对应的第一触发条件为第二个触发条件时,该下一个其他触发条件即为该第一barrier对应的第三个触发条件。
在第一触发条件为首个触发条件的情况下,上述步骤S10034可以包括:在第一barrier的值满足cache中该第一barrier对应的首个触发条件时,任务图控制电路按照第二子信息表中该第一barrier对应的其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将cache中的第一barrier对应的首个触发条件替换为该下一个其他触发条件。
在第一触发条件为首个触发条件之后的其他触发条件时,上述步骤S10034可以包括:在第一barrier的值满足cache中的该第一barrier对应的第一触发条件时,任务图控制电路按照第二子信息表中该第一barrier对应的其他触发条件的触发顺序,从内存中读取下一个其他触发条件,并将cache中该第一barrier对应的第一触发条件替换为该下一个其他触发条件。
示例性的,以第一barrier对应的触发条件为3个为例,该3个触发条件按照触发顺序依次为首个触发条件,第二个触发条件和第三个触发条件(该第二个触发条件和第三个触发条件即为上述其他触发条件),任务调度装置的缓存cache中存储首个触发条件,DDR中存储第二个触发条件和第三个触发条件。当第一barrier的值满足其对应的首个触发条件时,任务图控制电路802从DDR中读取该第一barrier对应的第二个触发条件,并将cache中的第一barrier对应的首个触发条件替换为该第二个触发条件。当第一barrier的值满足其对应的第二个触发条件时,任务图控制电路802从DDR中读取该第一barrier对应的第三个触发条件,并将cache中的第一barrier对应的第二个触发条件替换为该第三个触发条件。
可以理解的,本申请实施例通过将首个触发条件存储在任务调度装置的cache中,将其他触发条件存储在DDR中,并通过动态的替换cache中的触发条件,可以依次将触发条件载入cache,该方案能够减小任务调度装置的芯片面积,提高芯片的可扩展性。
可选的,步骤S10034可以在步骤S10031之后执行,也可以与步骤S10031同时执行,本申请实施例对此并不限定。
需要说明的是,上述步骤S10031-S10034的具体实现方式可以参考前述实施例的相关描述,在此不再赘述。
本申请实施例提供的任务调度方法,由于任务调度装置中存储了静态的任务图模板,因此每次执行任务图时不需要将依赖关系和处理方式再次初始化至任务调度装置,只需要将任务图的动态数据初始化至任务调度装置即可,因此,减少了将依赖关系和处理方式初始化至任务调度装置的时间。与现有技术中每次都需要CPU重新将任务图对应的处理方式和依赖关系初始化到任务调度装置中相比,本申请实施例通过创建一次任务图模板,就可以重复多次执行与该任务图模板的处理方式和依赖关系均相同的多个任务图,而且在后续执行该多个任务图时,无需再次将静态的处理方式和依赖关系载入任务调度装置,能够节省将静态的处理方式和依赖关系载入任务调度装置的时间,提升计算效率。而且本申请实施例提供的任务图模板中多个任务可以复用barrier,能够减少同步计数电路中计数器的数量,从而减小任务调度装置的面积,提高芯片的可扩展性。本申请实施例进一步通过将首个触发条件存储在cache中,将其他触发条件存储在DDR中,并通过动态的替换cache中的触发条件,可以依次将触发条件载入cache,该方案能够减小任务调度装置的芯片面积,提高芯片的可扩展性。
The steps of the methods or algorithms described in connection with the disclosure of this application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a core network interface device. Of course, the processor and the storage medium may also exist as discrete components in a core network interface device.
Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The specific embodiments described above further explain in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.

Claims (24)

  1. A task scheduling apparatus, wherein the task scheduling apparatus comprises one or more task graph templates, each task graph template indicating the dependencies among multiple tasks included in that template and a processing mode of each task; the task scheduling apparatus is configured to:
    obtain task information of a first task graph, the task information of the first task graph comprising input data of the first task graph and a task graph template identifier corresponding to the first task graph;
    determine, among the one or more task graph templates and based on the task graph template identifier corresponding to the first task graph, the task graph template corresponding to the first task graph; and
    schedule the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph.
  2. The apparatus according to claim 1, wherein
    the task scheduling apparatus is further configured to obtain the one or more task graph templates; each task graph template comprises a task information table, a first synchronization information table, and a second synchronization information table; the task information table comprises multiple task identifiers and the processing mode corresponding to each task identifier; the first synchronization information table comprises multiple events and the identifiers of one or more barriers corresponding to each event, the multiple events being in one-to-one correspondence with the multiple tasks, each event indicating that its corresponding task has completed; the second synchronization information table comprises multiple barriers, one or more trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding trigger condition.
  3. The apparatus according to claim 2, wherein the task scheduling apparatus comprises a first interface, a task graph control circuit, a task state machine, and a second interface that are coupled; wherein
    the task graph control circuit is configured to obtain, through the first interface, the task graph template and the task information of the first task graph; and
    the task state machine is configured to, upon determining based on the second synchronization information table that the value of a first barrier satisfies its corresponding first trigger condition, obtain a first task corresponding to a first task identifier from the task graph control circuit according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and send the first task to a computing unit through the second interface; the first task identifier is the identifier of the task to be executed when the value of the first barrier satisfies the first trigger condition.
  4. The apparatus according to claim 3, wherein when there are multiple first tasks, the computing unit executes the multiple first tasks in parallel.
  5. The apparatus according to claim 3 or 4, wherein the task scheduling apparatus further comprises an event parsing circuit and a synchronization counting circuit that are coupled, the synchronization counting circuit comprising multiple counters, each counter corresponding to one barrier;
    the event parsing circuit is configured to, when the first task has completed, receive a first event through the first interface, determine based on the first synchronization information table the identifier of a second barrier corresponding to the first event, and notify the synchronization counting circuit to modify the value of the counter corresponding to the second barrier, wherein the first event indicates that the first task has completed; and
    the synchronization counting circuit is configured to modify the value of the counter corresponding to the second barrier.
  6. The apparatus according to any one of claims 3-5, wherein
    the task graph control circuit is further configured to modify or delete the task graph template.
  7. The apparatus according to any one of claims 2-6, wherein the task graph template comprises a first task and a second task, and the first task and the second task reuse the same barrier.
  8. The apparatus according to claim 7, wherein the first task and the second task satisfy at least one of the following:
    neither the first task nor the second task has a parent node; or
    the first task and the second task have the same parent node; or
    the first task is the only parent node of the second task; or
    the root nodes of the first task and the second task reuse the same barrier, and the first task is the only parent node of the second task.
  9. The apparatus according to any one of claims 2-8, wherein one barrier corresponds to multiple trigger conditions, the multiple trigger conditions comprising an initial trigger condition and other trigger conditions, the initial trigger condition preceding the other trigger conditions in trigger order;
    the second synchronization information table comprises a first sub-information table and a second sub-information table; the first sub-information table comprises the multiple barriers, the initial trigger condition corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding initial trigger condition; the second sub-information table comprises the multiple barriers, the other trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding other trigger conditions.
  10. The apparatus according to claim 9, wherein the first sub-information table is stored in a cache of the task scheduling apparatus, and the second sub-information table is stored in memory.
  11. The apparatus according to claim 10, wherein when the barrier corresponds to multiple other trigger conditions, the multiple other trigger conditions of that barrier in the second sub-information table are arranged sequentially in trigger order; and
    the task graph control circuit is further configured to, when the value of the barrier satisfies its corresponding initial trigger condition, read the next other trigger condition from memory according to the trigger order of the multiple other trigger conditions of that barrier in the second sub-information table, and replace the initial trigger condition of that barrier with that other trigger condition.
  12. A task scheduling method, applied to a task scheduling apparatus, the task scheduling apparatus comprising one or more task graph templates, each task graph template indicating the dependencies among multiple tasks included in that template and a processing mode of each task; the method comprising:
    obtaining, by the task scheduling apparatus, task information of a first task graph, the task information of the first task graph comprising input data of the first task graph and a task graph template identifier corresponding to the first task graph;
    determining, by the task scheduling apparatus among the one or more task graph templates and based on the task graph template identifier corresponding to the first task graph, the task graph template corresponding to the first task graph; and
    scheduling, by the task scheduling apparatus, the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph.
  13. The method according to claim 12, further comprising:
    obtaining, by the task scheduling apparatus, the one or more task graph templates; each task graph template comprises a task information table, a first synchronization information table, and a second synchronization information table; the task information table comprises multiple task identifiers and the processing mode corresponding to each task identifier; the first synchronization information table comprises multiple events and the identifiers of one or more barriers corresponding to each event, the multiple events being in one-to-one correspondence with the multiple tasks, each event indicating that its corresponding task has completed; the second synchronization information table comprises multiple barriers, one or more trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding trigger condition.
  14. The method according to claim 13, wherein the task scheduling apparatus comprises a first interface, a task graph control circuit, a task state machine, and a second interface that are coupled;
    the obtaining, by the task scheduling apparatus, of the task graph template and the task information of the first task graph comprises: obtaining, by the task graph control circuit through the first interface, the task graph template and the task information of the first task graph; and
    the scheduling, by the task scheduling apparatus, of the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph comprises: upon determining, by the task state machine based on the second synchronization information table, that the value of a first barrier satisfies its corresponding first trigger condition, obtaining a first task corresponding to a first task identifier from the task graph control circuit according to the first task identifier, the input data of the first task graph, and the task information table corresponding to the first task graph, and sending the first task to a computing unit through the second interface; the first task identifier is the identifier of the task to be executed when the value of the first barrier satisfies the first trigger condition.
  15. The method according to claim 14, wherein when there are multiple first tasks, the computing unit executes the multiple first tasks in parallel.
  16. The method according to claim 14 or 15, wherein the task scheduling apparatus further comprises an event parsing circuit and a synchronization counting circuit that are coupled, the synchronization counting circuit comprising multiple counters, each counter corresponding to one barrier; the scheduling, by the task scheduling apparatus, of the first task graph based on the input data of the first task graph and the task graph template corresponding to the first task graph further comprises:
    when the first task has completed, receiving, by the event parsing circuit, a first event through the first interface, determining, based on the first synchronization information table, the identifier of a second barrier corresponding to the first event, and notifying the synchronization counting circuit to modify the value of the counter corresponding to the second barrier, wherein the first event indicates that the first task has completed; and
    modifying, by the synchronization counting circuit, the value of the counter corresponding to the second barrier.
  17. The method according to any one of claims 14-16, further comprising:
    modifying or deleting, by the task graph control circuit, the task graph template.
  18. The method according to any one of claims 14-17, wherein the task graph template comprises a first task and a second task, and the first task and the second task reuse the same barrier.
  19. The method according to claim 18, wherein the first task and the second task satisfy at least one of the following:
    neither the first task nor the second task has a parent node; or
    the first task and the second task have the same parent node; or
    the first task is the only parent node of the second task; or
    the root nodes of the first task and the second task reuse the same barrier, and the first task is the only parent node of the second task.
  20. The method according to any one of claims 14-19, wherein one barrier corresponds to multiple trigger conditions, the multiple trigger conditions comprising an initial trigger condition and other trigger conditions, the initial trigger condition preceding the other trigger conditions in trigger order;
    the second synchronization information table comprises a first sub-information table and a second sub-information table; the first sub-information table comprises the multiple barriers, the initial trigger condition corresponding to each barrier, and the identifier of the task to be executed when each barrier satisfies its corresponding initial trigger condition; the second sub-information table comprises the multiple barriers, the other trigger conditions corresponding to each barrier, and the identifiers of the tasks to be executed when each barrier satisfies its corresponding other trigger conditions.
  21. The method according to claim 20, wherein the first sub-information table is stored in a cache of the task scheduling apparatus, and the second sub-information table is stored in memory.
  22. The method according to claim 21, wherein when the barrier corresponds to multiple other trigger conditions, the multiple other trigger conditions of that barrier in the second sub-information table are arranged sequentially in trigger order; the method further comprising:
    when the value of the barrier satisfies its corresponding initial trigger condition, reading, by the task graph control circuit, the next other trigger condition from memory according to the trigger order of the multiple other trigger conditions of that barrier in the second sub-information table, and replacing the initial trigger condition of that barrier with that other trigger condition.
  23. A computing device, wherein the computing device comprises a central processing unit (CPU) and the task scheduling apparatus according to any one of claims 1-11, the CPU being configured to send the task graph template to the task scheduling apparatus.
  24. The computing device according to claim 23, wherein the computing device further comprises an enhanced short message service (EMS) and a computing unit, the EMS being configured to receive a to-be-executed task from the task scheduling apparatus and assign the to-be-executed task to the computing unit, the computing unit being configured to execute the to-be-executed task.
PCT/CN2021/100415 2021-06-16 2021-06-16 Task scheduling method and apparatus WO2022261867A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180097744.8A CN117222980A (zh) 2021-06-16 2021-06-16 Task scheduling method and apparatus
PCT/CN2021/100415 WO2022261867A1 (zh) 2021-06-16 2021-06-16 Task scheduling method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/100415 WO2022261867A1 (zh) 2021-06-16 2021-06-16 Task scheduling method and apparatus

Publications (1)

Publication Number Publication Date
WO2022261867A1 true WO2022261867A1 (zh) 2022-12-22

Family

ID=84526884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100415 WO2022261867A1 (zh) Task scheduling method and apparatus

Country Status (2)

Country Link
CN (1) CN117222980A (zh)
WO (1) WO2022261867A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140039954A1 (en) * 2012-07-31 2014-02-06 Wj Global Llc Project management with task templification and concentration, and automated provider identification and scheduling
CN104166590A (zh) * 2013-05-20 2014-11-26 阿里巴巴集团控股有限公司 Method and system for implementing task scheduling
CN110888721A (zh) * 2019-10-15 2020-03-17 平安科技(深圳)有限公司 Task scheduling method and related apparatus
CN110895486A (zh) * 2018-09-12 2020-03-20 北京奇虎科技有限公司 Distributed task scheduling system
CN111522635A (zh) * 2019-12-31 2020-08-11 支付宝实验室(新加坡)有限公司 Computing task processing method, apparatus, server, and storage medium


Also Published As

Publication number Publication date
CN117222980A (zh) 2023-12-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 21945454; country of ref document: EP; kind code of ref document: A1)
WWE Wipo information: entry into national phase (ref document number: 202180097744.8; country of ref document: CN)
NENP Non-entry into the national phase (ref country code: DE)