US12430166B2 - Hierarchical task scheduling for accelerators - Google Patents
- Publication number
- US12430166B2 (application US17/542,022; US202117542022A)
- Authority
- US
- United States
- Prior art keywords
- task
- tasks
- sub
- accelerator
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/501—Performance criteria
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
Definitions
- a general purpose host processor can be coupled to one or more accelerators—which are specialized circuit modules that can perform certain compute-intensive tasks faster or more efficiently than the host processor.
- scheduling tasks among accelerators has proven to be a significant challenge, and accelerator resources often remain under-utilized.
- scheduling is an NP-complete problem, and an increase in numbers of tasks can exponentially increase the difficulty of finding efficient schedules.
- the challenge of scheduling large numbers of tasks can increase further if a heterogeneous computer is to support multiple client applications concurrently.
- scheduling feasibility weighs in favor of partitioning a computation job into a small number of coarse-grained tasks.
- large tasks can require larger banks of high performance memory at each accelerator in order to efficiently process the large tasks.
- Die sizes can become prohibitive as the number of accelerators is increased. That is, considerations of die size weigh in favor of partitioning a job into a large number of fine-grained tasks, which is at odds with scheduling. Till now, performance and scaling of heterogeneous computers have been limited by these conflicting considerations.
- the disclosed technologies schedule tasks hierarchically.
- coarse scheduling of large tasks can be performed centrally, leading to dispatch of large tasks to individual acceleration modules.
- a received task can be partitioned into smaller sub-tasks, fine scheduling of these sub-tasks can be performed, and the sub-tasks can be executed locally.
- the disclosed technologies can be implemented as a system for scheduling tasks among a plurality of accelerator circuits.
- the system includes a hierarchical task scheduler having a coarse scheduling circuit module and at least two fine scheduling circuit modules.
- the coarse scheduling circuit module is configured to receive task-set metadata; schedule tasks from the task-set metadata among the plurality of accelerator circuits, to optimize a predetermined criterion; and dispatch the scheduled tasks to the plurality of accelerator circuits.
- Each fine scheduling circuit module is communicatively coupled with the coarse scheduling circuit module and with corresponding one or more of the accelerator circuits.
- Each fine scheduling circuit module includes an interface sub-module and an accelerator-specific scheduler (AS) sub-module.
- AS accelerator-specific scheduler
- the interface sub-module is configured to receive, from the coarse scheduling circuit module, the tasks scheduled for the corresponding one or more accelerator circuits.
- the accelerator-specific scheduler (AS) sub-module is configured to partition a given task into one or more streams of first sub-tasks, which include at least some computation sub-tasks, and to schedule the computation sub-tasks among the corresponding one or more accelerator circuits.
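The division of labor described above — a central coarse scheduler dispatching whole tasks, with per-module fine schedulers partitioning them into computation sub-tasks — can be sketched as follows. This is an illustrative model only; all class names, the round-robin partitioning, and the least-loaded dispatch policy are assumptions, not the patent's implementation.

```python
# Illustrative two-level (hierarchical) scheduler sketch.
# CoarseScheduler and FineScheduler are hypothetical names.

class FineScheduler:
    """Accelerator-side: partitions a task into sub-tasks and places them."""
    def __init__(self, accelerator_ids):
        self.accelerator_ids = accelerator_ids

    def partition(self, task, tile_count):
        # One computation sub-task per tile, round-robin over local accelerators.
        return [(f"{task}.{i}", self.accelerator_ids[i % len(self.accelerator_ids)])
                for i in range(tile_count)]

class CoarseScheduler:
    """Host-side: assigns whole tasks to acceleration modules."""
    def __init__(self, modules):
        self.modules = modules                 # {module_name: FineScheduler}
        self.load = {m: 0 for m in modules}    # accumulated cost per module

    def dispatch(self, task, cost):
        # Greedy illustration: send the task to the least-loaded module.
        target = min(self.load, key=self.load.get)
        self.load[target] += cost
        return target, self.modules[target]

coarse = CoarseScheduler({
    "Q": FineScheduler(["Q1", "Q2"]),
    "R": FineScheduler(["R1"]),
})
module, fine = coarse.dispatch("B", cost=10)
subtasks = fine.partition("B", tile_count=4)
```

A real coarse scheduler would optimize a criterion such as makespan over the whole task-set; the greedy policy here only illustrates the control flow between the two levels.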
- the predetermined criterion to be optimized by the coarse scheduling circuit module can be a makespan.
- the accelerator circuits can incorporate respective neural network kernels, each of the neural network kernels configured as a convolution kernel, a batch normalization kernel, or a fully-connected layer kernel.
- the accelerator circuits can include convolution kernels of multiple types.
- At least one of the fine scheduling circuit modules is implemented as a hard-wired circuit.
- the coarse scheduling circuit module and at least one of the fine scheduling circuit modules can be implemented on a common chip.
- the coarse scheduling circuit module and at least one of the fine scheduling circuit modules can include distinct respective processor cores configured to execute respective scheduling instructions.
- the system and the plurality of accelerator circuits can be implemented in a chipset.
- the accelerator circuits can include two or more of: a neuromorphic array, a field-programmable gate array (FPGA), a general-purpose graphics processor unit (GPGPU), or an application specific integrated circuit (ASIC).
- FPGA field-programmable gate array
- GPGPU general-purpose graphics processor unit
- ASIC application specific integrated circuit
- the system, the plurality of accelerator circuits, and a host processor can be incorporated into a high-performance computing (HPC) system.
- the host processor can be configured to provide the task-set metadata to the coarse scheduling circuit module.
- the disclosed technologies can be implemented as a chipset incorporating first circuitry and second circuitry.
- the first circuitry is configured to implement a task scheduler.
- the second circuitry distinct from and coupled to the first circuitry, incorporates a processor core and is configured to implement an accelerator and a sub-task scheduler.
- the first circuitry is further configured to dispatch a first task to the second circuitry based on output from the task scheduler.
- the sub-task scheduler is further configured to schedule a plurality of sub-tasks of the first task for execution at the accelerator.
- the processor core can be a first processor core
- the accelerator can be a first accelerator
- the sub-task scheduler can be a first sub-task scheduler
- the chipset can include third circuitry, coupled to the first circuitry, implementing a second accelerator and a second processor core.
- the second processor core can be configured to implement a second sub-task scheduler.
- the first circuitry can be configured to dispatch a second task to the third circuitry based on output from the task scheduler.
- the second sub-task scheduler can be configured to schedule a plurality of sub-tasks of the second task on the second accelerator.
- the chipset can include eight additional circuitries, each coupled to the first circuitry, each implementing a respective additional accelerator and a respective additional processor core configured to implement a respective additional sub-task scheduler.
- the first circuitry can be configured to dispatch respective tasks to the additional circuitries based on output from the task scheduler.
- Each of the additional sub-task schedulers can be configured to schedule a plurality of sub-tasks of the respective task on the respective additional accelerator.
- the accelerator can implement a neural network kernel, a convolution function, a matrix arithmetic function, a matrix analysis function, a data compression function, an encryption function, a domain transform, a bit blit function, a regular expression search function, a wireless coding function, or a beamforming function.
- the processor core can be a RISC-V processor core.
- FIG. 7 is a dataflow diagram of a second example system implementing the disclosed technologies.
- FIG. 8 is a block diagram of a third example system implementing the disclosed technologies.
- FIG. 16 is a fourth chart illustrating strong scaling performance of examples of the disclosed technologies.
- FIG. 19 is a third chart illustrating weak scaling performance of examples of the disclosed technologies.
- FIG. 20 is a fourth chart illustrating weak scaling performance of examples of the disclosed technologies.
- Heterogeneous computers incorporate accelerators to assist general-purpose processors with a range of compute-intensive tasks, and hold promise for dramatically increasing computing power available in practical devices for a wide range of applications.
- scheduling is NP-complete even in a homogeneous computing environment, and the time required to find a best possible schedule can scale exponentially with number of tasks. Heuristic schedulers can be used, but also suffer in schedule quality as the number of tasks increases.
- the problem of scheduling complexity is significantly exacerbated in a heterogeneous computing environment where processing time for a given task can vary between processing resources, and data transfer time can also vary. This problem is further exacerbated for larger-scale heterogeneous computers, as an increasing number of concurrent applications can be required in order to maintain high levels of utilization as the number of accelerators increases.
- scheduling complexity strongly weighs in favor of large coarse-grain tasks.
- Another consideration strongly favoring large tasks is compatibility with programming models used for client applications. That is, developers of a client application can often readily provide a coarse-grained task graph and corresponding task parameters. Adaptation of the task parameters into a cost matrix for a particular set of heterogeneous computing resources can be a straightforward exercise. That is, large tasks provide programming portability, where accurate task-set metadata can be easily obtained for varying heterogeneous computer architectures.
- determining an efficient fine schedule can also be straightforward. Even with several (e.g. 2-10) accelerators in an acceleration module, the fine scheduling can be much simpler than task scheduling at the apex level.
- the sub-task partitioning or fine scheduling can be implemented as hard-wired circuit modules.
- accelerators and fine schedulers can benefit from design re-use.
- a single parameterized design can be re-used for convolutional kernels with MAC counts ranging from 64 to 1024 or even more.
- the hierarchical approach is well-suited to mixing types of accelerators, or even diverse accelerator technologies, in a single heterogeneous computer.
- the disclosed technologies scale well, both in terms of supporting large numbers of accelerators (because of the low requirements for local memory) and in terms of supporting large numbers of client applications (because of the large tasks at the apex level).
- a “makespan” denotes an overall execution time between start and completion of a predefined computing work. Elapsed time can be measured as real time (sometimes known as “wall clock time”) or processor time (sometimes known as “CPU time”). With regard to parallelization, a clock cycle in which two or more processing units perform work can be counted only once. With regard to processors working at different clock rates, a common time unit can be used. For example, 2 cycles at 100 MHz and 20 cycles at 1 GHz can each count as 20 ns.
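The normalization of cycle counts at different clock rates into a common time unit, as in the example just given, is simply cycles divided by clock frequency:

```python
# Convert cycle counts at different clock rates to a common unit (ns),
# matching the text's example: 2 cycles @ 100 MHz == 20 cycles @ 1 GHz == 20 ns.

def cycles_to_ns(cycles, clock_hz):
    return cycles * 1e9 / clock_hz

a = cycles_to_ns(2, 100e6)   # 2 cycles at 100 MHz
b = cycles_to_ns(20, 1e9)    # 20 cycles at 1 GHz
```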
- a “heuristic” method is a computationally practical method to solve a problem, in particular an optimization problem.
- some scheduling problems lack polynomial-time solutions (e.g. NP-Complete), and can be computationally impractical for even modest problem sizes.
- heuristic scheduling can be used to obtain a “pretty good” solution in a practical amount of computation time, although superior solutions may exist, undiscovered.
- PEFT Predict Earliest Finish Time
- Other exemplary heuristic scheduling techniques include Heterogeneous Earliest Finish Time (HEFT) and Constrained Earliest Finish Time (CEFT).
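A minimal list scheduler in the earliest-finish-time family can be sketched as below. This is a simplified illustration, not the patent's scheduler: real HEFT additionally ranks tasks by upward rank and models inter-processor transfer costs, and the task order and cost values here are invented.

```python
# Minimal earliest-finish-time list scheduler (illustrative sketch).

def eft_schedule(tasks, cost):
    """tasks: task names in dependency-respecting order;
    cost[task][proc] = execution time on that processor.
    Returns {task: (proc, start, finish)}."""
    free_at = {p: 0 for p in next(iter(cost.values()))}  # when each proc is free
    schedule = {}
    for t in tasks:
        # Pick the processor where this task would finish earliest.
        proc = min(free_at, key=lambda p: free_at[p] + cost[t][p])
        start = free_at[proc]
        finish = start + cost[t][proc]
        free_at[proc] = finish
        schedule[t] = (proc, start, finish)
    return schedule

sched = eft_schedule(
    ["T1", "T2", "T3"],
    {"T1": {"P1": 5, "P2": 3},
     "T2": {"P1": 4, "P2": 6},
     "T3": {"P1": 2, "P2": 2}},
)
```

As the text notes, such heuristics deliver a "pretty good" schedule in practical time; the result is not guaranteed optimal.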
- a “pipeline” is an arrangement of connected hardware or software functional blocks operating in parallel, such that data units flow from one functional block to the next (as if along a pipe).
- data units 1, 2, 3 can be successively fed to a pipeline of blocks A, B, C.
- Data unit 1 is processed first by block A (to illustrate, block A can transfer data unit 1 from main memory into local memory), the results from block A are forwarded to block B (to illustrate, block B can compute a matrix inversion on the local data), and the results from block B are forwarded to block C (to illustrate, block C can transfer the results of block B from local memory to main memory).
- When block C is operating on data unit 1, block B can operate on data unit 2, and block A can operate on data unit 3.
- the illustration with matrix inversion has a single input and a single output and is a linear pipeline, however this is not a requirement. In other examples, a pipeline can have branches.
- a convolution function can operate on input data and weights.
- two blocks A1, A2 can operate concurrently, respectively providing input data and weights in local memory for use by a convolution block B.
- an LU-decomposition function can operate on one input matrix, generating two factors of the input matrix as output. Accordingly, two blocks C1, C2 can operate concurrently, respectively transferring the L and U factor matrices computed by block B from local memory to main memory.
- Task A can be assigned to run on accelerator P starting at time T1
- Task B can be assigned to run on accelerator Q starting at time T1
- Task C can be assigned to run on accelerator P starting at time T2.
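The overlap in the linear A/B/C pipeline above can be shown with a small time-slot model. The helper name and the slot bookkeeping are illustrative assumptions; the point is only that once the pipeline fills, all three blocks work concurrently on different data units.

```python
# Time-slot view of the three-block linear pipeline from the illustration:
# each data unit advances one stage per slot, so in slot 2 block C handles
# unit 1, block B handles unit 2, and block A handles unit 3.

def pipeline_slots(n_units, stages=("A", "B", "C")):
    """Return per-slot activity: {slot: {stage: data unit}}."""
    slots = {}
    for unit in range(1, n_units + 1):
        for k, stage in enumerate(stages):
            slot = unit - 1 + k      # unit enters stage k after k slots
            slots.setdefault(slot, {})[stage] = unit
    return slots

slots = pipeline_slots(3)
```

A branched pipeline (such as the convolution or LU-decomposition examples) would need a graph of stages rather than this linear tuple, but the slot arithmetic is analogous.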
- the times can be absolute times according to a particular clock, or can be relative times or time slots, such that a next time slot waits for a preceding time slot to complete before commencing.
- "Schedule" as a verb refers to the act of temporally allocating the computing resources to the tasks or sub-tasks
- A "scheduler" refers to a computational module (which can be implemented as a hard-wired circuit, as software executed by a processor circuit, or as any combination thereof) configured to perform scheduling
- "schedule" as a noun refers to the output of a scheduler listing the temporal allocation of the computing resources.
- "Scheduling instructions" are executable program instructions performing at least a part of a scheduling act.
- Hierarchical scheduling is with reference to an environment in which tasks are defined hierarchically, and scheduling is performed separately for the tasks at each level of the hierarchy. Without loss of generality, the hierarchy can be oriented so that lower-level tasks are sub-tasks of higher-level tasks. In some examples, the task hierarchy can have two levels, with “coarse scheduling” performed on higher level tasks and “fine scheduling” performed on lower level sub-tasks of these tasks. Extending the previous illustration with task B assigned to an acceleration module Q, task B can be partitioned into sub-tasks B1 . . . BN (for some positive integer N>1) at the acceleration module and scheduled at one or more local accelerator resources. Hierarchical scheduling is not limited to two levels.
- sub-task B1 can be assigned to accelerator Q1 within module Q, and further partitioned into third-level tasks B1a, B1b, B1c to be executed as a stream on accelerator Q1.
- the terms coarse and fine scheduling can be applied to any two levels of a task hierarchy.
- a “task” is a discrete unit of computing work.
- tasks include a convolution operation, a matrix multiplication, a search operation, a compression operation, or an encoding or decoding operation.
- a computation job can be organized as a set of multiple tasks dubbed a “task-set”.
- Tasks of a task-set can have dependencies, e.g. task B can operate on the result of task A.
- Some tasks can be further divided into a plurality of “sub-tasks”. For example, a convolution task on a large dataset can be partitioned into smaller sub-tasks each operating on a subset of the large dataset dubbed a “tile”.
- the computing work can include “data transfers” in which data items can be moved, and “computation” in which a processor performs operations on data items. Computation operations can include, without limitation, modifying a data item (either in memory or in a register), testing a condition of a data item, or triggering an operation by a peripheral device.
- a task can be partitioned into one or more computation sub-tasks (which can be executed by an accelerator) and data transfers (which can be executed by a DMA engine).
- a sub-task is a task.
- a task can be performed using a task specification, and can be performed without knowledge of associated task-set metadata.
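The tile-based partitioning described above — splitting a task over a large dataset into sub-tasks that each operate on a subset ("tile") — can be sketched for a 2-D dataset as follows. Tile and dataset sizes here are arbitrary illustrations.

```python
# Partition a 2-D dataset into tiles; each tile becomes one sub-task.
# Edge tiles are clipped to the dataset boundary.

def tile_ranges(height, width, tile_h, tile_w):
    """Yield ((row_lo, row_hi), (col_lo, col_hi)) for each tile."""
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            yield (r, min(r + tile_h, height)), (c, min(c + tile_w, width))

tiles = list(tile_ranges(100, 100, 64, 64))
```

Each tile would then map to the sub-task pattern described later: inbound data transfers, a computation on the tile, and an outbound transfer of results.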
- a “task-set metadata” is metadata specifying resources required for computation of the task-set and dependencies between the constituent tasks.
- the required resources can include computation time for each task of the task-set (which can vary among available processors and accelerators) and an amount of data transfer to, from, or between respective tasks.
- Dependencies can be provided as a task graph, with an amount of data transfer associated with each edge of the task graph.
- Task scheduling can be performed using the task-set metadata, and can be performed without detailed knowledge of the task specifications.
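One plausible in-memory form for such task-set metadata pairs per-task compute costs (which vary by accelerator) with a dependency graph whose edges carry data-transfer amounts. The structure and all values below are invented for illustration; the patent does not prescribe a representation.

```python
# Hypothetical task-set metadata: costs vary per accelerator, and each
# dependency edge carries the amount of data transferred along it.

taskset_metadata = {
    "costs": {                       # cycles per (task, accelerator)
        "T1": {"P1": 14, "P2": 21},
        "T2": {"P1": 9,  "P2": 12},
    },
    "edges": {                       # (src, dst): bytes transferred
        ("T1", "T2"): 4096,
    },
}

def predecessors(meta, task):
    """Tasks whose output this task depends on."""
    return [src for (src, dst) in meta["edges"] if dst == task]
```

Note this carries exactly what coarse scheduling needs — costs and dependencies — without any of the task specifications themselves, matching the separation described above.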
- accelerators of interest herein include: an application-specific integrated circuit (ASIC), comprising hard-wired circuitry customized for certain computer operation(s); a field-programmable gate array (FPGA); a general-purpose graphics processing unit (GPGPU); or a neuromorphic array.
- a heterogeneous computer incorporating two or more classes of accelerators is regarded as “extremely heterogeneous”.
- A "bus" is an electrical pathway providing a facility for data transfer between two or more circuit modules.
- a “shared bus” can couple three or more circuit modules.
- a bus can include one or more conductive paths (sometimes “wires”) for data signals, control signals, or address signals. The number of conductive paths able to carry data signals concurrently can be termed the “width” of the bus.
- DMA direct memory access
- the processor can program a controller (“DMA controller” or “DMA engine”) with source and destination addresses and an amount of data to be transferred, and then trigger the transfer to commence.
- the DMA controller can be a peripheral device of the processor.
- One or both of the source and destination circuit modules can be memories.
- a “DMA-specific” memory address or register refers to a memory location or register which can be written by the processor to configure the DMA controller or which can be written by the DMA controller to report status to the processor.
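The program-then-trigger sequence described above can be sketched as a few memory-mapped register writes. The register offsets, control bit, and `mmio_write` helper are all hypothetical; real DMA engines define their own register maps.

```python
# Hypothetical DMA-engine programming sequence: configure source,
# destination, and length registers, then trigger the transfer.

SRC_REG, DST_REG, LEN_REG, CTRL_REG = 0x00, 0x08, 0x10, 0x18  # invented offsets
CTRL_START = 0x1                                              # invented start bit

def program_dma(mmio_write, src, dst, nbytes):
    """mmio_write(reg_offset, value) performs one DMA-specific register write."""
    mmio_write(SRC_REG, src)
    mmio_write(DST_REG, dst)
    mmio_write(LEN_REG, nbytes)
    mmio_write(CTRL_REG, CTRL_START)   # commence the transfer

# Record the writes instead of touching hardware, for illustration.
writes = []
program_dma(lambda reg, val: writes.append((reg, val)),
            src=0x8000_0000, dst=0x0000_1000, nbytes=4096)
```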
- EDA electronic design automation
- Fabrication of a circuit on a semiconductor die can be performed using "photolithography", wherein a series of patterned operations is performed on a semiconductor surface, with the patterning defined at least in part by varying light exposure over the die.
- a “mask” is a device that controls the light pattern over an entire semiconductor wafer, commonly including multiple dice.
- a “reticle” controls the light pattern for a single die, which can be applied repeatedly for multiple dice on a wafer.
- an “integrated circuit” is a set of one or more electronic circuit modules assembled in a single enclosed package, which can be further integrated with other integrated circuits or electrical components, e.g. on a printed circuit board.
- electronic circuitry can be provided on one or more semiconductor dice.
- a “chipset” is a group of one or more integrated circuits, configured to function together within an electronic apparatus.
- accelerators or memory can be integrated on a same chip as a host processor or coarse scheduler, or can be provided on a separate chip of a chipset.
- A "kernel" is a circuit configured to perform a specific function or operation.
- A kernel configured to perform function X (e.g. convolution) can be dubbed an "X kernel" (e.g. a "convolution kernel").
- Kernel names can be further qualified, e.g. convolution-1024 kernel and convolution-128 kernel can denote convolution kernels having 1024 and 128 multiply-accumulate circuit blocks (dubbed “MACs”) respectively.
- "High-bandwidth memory" (HBM) is a stacked implementation of synchronous DRAM offering wide buses and high data transfer rates.
- Local memory (sometimes, “accelerator-side” memory) can be a small bank of memory (often in a range 64 bytes to 1 megabyte, but sometimes larger) not shared with any devices outside a given acceleration module.
- local memory can be read or written by one or more DMA engines and one or more accelerators.
- Local memory can be implemented as SRAM.
- a “memory controller” is a circuit module which coordinates accesses to one or more banks of memory over one or more buses.
- a memory controller can maintain a queue of scheduled DMA data transfers.
- a memory controller can identify the target of a requested memory operation and forward the requested operation accordingly.
- an available memory address space can be partitioned between main memory, local memory, and memory-mapped I/O addresses; and the address of a read or write instruction can be decoded to identify a target memory device.
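The address decoding just described — partitioning one address space among main memory, local memory, and memory-mapped I/O, then identifying the target of each access — can be sketched as a region lookup. The region boundaries below are invented for illustration.

```python
# Decode an address into its target region, per the partitioning above.
# Base/limit values are hypothetical.

REGIONS = [  # (base, limit, name); non-overlapping
    (0x0000_0000, 0x8000_0000, "main memory"),
    (0x8000_0000, 0x8010_0000, "local memory"),
    (0xF000_0000, 0xF000_1000, "memory-mapped I/O"),
]

def decode(addr):
    """Return the name of the region containing addr."""
    for base, limit, name in REGIONS:
        if base <= addr < limit:
            return name
    raise ValueError(f"unmapped address {addr:#x}")
```

A hardware memory controller would implement the same comparison with address comparators and forward the operation to the matching device.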
- a “client” is a computer hardware or software entity that uses another computer hardware or software entity, as a resource or for a service.
- the hardware or software entity providing the resource or service is dubbed a “server”.
- a software application can be a client of a server, hence the term “client application”.
- a “first-in first-out queue” (“FIFO queue” or simply “FIFO”) is a buffer organized to retrieve data items in the same order as the data items were stored.
- a dispatched task can be stored in a FIFO at an accelerator device until the task is retrieved for processing (e.g. by partitioning the task into sub-tasks and executing the sub-tasks).
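The dispatched-task FIFO just described behaves like a simple queue: tasks come out in dispatch order. A minimal sketch using Python's standard deque:

```python
# FIFO of dispatched tasks at an acceleration module: tasks are retrieved
# in the same order they were stored.
from collections import deque

task_fifo = deque()
task_fifo.append("T4")        # dispatched first
task_fifo.append("T8")        # dispatched second
first = task_fifo.popleft()   # retrieved first, in dispatch order
```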
- a “neural network” is an artificial network of “units” (or “cells”) that has linkages modeled on behavior of biological neurons and can be implemented by one or more electronic circuit modules, either hard-wired or as software executed on a processor.
- the units of a neural network can be organized in a graph of layers, such that output of one layer provides input to one or more other layers.
- a neural network can be executed as a task-set, with each task of the task-set being one layer of the neural network. Numerous layer types are used in the art. Layer types of interest herein include convolution, batch normalization, and fully-connected (or dense) layers.
- a “smartphone” is a handheld mobile communication device integrating voice or video telephony with a computing environment in which software applications can be installed by a user to extend or customize functionality.
- Smartphones have diverse computing requirements for processing both media streams and wireless signals, and can benefit from acceleration using the disclosed technologies.
- Smartphones can process wireless signals under numerous standards, often grouped according to technology generation. For example, fifth-generation (“5G”) standards have been deployed and sixth-generation (“6G”) standards are under development.
- 5G fifth-generation
- 6G sixth-generation
- video is a digital signal representing a temporal stream of images. Often, successive images of a video stream represent successive views of a common scene, e.g. a moving picture.
- Video can be represented as a stream of image frames, encoded according to a video coding standard. Non-limiting examples of video standards include Advanced Video Coding (AVC, or H.264) and High Efficiency Video Coding (HEVC, or H.265).
- AVC Advanced Video Coding
- HEVC High Efficiency Video Coding
- Video coding operations for storage, transmission, or reproduction can be compute-intensive.
- Video processing can also include other compute-intensive operations such as resizing or error-correction.
- Directed edges couple respective pairs of vertices and represent a dependency between the corresponding tasks.
- the edge from vertex T1 to vertex T2 indicates that task T2 is dependent on task T1.
- some input data of task T2 can be generated by task T1.
- the edges from vertex T1 to vertices {T2, T3, T4, T5, T6} indicate that each of tasks {T2 . . . T6} is dependent on task T1.
- the edges from vertices {T2, T4, T6} to vertex T8 indicate that task T8 is dependent on each of tasks {T2, T4, T6}.
- task graph 101 is a directed acyclic graph (DAG) having a single top-level vertex T1 (top-level vertices corresponding to top-level tasks which have no dependencies on other tasks) and a single leaf vertex T10 (leaf vertices corresponding to leaf tasks upon which no other tasks are dependent).
- a task graph can have 2, 3, or more top-level vertices, or 2, 3, or more leaf vertices.
- task graph 101 shows each task exactly once.
- a task-set can require a same function to be performed repeatedly with different input data. Because a task specification is a combination of commands (e.g.
- FIG. 1C is a Gantt chart 103 showing a scheduling of tasks T1-T10 on processing resources P1-P3 121.
- the illustrated schedule was obtained heuristically; better schedules may exist.
- Arrow 105 indicates increasing time.
- Rows 113, 133, 135 show the tasks scheduled to run on resources P1, P3, P2 respectively.
- Each resource is assumed to have a single inward DMA facility which can receive data from either of the other processing resources.
- Rows 132, 134 show the DMA transfers to resources P1, P3 respectively.
- the heuristic schedule illustrated in Gantt chart 103 requires no data transfer to resource P2 and, accordingly, inward DMA for resource P2 is omitted from chart 103.
- the DMA transfers are labeled according to the source and destination tasks.
- data transfer D12 represents the data transferred from task T1 to task T2
- data transfer D910 represents the data transferred from task T9 to task T10.
- No data transfer is shown from task T8 to task T10, because these tasks are scheduled to run on the same resource P3, and no physical transfer of data is required.
- Dashed lines generally show the connectivity from source task (e.g. T4) to DMA data transfer (e.g. D48) to destination task (e.g. T8) in a direction of increasing time. To maintain clarity of chart 103, a few dashed lines are omitted.
- Chart 103 also shows end times of each task in italics.
- Task T1 starts at time 0.
- task T1 on resource P2 takes 21 cycles to complete (see cost matrix 102).
- Data transfer D12 takes 17 cycles to complete.
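Putting the quoted figures together: T1 finishes at cycle 21 on P2, and the D12 transfer takes 17 further cycles, so a dependent task T2 placed on a different resource cannot start before cycle 38. As arithmetic:

```python
# Earliest-start arithmetic for the figures quoted above.
t1_start, t1_cost = 0, 21   # T1 on P2, from the cost matrix
d12_cost = 17               # DMA transfer from T1 to T2

t1_finish = t1_start + t1_cost
t2_earliest_start = t1_finish + d12_cost   # if T2 runs on another resource
```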
- FIG. 2 is a block diagram 200 of a system having a heterogeneous set of accelerators and implementing a conventional approach to scheduling.
- a host processor 207 receives computation jobs from one or more client applications 205 , and schedules and distributes tasks from the jobs for execution among multiple accelerators 257 A- 257 C.
- Scheduling module 225 can retrieve task-set metadata 210 from FIFO 211 .
- centralized scheduler 215 can determine a task schedule (containing information similar to Gantt chart 103 ) and task dispatcher 235 can dispatch the tasks for execution according to the task schedule.
- When input data is available, a given accelerator 257A can execute its task, following which output data from the task can be transferred to main memory by DMA engine 263A.
- the other accelerators 257B-257C can operate similarly.
- accelerators 257A-257B implement convolution kernels having 1024 and 128 MAC units respectively.
- Accelerator 257C implements a neural network kernel for a dense layer.
- the local memory required at each accelerator 357A-357C can be reduced by one, two, or more orders of magnitude. Comparing the architectures of FIGS. 2-3 from a chip design perspective, the memory savings can far outweigh the die area required by a RISC-V core 345A with interface 351A and fine scheduler 353A. Further details of acceleration modules 345A-345C are described herein, for example in the context of FIGS. 4-8.
- FIG. 4 is a block diagram 400 of an example acceleration module, its environment, and associated data paths.
- FIG. 4 shows a single acceleration module 445 having a single accelerator 457
- examples of the disclosed technology can include multiple acceleration modules 445 ; or a single acceleration module 445 can include multiple accelerators 457 or additional DMA engines.
- Bus 452 can be implemented as a 4096-bit wide bus to avoid data starvation at accelerator 457 .
- Alternatively, bus 452 can be a 1024-bit wide bus clocked at quad data rate.
- FIG. 5 is a diagram 500 of pipelined sub-task execution, with time increasing from left to right as shown by arrow 505 .
- Six time slots TS 0 . . . TS 5 are marked.
- Sub-task execution activities are shown in five lanes 501 - 504 and 506 .
- The instant task has been partitioned into a succession of tiles. Each tile has two inbound data-transfer sub-tasks, a computation sub-task, and an outbound data-transfer sub-task.
- lanes 501 - 504 depict the time ordering of sub-tasks for tiles 1 - 4 respectively.
- Lane 506 depicts a stream of configuration actions 510 - 516 .
- Sub-tasks occurring in time slot K can be configured in an earlier time slot (K−1).
- configuration blocks 511 - 515 can set configuration registers for sub-tasks occurring in a next time slice.
- Block 511 can set memory-mapped I/O registers for inbound DMA sub-tasks 521 , 531 , which are the only sub-tasks scheduled for time slot TS 0 .
- Configuration block 512 can configure registers for inbound DMA sub-tasks 522 , 532 and also configure accelerator registers for computation sub-task 541 , all of which are scheduled for time slot TS 1 .
- Subsequent configuration blocks 513 - 516 operate similarly, and can also configure memory-mapped I/O registers for outbound DMA sub-tasks 551 - 554 . Configuration for a subsequent time slot can occur while sub-tasks of a current time slot are still executing. Accordingly, the various configuration registers can be buffered to prevent overwriting registers that are in use.
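The register buffering described above can be modeled in software. The sketch below is illustrative only (the class and register names are hypothetical, not from the patent): configuration for the next time slot is staged into a shadow bank while the active bank drives the sub-tasks of the current slot, so in-use registers are never overwritten.

```python
# Illustrative model of double-buffered configuration registers: writes for
# the next time slot go to a shadow bank; a slot boundary swaps banks.
class DoubleBufferedConfig:
    def __init__(self):
        self.active = {}   # registers driving the currently executing slot
        self.shadow = {}   # registers being staged for the next slot

    def stage(self, reg, value):
        """Write a register for the *next* time slot (cf. blocks 511-515)."""
        self.shadow[reg] = value

    def advance_slot(self):
        """At a slot boundary, staged values take effect; the next staging
        bank starts as a copy so unchanged registers persist."""
        self.active = self.shadow
        self.shadow = dict(self.active)

cfg = DoubleBufferedConfig()
cfg.stage("dma_in_src", 0x4000)   # configure a sub-task of the first slot
cfg.advance_slot()                # first slot begins; staged value is active
cfg.stage("dma_in_src", 0x5000)   # configure the next slot while this one runs
assert cfg.active["dma_in_src"] == 0x4000   # in-use register is untouched
```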
- Sub-task 521 can load a tile of an input data array and sub-task 531 can load a tile of an input weights array.
- The data tile and the weights tile can be operated on by the accelerator kernel as shown by computation sub-task 541 .
- Sub-task 541 can perform convolution of the input data tile with the weights tile.
- An outbound data-transfer sub-task 551 can move output of sub-task 541 from local memory to main memory, freeing up local memory space for subsequent tiles.
- The sub-tasks of FIG. 5 can also be viewed as streams.
- Sub-tasks 521 - 524 form a stream of inbound data-transfer sub-tasks (e.g. for tiles of an input data array), and sub-tasks 531 - 534 form a second stream of inbound data-transfer sub-tasks (e.g. for tiles of a weights array).
- Sub-tasks 541 - 544 form a stream of computation sub-tasks, and sub-tasks 551 - 554 form a stream of outbound data-transfer sub-tasks.
- The respective sub-tasks can be executed sequentially in successive time slots.
- Sub-tasks executing in a given time slot can take the same time to complete; however, this is not a requirement and, in general, concurrent sub-tasks can take different amounts of time to complete.
- In time slot TS 0 , the two inbound DMA transfers 521 , 531 take different amounts of time, and time slot TS 0 ends when the last sub-task 531 completes.
- In time slot TS 1 , the data transfers 522 , 532 complete before computation sub-task 541 , and time slot TS 1 ends when sub-task 541 completes.
- Time slot TS 2 ends when inbound DMA sub-task 523 completes.
- Outbound DMA sub-task 552 can get a head start.
- The preconditions for sub-task 552 are that computation for tile 2 is complete and the DMA output channel is free. These preconditions are met when computation sub-task 542 finishes, because DMA output sub-task 551 has finished earlier, and sub-task 552 can start before inbound DMA sub-task 523 has finished. All sub-tasks of time slot TS 3 complete at the same time, and sub-tasks 553 , 544 begin immediately thereafter.
- Time slot TS 4 ends when outbound DMA sub-task 553 ends. Outbound DMA sub-task 554 cannot get a head start because the outbound DMA channel is busy until sub-task 553 completes.
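The time-slot rule above (a slot ends when its slowest sub-task completes) can be captured in a few lines. This is a hedged illustration that ignores the head-start optimization just described; the durations and sub-task names are made up for the example.

```python
# Minimal sketch of the time-slot semantics: each slot ends when the last of
# its concurrent sub-tasks completes, so end times accumulate as maxima.
def slot_boundaries(slots):
    """slots: list of dicts mapping sub-task name -> duration (cycles).
    Returns the end time of each slot, assuming slots run back-to-back."""
    t, ends = 0, []
    for subtasks in slots:
        t += max(subtasks.values())   # slot ends with its slowest sub-task
        ends.append(t)
    return ends

# E.g. a first slot holds two inbound DMAs of different lengths; the next
# slot adds a compute sub-task that outlasts both transfers.
ends = slot_boundaries([
    {"dma_in_a": 10, "dma_in_b": 14},                    # ends at cycle 14
    {"dma_in_a": 9, "dma_in_b": 12, "kernel": 20},       # ends at cycle 34
])
assert ends == [14, 34]
```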
- Weights are organized as a 4-D data array of size OD × ID × KH × KW, which can be regarded as a 2-D KH × KW weight array for each pair of input and output array slices.
- This organization of input features, output features, and weights is exemplary and, in other applications, different organization can be used.
- The fine scheduler can partition the task into sub-tasks for respective tiles of the feature maps and respective slices of the weights array.
- The input tiles can be organized as 3-D arrays of size ith × itw × itd and the output tiles can be organized as 3-D arrays of size oth × otw × otd. All variables in legend 620 are positive integers.
- Data structures can be allocated and initialized for respective streams of sub-tasks: sWeights and sIFmap are sub-task streams for inbound data-transfer sub-tasks similar to 521 , 531 of FIG. 5 ; sKernel is a sub-task stream for kernel computation sub-tasks similar to 541 ; and sOFmap is a sub-task stream for outbound data-transfer sub-tasks similar to 551 .
- Lines L 04 -L 07 begin nested loops over output array height (indexed by oh, with stride equal to tile height oth), output array width (indexed by ow, with stride equal to tile width otw), output array depth (indexed by od, with stride equal to tile depth otd) and, in the innermost loop, input array depth (indexed by id, with stride equal to tile depth itd).
- Additional nested loops could traverse input array height IH and input array width IW; however, in the present illustration each sub-task is presumed to span the entire range IH × IW, and additional nested loops are not required.
- An inbound DMA sub-task can be configured to transfer a tile of weights data from a main memory block starting at address aWeightsMain to a local memory block starting at address aWeightsLocal.
- This sub-task can be similar to sub-task 531 of FIG. 5 and can be added to stream sWeights and designated for execution at time slot next_ts.
- A second inbound DMA sub-task can be configured to transfer a tile of input feature data from a main memory block starting at address aIFmapMain to a local memory block starting at address aIFmapLocal.
- An outbound DMA sub-task (similar to 551 ) can be configured to transfer a tile of output data from a local memory block starting at address aOFmapLocal to a main memory block starting at address aOFmapMain.
- The DMA sub-tasks can be conditioned on checking an enable variable as shown at lines L 10 , L 12 , L 15 .
- A DMA sub-task can be omitted if the required data is already present in local memory and can be reused.
- The instant example uses a 4-D weights array, with different weight data elements for each input slice id.
- In other examples, a 2-D weights array KH × KW can be used; the weights array can be loaded just once into local memory and reused thereafter.
- Nested loops can be organized with input data index id varying in an outer loop, as a result of which the same input data can be reused as output data indexes are varied in inner loops.
- Lines L 17 -L 19 control pipelined execution of the configured sub-tasks.
- The sub-tasks previously configured for the current time slot ts can be executed, e.g. inbound DMA sub-tasks 523 , 533 for tile 3 , computation sub-task 542 for tile 2 , and outbound DMA sub-task 551 .
- Lines L 17 -L 18 can be non-blocking.
- The fine scheduler can then block (wait) until sub-tasks of the current time slot complete, before proceeding to a next iteration within the nested loops L 04 -L 23 . Closing operations are performed at line L 24 .
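The loop structure described above can be sketched in Python. This is a hedged reconstruction of the shape of the nested loops, not the patented code: the stream names (sWeights, sIFmap, sKernel, sOFmap) follow the text, while the function name, sub-task labels, and the exact slot offsets are illustrative assumptions.

```python
# Hypothetical sketch of the fine scheduler's nested loops: each tile visit
# enqueues inbound DMA, kernel, and outbound DMA sub-tasks onto their
# streams, tagged with a time slot so they execute as a pipeline.
def build_streams(OH, OW, OD, ID, oth, otw, otd, itd):
    sWeights, sIFmap, sKernel, sOFmap = [], [], [], []
    next_ts = 0
    for oh in range(0, OH, oth):               # output height, stride oth
        for ow in range(0, OW, otw):           # output width, stride otw
            for od in range(0, OD, otd):       # output depth, stride otd
                for id_ in range(0, ID, itd):  # input depth, stride itd
                    # inbound transfers for the next slot (cf. 531, 521)
                    sWeights.append(("dma_in_weights", next_ts))
                    sIFmap.append(("dma_in_ifmap", next_ts))
                    # compute on the loaded tile one slot later (cf. 541)
                    sKernel.append(("convolve_tile", next_ts + 1))
                    # write the finished tile out a slot after that (cf. 551)
                    sOFmap.append(("dma_out_ofmap", next_ts + 2))
                    next_ts += 1
    return sWeights, sIFmap, sKernel, sOFmap

# One tile per output axis, input depth split in two -> 2 pipeline stages.
w, i, k, o = build_streams(OH=8, OW=8, OD=4, ID=8, oth=8, otw=8, otd=4, itd=4)
assert len(w) == len(i) == len(k) == len(o) == 2
assert k[0][1] == 1 and o[0][1] == 2   # kernel and output lag one/two slots
```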
- FIG. 7 is a dataflow diagram 700 of a second example system implementing hierarchical scheduling.
- Hard-wired or software-implemented components are illustrated with angular corners, and data objects passed between these components are illustrated with rounded corners.
- FIG. 7 includes components of an innovative system together with environment components with which the innovative system can operate.
- One or more client applications 705 can be coupled to supervisory module 725 to provide tasks to be scheduled and executed using the disclosed technologies.
- An image recognition application 705 can spawn an Inception v3 neural network task-set, which can be provided to the innovative system.
- The task-set can include task-set graph 712 and a cost matrix 714 , similar to those described in context of FIGS. 1 A- 1 B , in task-set metadata 710 .
- Coarse scheduler 715 can operate on metadata 710 to generate task-set schedule 730 , optimizing schedule 730 according to a predetermined criterion.
- Task-set metadata 710 and task specifications 720 can be provided by developers of client application 705 .
- Alternatively, at least part of the task-set metadata 710 can be determined by a host processor separate from client application 705 using techniques disclosed herein.
- The combination of coarse scheduler 715 and task dispatcher 735 can be regarded as a coarse scheduling circuit module.
- The task-set can also include specifications 720 of the constituent tasks of the instant task-set.
- Each task specification 720 can include a data segment 722 and a command segment 724 .
- Data segment 722 can specify input or output data for the instant task and can variously include: literal data required as input for task execution; a link or reference to such literal data; an input data label matching another output data label among the task specifications 720 ; an output data label; or memory addresses to be used for input or output of data from the instant task.
- Command segment 724 can specify the work to be performed for the instant task and can variously include: program instructions for the instant task; a link or reference to such program instructions; a name or other identifier of the task to be performed; a name or other identifier of an accelerator kernel with which the task is to be performed.
- Task-set schedule 730 and task specifications 720 can be provided as inputs to task dispatcher 735 , which can route the constituent tasks (that is, task specifications 720 ) of the instant task-set among various processing resources according to task-set schedule 730 .
- Task dispatcher 735 can dispatch tasks 721 on a just-in-time basis to respective processing resources, so that each processing resource acts on its designated tasks (including performing any necessary sub-task scheduling) as tasks 721 arrive.
- Alternatively, task dispatcher 735 can dispatch tasks 721 in a correct order (as designated in task-set schedule 730 ) to each processing resource without regard to the scheduled time for that task 721 .
- Tasks 721 dispatched to a given processing resource can be queued in a FIFO for the given processing resource until the given processing resource is ready to act on each such task 721 in turn.
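The dispatch-and-queue behavior described above can be sketched as follows. This is an illustrative software model only (the patent describes hardware/driver logic); the task names, resource names, and schedule are hypothetical.

```python
from collections import deque

# Sketch of in-order dispatch with per-resource queueing: each task is
# routed, in schedule order, to the FIFO of its scheduled resource, where
# it waits until that resource is ready to act on it.
def dispatch(schedule, resources):
    """schedule: list of (task_id, resource_id) in schedule order.
    resources: dict resource_id -> deque acting as that resource's FIFO."""
    for task_id, res in schedule:
        resources[res].append(task_id)   # queued until the resource is ready

fifos = {"accel_A": deque(), "accel_B": deque(), "cpu": deque()}
dispatch([("T1", "accel_A"), ("T2", "accel_B"),
          ("T3", "accel_A"), ("T4", "cpu")], fifos)
assert list(fifos["accel_A"]) == ["T1", "T3"]   # per-resource order preserved
```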
- Available processing resources can include one or more acceleration modules 745 and, optionally, one or more general purpose processors 765 . That is, tasks providing the best overall performance improvement through acceleration can be scheduled on an acceleration module 745 , while other tasks can be scheduled on a general purpose processor 765 .
- General purpose processors 765 are not a requirement.
- A task-set can comprise solely tasks suitable for acceleration.
- A client application 705 can also perform general work locally within client application 705 , leaving accelerable tasks for an innovative system.
- Tasks 721 dispatched by task dispatcher 735 can include additional metadata beyond data and command segments 722 , 724 of task specification 720 .
- Task 721 can be accompanied by metadata indicating a particular time slot at which execution of the instant task is scheduled to begin or, in cases where an acceleration module 745 has a plurality of accelerators 757 , an indication of a particular accelerator (or accelerators) on which the instant task is scheduled to be executed.
- Each acceleration module 745 can include circuitry of interface sub-module 751 , fine scheduler 753 , and one or more accelerators 757 .
- The combination of interface 751 and fine scheduler 753 can be regarded as a fine scheduling circuit module, which is coupled to the coarse scheduling circuit module (in particular, task dispatcher 735 ) and to accelerator(s) 757 .
- Interface sub-module 751 can be configured to receive tasks 721 scheduled for accelerators 757 .
- Fine scheduler 753 can be customized specifically for accelerator(s) 757 and can be configured to partition a given received task 721 into one or more streams of sub-tasks including at least some computation sub-tasks.
- Fine scheduler 753 can generate corresponding sub-task specifications 740 as output.
- Fine scheduler 753 can be further configured to schedule the computation sub-tasks among accelerator(s) 757 with, e.g., a sub-tasks schedule 750 generated as output.
- Sub-task dispatcher 755 can receive sub-task specifications 740 and sub-tasks schedule 750 as input, and can be configured to dispatch sub-tasks 740 according to schedule 750 . As for task dispatcher 735 , sub-task dispatcher 755 can variously dispatch sub-tasks on a just-in-time basis, or in advance, to be queued among accelerators 757 .
- FIG. 7 also depicts routing of output data from tasks or sub-tasks.
- Arrows 762 denote sub-task output data routed within acceleration module 745 , e.g. from one accelerator 757 to another accelerator 757 , or to interface 751 .
- Interface 751 can forward sub-task output data in streaming fashion as each sub-task's output is received.
- Alternatively, interface 751 can aggregate sub-task output data into consolidated task output data.
- Arrows 764 indicate paths for forwarding task output data (either streaming or consolidated) to another acceleration module 745 or to a general purpose processor 765 to be used as input for a subsequent task.
- Sub-task output data can be routed to multiple destinations. To illustrate, the same sub-task output data can be used by a subsequent sub-task of the same task, can be used at a different accelerator of an instant acceleration module 745 , and can be part of the overall task output.
- FIG. 8 is a block diagram 800 of a third example system based on the architecture of FIG. 7 .
- The system of FIG. 8 incorporates a coarse scheduling circuit module, containing coarse scheduler 815 (similar to 715 of FIG. 7 ) and task dispatcher 835 (similar to 735 ), together with two or more fine scheduling circuit modules 845 .
- Coarse scheduler 815 can receive task-set metadata 810 and can generate therefrom task-set schedule 830 (based on optimization of a predetermined criterion), allocating tasks to respective acceleration modules 845 .
- In varying examples, acceleration module 745 can include just one accelerator 757 ; the system implementation can include just one acceleration module 745 ; or the overall implementation can include just one client application 705 .
- Zero, one, or more general purpose processors 765 can be provided for task execution.
- Interface sub-module 751 can include a FIFO configured to queue tasks 721 received from task dispatcher 735 .
- Sub-tasks 740 can include data-transfer sub-tasks in addition to the computation sub-tasks described above.
- The fine scheduling circuit module within each acceleration module 745 can include a respective memory controller (not shown) with inbound and outbound DMA engines.
- Fine scheduler 753 can be configured to schedule data-transfer sub-tasks for execution by the memory controller using the appropriate DMA engine.
- The data-transfer sub-tasks can be pipelined with the computation sub-tasks as described further herein.
- Data-transfer sub-tasks can transfer inbound data from a main memory to local memory of acceleration module 745 , while outbound data can be transferred from local memory to the main memory.
- The memory controller can use DMA channels to read or write local memory, and can use a shared bus to read or write the main memory.
- The bus can be shared with other acceleration modules 745 , supervisory module 725 , general purpose processor 765 , or other components of an instant compute environment.
- Sub-tasks partitioned from a given task can include instructions to configure memory-mapped I/O addresses.
- Values 0x1000 and 0x2000 can be written to accelerator-specific memory-mapped addresses to configure the accelerator to read sub-task input data starting at location 0x1000 in local memory and to write sub-task output data starting at location 0x2000 of local memory.
- Exemplary DMA-specific memory-mapped addresses can be written with 0x4000 0000 and 0x1000 to cause inbound DMA to transfer data starting at main memory address 0x4000 0000 to local memory starting at address 0x1000, and similarly for outbound DMA.
- Additional memory-mapped addresses can be used to specify a transfer count, i.e. how much data is to be read or written.
- Memory-mapped addresses can also be used to trigger a sub-task or to indicate completion of the sub-task.
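The register programming sequence described above can be illustrated with a small software model. The register offsets and class below are assumptions for illustration, not the patent's actual register map; a real implementation would perform volatile stores to physical addresses.

```python
# Hypothetical MMIO register offsets for a DMA channel (illustrative only).
DMA_SRC, DMA_DST, DMA_COUNT, DMA_GO = 0x00, 0x04, 0x08, 0x0C

class MmioDevice:
    """Models a small MMIO window as a dict of register offset -> value."""
    def __init__(self):
        self.regs = {}
        self.started = False

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == DMA_GO:          # writing the trigger starts the sub-task
            self.started = True

dma = MmioDevice()
dma.write(DMA_SRC, 0x40000000)   # inbound: read from main memory 0x4000 0000
dma.write(DMA_DST, 0x1000)       # ...into local memory at 0x1000
dma.write(DMA_COUNT, 64 * 1024)  # transfer count: 64 kB
dma.write(DMA_GO, 1)             # trigger the transfer
assert dma.started and dma.regs[DMA_DST] == 0x1000
```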
- Coarse scheduler 715 can perform scheduling to minimize a makespan of the task-set.
- Coarse scheduler 715 can be configured to implement PEFT, HEFT, CEFT, or another heuristic procedure to derive task-set schedule 730 from task-set metadata 710 .
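As a rough illustration of the class of list-scheduling heuristics named above, the following is a simplified HEFT-style scheduler. It is a sketch under stated assumptions, not the patented coarse scheduler: communication costs between resources are omitted for brevity (real HEFT includes them), and all task and resource names are hypothetical.

```python
# Simplified HEFT-style list scheduling: rank tasks by average cost plus
# critical path to exit, then greedily place each task on the resource
# giving the earliest finish time.
def heft_lite(succ, cost):
    """succ: task -> list of successor tasks (a DAG).
    cost: task -> {resource: execution time}.
    Returns task -> (resource, start, finish)."""
    rank = {}
    def upward(t):                       # critical-path-to-exit rank
        if t not in rank:
            avg = sum(cost[t].values()) / len(cost[t])
            rank[t] = avg + max((upward(s) for s in succ[t]), default=0.0)
        return rank[t]
    for t in cost:
        upward(t)
    free = {r: 0.0 for r in next(iter(cost.values()))}  # resource ready times
    done, sched = {}, {}
    for t in sorted(cost, key=lambda t: -rank[t]):      # highest rank first
        # predecessors always rank higher, so they are already in `done`
        ready = max((done[p] for p in cost if t in succ[p]), default=0.0)
        r = min(free, key=lambda r: max(free[r], ready) + cost[t][r])
        start = max(free[r], ready)
        done[t] = free[r] = start + cost[t][r]
        sched[t] = (r, start, done[t])
    return sched

# Tiny DAG: T1 -> {T2, T3}; two heterogeneous resources P1, P2.
s = heft_lite({"T1": ["T2", "T3"], "T2": [], "T3": []},
              {"T1": {"P1": 4, "P2": 2}, "T2": {"P1": 3, "P2": 6},
               "T3": {"P1": 5, "P2": 3}})
assert s["T1"][0] == "P2"            # T1 placed where it finishes earliest
assert s["T2"][1] >= s["T1"][2]      # successors wait for T1 to finish
```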
- Successive task-set metadata 710 can be queued in a FIFO (similar to 311 ) coupled between coarse scheduler 715 and one or more hosts generating task-set metadata 710 for respective jobs.
- accelerators 757 can implement respective neural network kernels, each accelerator configured as a convolution kernel, a batch normalization kernel, or a fully-connected layer kernel. Accelerator circuits 757 can implement at least two different types of convolution kernels. In further applications, the accelerators can include two or more classes of accelerators 757 . Non-limiting examples of accelerator classes include: a neuromorphic array; a field-programmable gate array (FPGA); a general-purpose graphics processor unit (GPGPU); or an application specific integrated circuit (ASIC).
- Fine scheduler 753 and interface 751 can be implemented as hard-wired circuitry.
- Alternatively, coarse scheduler 715 and fine scheduler 753 can be implemented on distinct respective processor cores executing scheduling instructions.
- A software fine scheduler 753 can be deterministic (like FIG. 6 ) or can perform heuristic optimization using, e.g., PEFT.
- Coarse scheduler 715 and fine scheduler 753 can be implemented on a common chip.
- Sub-task scheduler 953 can be implemented as program instructions executed on core 947 , while accelerator 957 can be implemented as a hard-wired circuit.
- Alternatively, accelerator 957 can be implemented as program instructions executed on core 947 or on another processor.
- Processor core 947 can be a RISC-V processor core which can be configured to execute instructions of sub-task scheduler 953 , and a GPGPU alongside core 947 can be configured to execute program instructions of accelerator 957 .
- First circuitry 925 can incorporate another processor core.
- This processor core can be configured to implement task scheduler 915 by executing corresponding program instructions.
- Alternatively, task scheduler 915 can be hard-wired circuitry of a peripheral device coupled to and controlled by that processor core.
- The chipset can include N additional instances of circuitry having similar functionality as second circuitry 945 , but having possible differences in accelerator types or other internal implementation.
- A third circuitry can implement a second accelerator and a second processor core, with the second processor core configured to implement a second sub-task scheduler.
- First circuitry 925 can be configured to dispatch a second task to the third circuitry based on output of task scheduler 915 .
- The second sub-task scheduler can be configured to schedule a plurality of sub-tasks of the second task for execution on the second accelerator.
- N can be any positive integer, such as 2, 3, any number from 4 to 10, 11 to 100, 101 to 1000, or 1001 to 1 million.
- The accelerators can include three or more mutually heterogeneous accelerators or accelerators of different classes.
- The chipset can be implemented as a single chip.
- Accelerator 957 can implement diverse functions including, without limitation: a neural network layer, convolution, matrix arithmetic, matrix analysis, compression, encryption, domain transformation, a bit blit function, regular expression search, wireless coding, or beamforming.
- Processor 947 can control reconfiguration of accelerator 957 to implement at least two of the above functions.
- The plurality of sub-tasks of the first task can be scheduled as a pipeline of first data-transfer sub-tasks, second computation sub-tasks, and third data-transfer sub-tasks.
- The first sub-tasks can load input data of the first task into local memory of accelerator 957 and the third sub-tasks can transfer output data of the first task out from the local memory of accelerator 957 to a destination.
- The first and third data-transfer sub-tasks can be executed by DMA facilities within second circuitry 945 .
- The second computation sub-tasks can be executed by accelerator 957 .
- The pipeline of sub-tasks can be performed concurrently. In examples, at least 50% of the second computation sub-tasks can be performed concurrently with (i) a first data-transfer sub-task (data in) or (ii) a third data-transfer sub-task (data out).
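The concurrency property just stated can be checked with a simple interval-overlap computation. This is an illustrative sketch with made-up start/end times, not a measurement from the patent.

```python
# Sketch: given start/end intervals, compute what fraction of computation
# sub-tasks run concurrently with at least one data-transfer sub-task.
def overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]   # open-interval overlap test

def concurrent_fraction(compute, transfers):
    hits = sum(any(overlap(c, t) for t in transfers) for c in compute)
    return hits / len(compute)

compute = [(10, 20), (25, 35)]     # two computation sub-task intervals
transfers = [(0, 12), (30, 40)]    # an inbound and an outbound DMA window
assert concurrent_fraction(compute, transfers) == 1.0  # both overlap a DMA
```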
- FIG. 10 is a flowchart 1000 of an example method according to the disclosed technologies.
- Computer-readable media are programmed with definitions of circuitry implementing hierarchical task schedulers as described herein.
- The computer-readable media can be any non-transitory or tangible storage media as described herein or otherwise known in the art.
- A definition of first circuitry implementing a task scheduler can be produced.
- A definition of second circuitry can be produced, the second circuitry incorporating a processor core and implementing a sub-task scheduler.
- The first circuitry can be configured to dispatch a first task to the second circuitry based on output from the task scheduler.
- The sub-task scheduler can be configured to schedule sub-tasks of the first task for execution at an accelerator.
- The computer-readable descriptions can be stored on computer-readable media usable for fabricating masks or reticles for manufacturing integrated circuits implementing the first circuitry and the second circuitry. The method can be performed using EDA tools.
- The method can extend to programming the computer-readable media with one or more accelerators coupled to receive sub-task definitions from the second circuitry and execute the sub-task specifications as scheduled by the sub-task scheduler.
- The method can extend to programming the computer-readable media with additional sub-task schedulers configured to schedule sub-tasks of respective tasks on respective accelerators.
- The computer-readable media can be programmed with a definition of a FIFO coupled between the first circuitry and the second circuitry, and configured to queue tasks dispatched from the first circuitry for sub-task scheduling at the second circuitry or execution by an accelerator associated with the second circuitry.
- Inception-v3 has 189 layers (which map 1:1 to vertices of DAG 1200 or to tasks) to be executed to obtain e.g., a classification of an input image.
- ResNet-50 is a 50 layer deep neural network also used for image classification, using residuals and skip connections to provide bypass paths around one or more intermediate layers.
- VGG-16 is a 16-layer deep convolutional neural network providing image classification into 1000 categories.
- U-Net is a convolutional neural network used for image segmentation, featuring upsampled output stages and increased output image resolution.
- a base design (“Design-1”) includes five convolutional accelerators varying from 64 to 1024 MAC units with 128 to 256 kB of local SRAM; five batch normalization accelerators varying from 64 to 1024 MAC units with 0.5 to 8 kB of local SRAM; and five fully connected accelerator kernels varying from 64 to 1024 MAC units with uniform 128 kB of local SRAM.
- Advanced designs 2-10 respectively include 2-10× the convolutional accelerators of Design-1 and uniformly twice the batch normalization and fully connected accelerators of Design-1 (i.e. Design-10 has 50 convolutional accelerators and 10 each of batch normalization and fully connected accelerators).
- Advanced designs A-H respectively include 1-8× all the accelerators of Design-1. That is, Design-H has 40 convolutional accelerators, 40 batch normalization accelerators, and 40 fully connected accelerators.
- Design-A and Design-B are identical to Design-1 and Design-2 respectively.
- Each design was packaged into a conventional system according to FIG. 2 and an innovative system according to FIG. 3 , including schedule generation at block 215 or at blocks 315 and 353 A- 353 C.
- The packaged systems were simulated on a GEMS cycle-level simulator for client applications Inception-v3, ResNet-50, U-Net, and VGG-16.
- The coarse scheduling task at block 315 requires scheduling 189 tasks for Inception-v3, 107 tasks for ResNet- 50 , 17 tasks for U-Net, or 16 tasks for VGG-16. Because of the limited local memory of the various accelerators, the conventional scheduler of block 215 was provided with finer grain tasks compatible with the accelerator memory constraints.
- The task count for the conventional systems was 20,469 for Inception-v3 (a 108-fold increase over the innovative system), 6,824 for ResNet-50 (64×), 12,372 for U-Net (728×), or 38,064 for VGG-16 (2,379×).
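The fold-increases quoted above follow directly from the task counts reported for the two systems and can be reproduced with simple division:

```python
# Coarse task counts (innovative system) vs fine-grain task counts
# (conventional system), both taken from the text above.
coarse = {"Inception-v3": 189, "ResNet-50": 107, "U-Net": 17, "VGG-16": 16}
fine = {"Inception-v3": 20469, "ResNet-50": 6824,
        "U-Net": 12372, "VGG-16": 38064}
ratios = {k: round(fine[k] / coarse[k]) for k in coarse}
assert ratios == {"Inception-v3": 108, "ResNet-50": 64,
                  "U-Net": 728, "VGG-16": 2379}
```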
- The heuristic PEFT scheduler was constrained to 1 ms maximum for schedule generation.
- Makespan was determined for both the conventional and innovative systems. In all cases, makespan for the innovative systems was dramatically reduced compared to the conventional systems. Average makespan reduction was by factors of 17.66 for Inception-v3, 9.10 for ResNet-50, 6.96 for U-Net, and 16.96 for VGG-16.
- The innovative designs include a RISC-V core (similar to 345 A) and some additional logic compared to the comparative designs based on FIG. 2 . This extra circuitry was found to impose a 2-3% overhead on die area for the innovative system.
- FIGS. 13 - 16 are charts 1300 , 1400 , 1500 , 1600 illustrating strong scaling performance of the disclosed technologies for Inception-v3, U-Net, ResNet-50, and VGG-16 neural network client applications, respectively.
- Horizontal coordinates 1 - 10 denote Design-1 to Design-10, which have 1-10× the number of convolutional accelerators as Design-1.
- Makespans were measured for 10 parallel instances of the noted application (e.g. Inception-v3 in FIG. 13 ) for an innovative system and for a conventional system.
- A speedup factor was determined relative to the conventional system.
- Graph 1310 shows the baseline speedup of ×1.
- Graph 1305 shows a hypothetical speedup equal to the accelerator multiple on the horizontal axis.
- The speedup of graph 1320 is seen to plateau at a speedup of about 2.5.
- Examination of DAG 1200 shows that significant portions of Inception-v3 lack parallelism, i.e. tasks are executed sequentially. Accordingly, for these portions, 10 instances of Inception-v3 can utilize at most 10 convolutional accelerators. Because Design-3 has 15 convolutional accelerators—and larger Designs 4-10 have even more—convolutional accelerators are often idle (“task starvation”) and speedup can plateau.
- FIGS. 17 - 20 are charts 1700 , 1800 , 1900 , 2000 illustrating weak scaling performance of the disclosed technologies for Inception-v3, U-Net, ResNet-50, and VGG-16 neural network client applications, respectively.
- The horizontal coordinate 1 [A]- 8 [H] indicates Design-A to Design-H, which have 1-8× the number of accelerators as Design-1.
- Makespans were measured for varying numbers of application instances (e.g. Inception-v3 in FIG. 17 ) for an innovative system and for a conventional system.
- A speedup factor was determined relative to the conventional system.
- Graphs 1720 , 1730 , 1740 , 1750 , 1760 show the corresponding speedup for 8, 10, 20, 40, and 100 application instances. Additionally, graph 1705 shows a hypothetical speedup equal to the accelerator multiple on the horizontal axis. FIG. 17 demonstrates good scaling and shows that, with sufficient client applications, a large number of accelerators can be effectively utilized in a single system. (Design H has 120 accelerators.)
- Computing environment 2110 includes one or more processing units 2122 and memory 2124 .
- Processing unit 2122 can execute computer-executable instructions, such as for control or data transfer as described herein.
- Processing unit 2122 can be a general-purpose central processing unit (CPU), a processor in an application-specific integrated circuit (ASIC), a RISC-V processing core, a processor in an FPGA, a general purpose graphics processing unit, a neuromorphic processor, or any other type of processor.
- Multiple processing units can execute computer-executable instructions to increase processing power.
- A computing system 2110 can have additional features, such as one or more of storage 2140 , input devices 2150 , output devices 2160 , or communication ports 2170 .
- An interconnection mechanism such as a bus, controller, or network interconnects the components of the computing environment 2110 .
- Operating system software provides an operating environment for other software executing in the computing environment 2110 , and coordinates activities of the components of the computing environment 2110 .
- A computing apparatus can be local or distributed, and can include any combination of special-purpose hardware (e.g. an accelerator or hard-wired processing circuitry) and/or general-purpose hardware (e.g. a RISC core) and/or virtualized hardware, together with software implementing described functionality.
- Any of the disclosed methods can be implemented as computer-executable instructions or a computer program product stored on one or more computer-readable storage media, such as tangible, non-transitory computer-readable storage media, and executed on a computing device (e.g., any available computing device, including tablets, smartphones, or other mobile devices that include computing hardware).
- Tangible computer-readable storage media are any available tangible media that can be accessed within a computing environment (e.g., one or more optical media discs such as DVD or CD, volatile memory components (such as DRAM or SRAM), or nonvolatile memory components (such as flash memory or hard drives)).
- Computer-readable storage media include memory 2124 and storage 2140 .
- The terms computer-readable storage media or computer-readable media do not include signals and carrier waves.
- The terms computer-readable storage media or computer-readable media do not include communication ports.
- Any of the computer-executable instructions for implementing the disclosed techniques, as well as any data created and used during implementation of the disclosed embodiments, can be stored on one or more computer-readable storage media.
- The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application).
- Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network, a cloud computing network, or other such network) using one or more network computers.
- any of the software-based embodiments can be uploaded, downloaded, or remotely accessed through a suitable communication means.
- suitable communication means include, for example, the Internet, the World Wide Web, an intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF, microwave, infrared, and optical communications), electronic communications, or other such communication means.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Abstract
Description
Claims (28)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/542,022 US12430166B2 (en) | 2020-12-11 | 2021-12-03 | Hierarchical task scheduling for accelerators |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063124268P | 2020-12-11 | 2020-12-11 | |
| US17/542,022 US12430166B2 (en) | 2020-12-11 | 2021-12-03 | Hierarchical task scheduling for accelerators |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20220188155A1 US20220188155A1 (en) | 2022-06-16 |
| US12430166B2 true US12430166B2 (en) | 2025-09-30 |
Family
ID=81942542
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/542,022 Active 2044-02-06 US12430166B2 (en) | 2020-12-11 | 2021-12-03 | Hierarchical task scheduling for accelerators |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US12430166B2 (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114070657B (en) * | 2020-08-03 | 2025-05-30 | 华为技术有限公司 | chip |
| US20230042858A1 (en) * | 2021-08-02 | 2023-02-09 | Nvidia Corporation | Offloading processing tasks to decoupled accelerators for increasing performance in a system on a chip |
| US12530732B2 (en) * | 2022-04-26 | 2026-01-20 | Mediatek Inc. | Enhanced computer vision application programming interface |
| US20220365813A1 (en) * | 2022-06-28 | 2022-11-17 | Rajesh Poornachandran | Apparatus, Device, Method, and Computer Program for Scheduling an Execution of Compute Kernels |
| US20240118923A1 (en) * | 2022-09-28 | 2024-04-11 | Qualcomm Incorporated | Robust scheduling with generative flow networks |
| US12505042B2 (en) * | 2023-03-29 | 2025-12-23 | Samsung Electronics Co., Ltd. | Systems and methods for distributing work between a host and an accelerator using a shared memory |
| EP4451044A1 (en) | 2023-04-20 | 2024-10-23 | Black Semiconductor GmbH | Device for modulating electromagnetic waves and method for producing the same |
| EP4516736A1 (en) | 2023-09-01 | 2025-03-05 | Black Semiconductor GmbH | Layered structure and method for the production of graphene |
| EP4530709A1 (en) | 2023-09-27 | 2025-04-02 | Black Semiconductor GmbH | Method for producing planarized surfaces in layered structures used for producing opto-electronic devices |
| EP4531115A1 (en) | 2023-09-27 | 2025-04-02 | Black Semiconductor GmbH | Opto-electronic component with an electrical via connecting to an electrical conductor layer through two dielectric layers |
| CN117171075B (en) * | 2023-10-27 | 2024-02-06 | 上海芯联芯智能科技有限公司 | Electronic equipment and task processing method |
| CN120256078A (en) * | 2024-01-03 | 2025-07-04 | 华为技术有限公司 | A task processing apparatus, related die, and processing method |
| WO2025231578A1 (en) * | 2024-05-06 | 2025-11-13 | Intel Corporation | Method and apparatus for accelerator rate limiting |
| US20260003821A1 (en) * | 2024-06-26 | 2026-01-01 | Ati Technologies Ulc | Dispatch for a configurable data-flow compute array and data-parallel compute units |
Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7321940B1 (en) | 2003-06-30 | 2008-01-22 | Cisco Technology, Inc. | Iterative architecture for hierarchical scheduling |
| US8655997B2 (en) | 2004-01-30 | 2014-02-18 | International Business Machines Corporation | Hierarchical resource management for a computing utility |
| US20180143975A1 (en) * | 2016-11-18 | 2018-05-24 | Lionbridge Technologies, Inc. | Collection strategies that facilitate arranging portions of documents into content collections |
| US10387202B2 (en) | 2014-09-30 | 2019-08-20 | Hewlett Packard Enterprise Development Lp | Quality of service implementation in a networked storage system with hierarchical schedulers |
| US20200183738A1 (en) * | 2018-12-06 | 2020-06-11 | Raytheon Company | Accelerating dataflow signal processing applications across heterogeneous cpu/gpu systems |
| US20200257560A1 (en) * | 2019-02-13 | 2020-08-13 | GM Global Technology Operations LLC | Architecture and device for multi-stream vision processing on shared devices |
| US20210406646A1 (en) * | 2020-06-30 | 2021-12-30 | Samsung Electronics Co., Ltd. | Method, accelerator, and electronic device with tensor processing |
| US20220058024A1 (en) * | 2020-08-18 | 2022-02-24 | Alibaba Group Holding Limited | Using tagged instruction extension to express dependency for memory-based accelerator instructions |
| US20220092408A1 (en) * | 2020-09-23 | 2022-03-24 | Facebook, Inc. | Neural network weight distribution using a tree direct-memory access (dma) bus |
| US20220147776A1 (en) * | 2020-11-12 | 2022-05-12 | Ambarella International Lp | Unsupervised multi-scale disparity/optical flow fusion |
| US11422821B1 (en) * | 2018-09-04 | 2022-08-23 | Apple Inc. | Age tracking for independent pipelines |
| US11868872B1 (en) * | 2020-03-31 | 2024-01-09 | Amazon Technologies, Inc. | Direct memory access operation for neural network accelerator |
Non-Patent Citations (20)
| Title |
|---|
| Arabnejad, et al., "List Scheduling Algorithm for Heterogeneous Systems by an Optimistic Cost Table," in IEEE Transactions on Parallel and Distributed Systems, vol. 25, No. 3, pp. 682-694 (Mar. 2014). |
| Arm, "AMBA AXI and ACE Protocol Specification," available from https://developer.arm.com/documentation/ihi0022/e/, pp. 1-328 (Feb. 2013). |
| Arm, "AMBA AXI-Stream Protocol Specification," available from https://developer.arm.com/documentation/ihi0051/latest/, pp. 1-56 (Apr. 2021). |
| Arnold, et al., "Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit," 2012 IEEE Computer Society Annual Symposium on VLSI, pp. 249-254 (Aug. 2012). |
| Asanovic, et al., "The Rocket Chip Generator," EECS Department, University of California, Berkeley Technical Report No. UCB/EECS-2016-27, pp. 1-9 (Apr. 2016). |
| Binkert, et al., "The gem5 Simulator," ACM SIGARCH Computer Architecture News, vol. 39, Issue 2, pp. 1-7 (May 2011). |
| Canon, et al., "Online Scheduling of Task Graphs on Heterogeneous Platforms," in IEEE Transactions on Parallel and Distributed Systems, vol. 31, No. 3, pp. 721-732 (Mar. 2020). |
| Dallou, et al., "Nexus#: A Distributed Hardware Task Manager for Task-Based Programming Models," 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 1129-1138 (May 2015). |
| Johnston, et al., "AIWC: OpenCL-based Architecture-Independent Workload Characterization," available from https://arxiv.org/pdf/1805.04207.pdf, 11 pages (Oct. 2018). |
| Kaleem, et al., "Adaptive heterogeneous scheduling for integrated GPUs," 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 151-162 (Aug. 2014). |
| Liu, et al., "Deffe: a data-efficient framework for performance characterization in domain-specific computing," CF '20: Proceedings of the 17th ACM International Conference on Computing Frontiers, pp. 182-191 (May 2020). |
| Ma, et al., "Hierarchical task scheduler for interleaving substacks on heterogeneous multiprocessor platforms," ASP-DAC '05: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pp. 952-955 (Jan. 2005). |
| Morais, et al., "Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor," available from https://core.ac.uk/download/pdf/326217828.pdf, 12 pages, also published as Morais, et al., "Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor," MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 861-872 (Oct. 2019). |
| Shantharama, et al., "Hardware-Accelerated Platforms and Infrastructures for Network Functions: A Survey of Enabling Technologies and Research Studies," IEEEAccess, pp. 132021-132085 (Jul. 2020). |
| Shao, et al., "Co-designing accelerators and SoC interfaces using gem5-Aladdin," 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 12 pages (Oct. 2016). |
| Sjalander, et al., "A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures," 2008 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, pp. 149-157 (Sep. 2008). |
| Topcuoglu, et al., "Performance-effective and low-complexity task scheduling for heterogeneous computing," in IEEE Transactions on Parallel and Distributed Systems, vol. 13, No. 3, pp. 260-274 (Mar. 2002). |
| Vetter, et al., "Extreme Heterogeneity 2018: Productive Computational Science in the Era of Extreme Heterogeneity Report for DOE ASCR Basic Research Needs Workshop on Extreme Heterogeneity," available from https://www.osti.gov/servlets/purl/1494112, pp. 1-49 (Jan. 2018). |
| Waterman, et al., "The RISC-V Instruction Set," poster available from https://old.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25- posters/HC25.26.p70-RISC-V-Warterman-UCB.pdf, 1 page (2013). |
| Western Digital, "RISC-V and Open Source Hardware Address New Computer Requirements," available from https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/collateral/tech-brief/tech-brief-western-digital-risc-v.pdf, pp. 1-6 (Dec. 2019). |
Also Published As
| Publication number | Publication date |
|---|---|
| US20220188155A1 (en) | 2022-06-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12430166B2 (en) | Hierarchical task scheduling for accelerators | |
| US12026825B2 (en) | Apparatus and method for reduced precision bounding volume hierarchy construction | |
| US20250103049A1 (en) | Quantizing autoencoders in a neural network | |
| CN114118354B (en) | Efficient SOFTMAX computation | |
| EP3518176B1 (en) | Hardware for matrix computations with sparse matrices for arbitrary neural networks | |
| Abadi et al. | {TensorFlow}: a system for {Large-Scale} machine learning | |
| US20210125071A1 (en) | Structured Pruning for Machine Learning Model | |
| CN109388777A (en) | A system and method for an optimized Winograd convolution accelerator | |
| CN111950695A (en) | Grammar transfer using one or more neural networks | |
| CN109993684A (en) | Compression in machine learning and deep learning processes | |
| WO2021057746A1 (en) | Neural network processing method and apparatus, computer device and storage medium | |
| CN110363698A (en) | For compressing the device and method of the leaf node of enclosure body hierarchical structure (BVH) | |
| Gerlinghoff et al. | E3NE: An end-to-end framework for accelerating spiking neural networks with emerging neural encoding on FPGAs | |
| CN108734298A (en) | GPU/CPU consistency is extended to more GPU cores | |
| CN108694080A (en) | Efficient thread group scheduling | |
| CN108734272A (en) | Convolutional neural networks optimize mechanism | |
| CN109712064A (en) | Use low precision and high-precision mixed inference | |
| CN110389783A (en) | For having the instruction and logic of cumulative contraction dot product | |
| US11645533B2 (en) | IR drop prediction with maximum convolutional neural network | |
| CN109154990A (en) | Lookup convolutional layer in convolutional neural networks | |
| CN109564699A (en) | Device and method for optimized ray tracing | |
| Banerjee et al. | Re-designing CNTK deep learning framework on modern GPU enabled clusters | |
| US12008469B1 (en) | Acceleration of neural networks with stacks of convolutional layers | |
| CN108734284A (en) | Rely on the deep learning of real-time context | |
| CN114265673A (en) | Spatial Slicing of Compute Arrays Using Shared Control |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | FEPP | Fee payment procedure | ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
| | AS | Assignment | Owner name: UT-BATTELLE, LLC, TENNESSEE. ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MINISKAR, NARASINGA RAO;LIU, FRANK Y.;YOUNG, AARON R.;AND OTHERS;SIGNING DATES FROM 20220321 TO 20220408;REEL/FRAME:059556/0426 |
| | AS | Assignment | Owner name: U. S. DEPARTMENT OF ENERGY, DISTRICT OF COLUMBIA. CONFIRMATORY LICENSE;ASSIGNOR:UT-BATTELLE, LLC;REEL/FRAME:059594/0827. Effective date: 20220224 |
| | STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
| | STPP | Information on status: patent application and granting procedure in general | AWAITING TC RESP, ISSUE FEE PAYMENT VERIFIED |
| | STPP | Information on status: patent application and granting procedure in general | PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | PATENTED CASE |
| | AS | Assignment | Owner name: UT-BATTELLE, LLC, TENNESSEE. ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CHAKRABORTY, DWAIPAYAN;REEL/FRAME:072782/0928. Effective date: 20251104 |