CN116820574A - Processing engine mapping for a time-space partitioned processing system - Google Patents

Processing engine mapping for a time-space partitioned processing system

Info

Publication number
CN116820574A
CN116820574A
Authority
CN
China
Prior art keywords
workload
coprocessor
policy
workloads
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310106695.0A
Other languages
Chinese (zh)
Inventor
Pavel Zaykov
Larry James Miller
H. Carvalho
Srivatsan Varadarajan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honeywell International Inc
Original Assignee
Honeywell International Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from U.S. Application No. 17/707,164 (published as US 2023/0305888 A1)
Application filed by Honeywell International Inc filed Critical Honeywell International Inc
Publication of CN116820574A publication Critical patent/CN116820574A/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3877 - Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

Embodiments for improved processing efficiency between a processor and at least one coprocessor are disclosed. Some examples relate to mapping a workload to one or more clusters of coprocessors for execution based on a coprocessor allocation policy. In connection with the disclosed embodiments, the coprocessor may be implemented as a Graphics Processing Unit (GPU), a hardware processing accelerator, a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), or other processing circuitry. The processor may be implemented by a Central Processing Unit (CPU) or other processing circuitry.

Description

Processing engine mapping for a time-space partitioned processing system
Cross Reference to Related Applications
The present application claims the benefit of U.S. Application Serial No. 17/707,164, entitled "PROCESSING ENGINE MAPPING FOR TIME-SPACE PARTITIONED PROCESSING SYSTEMS", filed on March 29, 2022, and claims the benefit of U.S. Application Serial No. 17/705,959, entitled "PROCESSING ENGINE SCHEDULING FOR TIME-SPACE PARTITIONED PROCESSING SYSTEMS", filed on March 28, 2022, and is a continuation of these applications, the contents of which are hereby incorporated by reference.
Statement regarding non-U.S. sponsored research or development
The project leading to this application has received funding from the Clean Sky 2 Joint Undertaking under the European Union's Horizon 2020 research and innovation programme under grant agreement No. 945535.
Background
Real-time processing in a dynamic environment requires processing large amounts of data in a very short time frame. Depending on the particular context, such processing may involve computationally intensive iterative mathematical calculations or intensive data analysis. Fast and accurate data output is important to avoid processing delays, which is especially critical for safety-critical or mission-critical applications, such as those used in avionics.
Some real-time operating systems utilize temporal and/or spatial partitioning to process data. Initially, tasks are performed at a host processor (referred to herein as a "central processing unit" or "CPU") according to instructions from an application program. The CPU is generally responsible for directing the execution of tasks and managing the data output as it executes tasks. The majority of the raw data processing for tasks received at the CPU is performed by a coprocessor distinct from the CPU. When the CPU executes a task, it may assign a workload associated with the task to a coprocessor for processing. A "workload" is also referred to herein as a "job," "kernel," or "shader" of a particular application. Tasks performed by the CPU may require processing that can be performed more quickly on the coprocessor, so the CPU may send one or more requests defining the workload that the coprocessor must perform to complete the task performed by the CPU. These requests are referred to herein as "workload launch requests."
The coprocessor typically receives many such requests, sometimes in a short period of time. Each request may involve a very large amount of intensive computation. The ability to process workload launch requests in a timely manner depends not only on the processing power of the coprocessor, but also on how the coprocessor is utilized to perform the work requested by the host processor. While coprocessors with powerful processing resources can process these requests quickly, they can be expensive to implement, and even they cannot guarantee that tasks with heavy processing requirements will be completed in a short time frame. Less advanced coprocessors with limited processing resources tend to suffer delays associated with insufficient bandwidth for processing additional requests, which may result in a loss of determinism guarantees. In either case, the coprocessor becomes overwhelmed as workload launch requests back up.
Some coprocessors implement temporal and/or spatial partitioning of their processing resources so that multiple jobs may be executed in parallel. However, conventional coprocessors do not provide sufficient spatial isolation, temporal determinism, and responsiveness to concurrently execute multiple safety-critical applications. Failure to process safety-critical applications in time can ultimately lead to a loss of determinism guarantees.
Disclosure of Invention
The details of one or more embodiments are set forth in the description below. The features illustrated or described in connection with one exemplary embodiment may be combined with the features of other embodiments. Thus, any of the various embodiments described herein can be combined to provide further embodiments. Aspects of the embodiments can be modified, if necessary, to employ the concepts of the various patents, applications and publications as identified herein to provide yet further embodiments.
In one embodiment, a processing system is disclosed. The processing system includes a processor and a coprocessor configured to implement a processing engine. The processing system also includes a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor based on a coprocessor scheduling policy. Based on the coprocessor scheduling policy, the processing engine scheduler selects which coprocessor clusters to activate to execute the workload identified by a queue based on the at least one launch request. The coprocessor scheduling policy defines at least one of: tightly coupled coprocessor scheduling, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within the timing window in which the one or more tasks execute on the processor; or loosely coupled coprocessor scheduling, wherein the workload identified by the at least one launch request is scheduled to execute on the coprocessor based on a priority order and either: during a subsequent timing window after the timing window in which the one or more tasks execute on the processor, or relative to an external event common to both the processor and the coprocessor.
In another embodiment, a coprocessor is disclosed. The coprocessor is configured to be coupled to a processor and configured to implement a processing engine. The coprocessor includes at least one cluster configured to execute a workload. The coprocessor includes a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission based on a coprocessor scheduling policy. Based on the coprocessor scheduling policy, the processing engine scheduler is configured to select which of the at least one cluster to activate to execute the workload identified by a queue including the at least one launch request. The coprocessor scheduling policy defines at least one of: tightly coupled coprocessor scheduling, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within the timing window in which the one or more tasks execute on the processor; or loosely coupled coprocessor scheduling, wherein the workload identified by the at least one launch request is scheduled to execute on the coprocessor based on a priority order and either: during a subsequent timing window after the timing window in which the one or more tasks execute on the processor, or relative to an external event common to both the processor and the coprocessor.
In another embodiment, a method is disclosed. The method includes receiving one or more workload launch requests from one or more tasks executing on a processor. The one or more workload launch requests include one or more workloads configured for execution on a coprocessor. The method includes generating at least one launch request in response to the one or more workload launch requests based on a coprocessor scheduling policy. The method includes scheduling, based on the coprocessor scheduling policy, one or more workloads identified in the at least one launch request for execution on the coprocessor by at least one of: tightly coupled coprocessor scheduling, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within the timing window in which the one or more tasks execute on the processor; or loosely coupled coprocessor scheduling, wherein the workload identified by the at least one launch request is scheduled to execute on the coprocessor based on a priority order and either: during a subsequent timing window after the timing window in which the one or more tasks execute on the processor, or relative to an external event common to both the processor and the coprocessor.
In another embodiment, a processing system is disclosed. The processing system includes a processor and a coprocessor configured to implement a processing engine. The processing system includes a processing engine scheduler configured to schedule workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks being executed or already executed on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor. The coprocessor includes a plurality of computing units and at least one command streamer associated with one or more of the plurality of computing units. Based on a coprocessor allocation policy, the processing engine scheduler is configured to allocate, via the at least one command streamer, clusters of computing units of the coprocessor for a given execution partition to execute one or more workloads identified by the one or more workload launch requests according to workload priorities. The coprocessor allocation policy defines at least: an exclusive allocation policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units; an interleaved allocation policy, wherein each workload is executed exclusively across all computing units of the clusters of computing units; a distributed allocation policy, wherein each workload is individually assigned to at least one of the clusters of computing units and to an execution duration during the given execution partition; or a shared allocation policy, wherein each workload is executed non-exclusively by the clusters of computing units, each of which executes multiple workloads simultaneously.
In another embodiment, a coprocessor is disclosed. The coprocessor is configured to be coupled to a processor and configured to implement a processing engine. The coprocessor includes a plurality of computing units each configured to execute a workload. The coprocessor includes a processing engine scheduler configured to allocate workloads for execution on the coprocessor. The processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks being executed or already executed on the processor. In response, the processing engine scheduler is configured to generate at least one launch request for submission to the coprocessor. The coprocessor includes at least one command streamer associated with one or more of the plurality of computing units. Based on a coprocessor allocation policy, the processing engine scheduler is configured to allocate, via the at least one command streamer, clusters of computing units of the coprocessor for a given execution partition to execute one or more workloads identified by the one or more workload launch requests according to workload priorities. The coprocessor allocation policy defines at least: an exclusive allocation policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units; an interleaved allocation policy, wherein each workload is executed exclusively across all computing units of at least one of the clusters of computing units; a distributed allocation policy, wherein each workload is individually assigned to at least one of the clusters of computing units and to an execution duration during the given execution partition; or a shared allocation policy, wherein each workload is executed non-exclusively by clusters of computing units that each execute multiple workloads simultaneously.
In another embodiment, a method is disclosed. The method includes receiving one or more workload launch requests from one or more tasks being executed or already executed on a processor. The one or more workload launch requests include one or more workloads configured for execution on a coprocessor. The method includes generating at least one launch request in response to the one or more workload launch requests. The method includes assigning clusters of computing units of the coprocessor to execute one or more workloads identified in the one or more workload launch requests according to workload priorities, based on a coprocessor allocation policy. The coprocessor allocation policy defines at least: an exclusive allocation policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units; an interleaved allocation policy, wherein each workload is executed exclusively across all computing units of at least one of the clusters of computing units; a distributed allocation policy, wherein each workload is individually assigned to at least one of the clusters of computing units and to an execution duration during a given execution partition; or a shared allocation policy, wherein each workload is executed non-exclusively by clusters of computing units that each execute multiple workloads simultaneously.
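To make the four allocation policies concrete, the following minimal Python sketch maps a set of workloads onto clusters of computing units under each policy. It is an illustrative model only: the names (AllocationPolicy, Workload, allocate) and the simple time-slicing arithmetic are assumptions made for explanation, not the claimed implementation.

```python
from collections import namedtuple
from enum import Enum, auto

Workload = namedtuple("Workload", "name budget_ms")  # budget_ms: assumed per-workload duration

class AllocationPolicy(Enum):
    EXCLUSIVE = auto()    # each workload runs on its own dedicated cluster
    INTERLEAVED = auto()  # one workload at a time, spread across all clusters
    DISTRIBUTED = auto()  # each workload gets a cluster and its own duration
    SHARED = auto()       # clusters execute several workloads concurrently

def allocate(workloads, clusters, policy, partition_ms=10.0):
    """Return (workload, clusters, duration_ms) assignments for one execution partition."""
    if policy is AllocationPolicy.EXCLUSIVE:
        # One dedicated cluster per workload for the whole partition (extra workloads wait).
        return [(w, [c], partition_ms) for w, c in zip(workloads, clusters)]
    if policy is AllocationPolicy.INTERLEAVED:
        # Each workload executes exclusively across all clusters for an equal time slice.
        slice_ms = partition_ms / len(workloads)
        return [(w, list(clusters), slice_ms) for w in workloads]
    if policy is AllocationPolicy.DISTRIBUTED:
        # Each workload is individually assigned a cluster and an execution duration.
        return [(w, [clusters[i % len(clusters)]], w.budget_ms)
                for i, w in enumerate(workloads)]
    # SHARED: clusters are not dedicated; every cluster may run every workload concurrently.
    return [(w, list(clusters), partition_ms) for w in workloads]

jobs = [Workload("dnn_inference", 4.0), Workload("image_filter", 3.0)]
for w, cl, dur in allocate(jobs, ["cluster0", "cluster1"], AllocationPolicy.DISTRIBUTED):
    print(f"{w.name} -> {cl} for {dur} ms")
```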
Drawings
Understanding that the drawings depict only exemplary embodiments and are not therefore to be considered limiting in scope, the exemplary embodiments will be described with additional specificity and detail through the use of the accompanying drawings, in which:
FIGS. 1A-1B depict block diagrams illustrating an exemplary system configured to schedule and allocate launch requests to a coprocessor;
FIGS. 2A-2B depict block diagrams of scheduling and allocating workloads associated with one or more queues to clusters of a coprocessor;
FIG. 3 depicts a diagram of scheduling a workload to multiple clusters of a coprocessor in a loosely coupled coprocessor scheme, according to one embodiment;
FIGS. 4A-4B depict diagrams of scheduling a workload to multiple clusters of a coprocessor in a tightly coupled coprocessor scheme;
FIGS. 5A-5B depict diagrams of synchronous operation between a CPU and one or more clusters of a coprocessor;
FIGS. 6A-6B depict diagrams of preemption policies applied to workloads scheduled to a Graphics Processing Unit (GPU);
FIG. 7 depicts a flowchart illustrating an exemplary method for scheduling a workload for execution on a coprocessor;
FIGS. 8A-8B depict diagrams of data coupling between multiple clusters of a coprocessor;
FIG. 9 depicts a diagram of coprocessor allocation policies applied to multiple clusters of a coprocessor, according to one embodiment;
FIGS. 10A-10C depict block diagrams illustrating an exemplary system configured to allocate workloads to multiple clusters of a coprocessor;
FIG. 11 depicts a flowchart illustrating an exemplary method for allocating workloads to processing resources of a coprocessor;
FIG. 12 depicts a flowchart illustrating an exemplary method for managing processing resources when a workload is executed on a coprocessor;
FIG. 13 depicts a flowchart illustrating an exemplary method for prioritizing workloads in a context; and
FIGS. 14A-14C depict flowcharts illustrating an exemplary method for scheduling and allocating workloads.
In accordance with common practice, the various features described are not necessarily drawn to scale, but are used to emphasize specific features relevant to the exemplary embodiments.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific exemplary embodiments. However, it is to be understood that other embodiments may be utilized and that logical, mechanical, and electrical changes may be made. Furthermore, the methods presented in the figures and description should not be interpreted as limiting the order in which the various steps may be performed. The following detailed description is, therefore, not to be taken in a limiting sense.
Embodiments of the present disclosure provide improvements to scheduling and distributing workloads to coprocessors (e.g., GPUs) for execution. Some embodiments disclosed herein enable scheduling of a workload to a GPU based on a timing window of the CPU such that the GPU is at least partially synchronized with the CPU. Other embodiments enable the GPU to dynamically allocate workloads to optimize the use of processing resources on the GPU. Workload may be referred to herein, unless otherwise indicated, in the singular "workload" or plural "workloads" and it is to be understood that the description applies to a single workload or multiple workloads.
While some examples are shown and described for specifically scheduling and distributing workloads to GPUs, the examples described herein also apply in the context of other systems. For example, such techniques are also applicable to any processing system having one or more processors that schedule and allocate workloads to one or more coprocessors. Coprocessors may typically be implemented in a processing system as integrated or discrete processing units. In various examples, coprocessors may be implemented as graphics processing units ("GPUs"), neural processing units ("NPUs"), data processing units ("DPUs"), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), and other processing circuitry, or combinations thereof.
The coprocessor may accelerate workload processing using conventional execution or execution facilitated by artificial intelligence. For AI-based modeling, coprocessors are used to accelerate the execution of some of the workloads associated with Machine Learning (ML)/Artificial Intelligence (AI) applications. In addition, coprocessors may be used to accelerate the execution of an ML/AI application through an inference engine, which may be used, for example, for Deep Neural Network (DNN) processing. In the figures and description that follow, the coprocessor is implemented as a GPU for purposes of explanation.
FIG. 1A depicts a block diagram illustrating an exemplary system 100 configured to schedule and allocate workloads to coprocessors. In some examples, system 100 implements a real-time operating system (RTOS) that facilitates execution of real-time applications to process data as it arrives, within specified time constraints. The system 100 includes a processor 104 coupled to one or more coprocessors 106. Only one processor 104 and one coprocessor 106 are explicitly shown in FIG. 1A, but it should be understood that any number of processors 104 may be coupled to one or more coprocessors 106.
The processor 104 is configured to receive system parameters from the offline system 102 (e.g., from a stored system configuration), including a coprocessor scheduling policy that determines when to allocate a workload to the coprocessor 106 and a coprocessor allocation policy that determines where to allocate the workload among the processing resources of the coprocessor 106. The processor 104 is further configured to execute tasks 105 received from one or more applications (safety-critical applications, best-effort applications, etc.) running on processing resources (processors, processing circuits) of the processor 104 (not shown in FIG. 1A). Tasks 105 executing on processor 104 may require processing that utilizes processing resources on coprocessor 106. In one example, the processor is configured to prepare one or more workload launch requests when executing a given task 105, where the workload requires data processing by the coprocessor 106 (e.g., DNN processing including mathematical computations such as matrix operations). Other types of processing may also be required for task 105, including but not limited to rendering and/or compute processing. In various examples, a workload is represented as a "kernel," "job," "thread," "shader," or other processing unit. While task 105 is processed by processor 104, coprocessor 106 processes workload launch requests to launch and execute the workload associated with task 105 in parallel with processor 104. The workload launch request includes a sequence of workloads (e.g., kernels) requiring processing, as well as other workload parameters, such as input and output data arrays, workload code, the processing load necessary to complete the workload, the priorities of the workloads included in the launch request, and/or other parameters.
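For reference, the sketch below models the information a workload launch request might carry, based on the parameters named above (workload sequence, input and output data arrays, workload code, processing load, and priority). The class and field names are illustrative assumptions, not the actual request format used by the disclosed system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Kernel:
    """One unit of coprocessor work (kernel/job/shader) named in the request."""
    name: str
    code_ref: str                 # handle to the compiled workload code (assumed representation)
    inputs: List[str] = field(default_factory=list)   # input data arrays
    outputs: List[str] = field(default_factory=list)  # output data arrays

@dataclass
class WorkloadLaunchRequest:
    """Request prepared by a CPU task for processing on the coprocessor."""
    task_id: int                  # task 105 that issued the request
    kernels: List[Kernel]         # sequence of workloads to execute
    priority: int = 0             # higher value -> scheduled earlier
    processing_load: float = 0.0  # estimated work needed to complete (arbitrary units)

# Example: a DNN task offloading two matrix-operation kernels.
req = WorkloadLaunchRequest(
    task_id=105,
    kernels=[Kernel("matmul", "kern_mm", ["A", "B"], ["C"]),
             Kernel("softmax", "kern_sm", ["C"], ["probs"])],
    priority=2,
    processing_load=1.5e6,
)
```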
Processor 104 may include one or more partitions 103. Each partition 103 serves as an independent processing system (e.g., processing core 103 as shown in FIG. 1B) and is configured to perform one or more of tasks 105. Partitions 103 may be partitioned in time and/or space. Temporal and spatial partitioning may be implemented by conventional physical ("hard") partitioning techniques that separate hardware circuitry in the processor 104, or by software methods (e.g., virtualization techniques) that set processing constraints enforced by the processor 104. When spatially partitioned via software, a partition 103 cannot corrupt the code, input/output (I/O), or data storage areas of another partition 103, and the consumption of each partition does not exceed its corresponding shared processing resource allocation. Furthermore, hardware failures attributable to one software partition do not adversely affect the performance of other software partitions. Temporal partitioning is achieved when no software partition consumes more than its allocated execution time on the processing core on which it executes, regardless of whether partitions are executing on none or on all of the other active cores, thereby mitigating temporal interference between partitions hosted on different cores. For a partitioned processor (such as processor 104 shown in FIG. 1A), one or more of the tasks 105 execute on a first partition 103, thereby enabling simultaneous processing of at least some of the tasks 105 assigned to different partitions 103. Processor 104 may include any number of partitions 103. In some examples, partitions 103 are physically or virtually isolated from each other to prevent error propagation from one partition to another.
Each coprocessor 106 coupled to one or more processors 104 is configured to receive at least some of the processing offloaded by the processor 104. The system 100 includes a driver 108 that includes a processing engine scheduler (also referred to as a "scheduler") 109 and one or more contexts 110. Each context 110 includes hardware configured to provide spatial isolation. The multiple contexts 110 enable multiple partitions on the coprocessor to execute in parallel to support temporal and/or spatial partitioning.
For artificial intelligence processing models such as neural networks, the scheduler 109 may be an inference engine scheduler that utilizes inference processing to schedule workloads for execution on the coprocessor 106. The driver 108 and coprocessor 106 may utilize various types of processing, including compute and rendering. In one example where the system 100 is an RTOS, the driver 108 resides in the processor 104 and schedules the workload for execution based on the processing resources of the coprocessor 106. In another example, the driver 108 is implemented by software that is accessed exclusively by a server application to which one or more client applications submit workloads. The server typically retains exclusive access to the driver 108 and utilizes the driver 108 to schedule the workload on the coprocessor 106 when it receives a workload launch request from a task 105 executing on the processor 104. As shown in FIG. 1A, the driver 108 is implemented as a stand-alone unit, but in other examples (such as those previously described), the driver 108 is implemented by or otherwise part of the coprocessor 106 or processor 104.
The scheduler 109 of the driver 108 is configured to dispatch the workloads associated with the tasks 105 executed by the processor 104 to the computing units 115, 117, in some examples based on the timing windows of the processor 104. The scheduler 109 is configured to receive workload launch requests from the processor 104 and schedule the workloads for execution on the coprocessor 106. In some examples, scheduler 109 is configured to generate at least one launch request from the workload launch requests based on the scheduling policy. Some examples of scheduling policies are further described with respect to FIGS. 2A-7.
For each launch request generated by scheduler 109, the one or more contexts 110 include the workload to be scheduled and allocated to the processing resources of the coprocessor for execution. Each context 110 also includes one or more queues 111 that organize the workloads identified by the one or more launch requests. The launch requests in each queue 111 may be queued sequentially and scheduled or allocated based on the priority of the queue 111 relative to the other queues organized by the context 110. In some examples, the queues 111 are stored in a playlist listing the priority of each queue. Moreover, the driver 108 may include any number of contexts 110, and each context 110 may include any number of queues 111. In some examples, workload launch requests in different queues may be executed in parallel or in a different order, provided that the workloads in the queues are isolated from each other during processing.
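The relationship between a context, its prioritized queues, and queue selection can be sketched as follows; the class names and the highest-priority-first selection rule are assumptions made for illustration.

```python
class Queue:
    """Holds launch requests that must execute in order within the queue."""
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority        # priority relative to the other queues
        self.launch_requests = []       # appended in submission order

class Context:
    """Spatially isolated context (context 110) organizing one or more prioritized queues (queues 111)."""
    def __init__(self):
        self.queues = []

    def add_queue(self, queue):
        self.queues.append(queue)

    def next_queue(self):
        # Dispatch the highest-priority queue first; ties keep insertion order.
        return max(self.queues, key=lambda q: q.priority) if self.queues else None

ctx = Context()
ctx.add_queue(Queue("safety_critical", priority=10))
ctx.add_queue(Queue("best_effort", priority=1))
print(ctx.next_queue().name)   # -> safety_critical
```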
Coprocessor 106 also includes one or more command streamers 112 configured to schedule and allocate the workloads identified by the launch requests to available clusters 114 and/or 116 according to the coprocessor scheduling policy and the coprocessor allocation policy. Coprocessor 106 may include any number of command streamers 112, and in some examples, one or more command streamers 112 are shared among queues 111 and/or hosted by a dedicated context 110. Each cluster 114, 116 includes a set of respective computing units 115, 117 configured to perform data processing. In some examples, clusters 114, 116 are statically configured (e.g., hardwired) in coprocessor 106, with computing units 115 permanently associated with cluster 114 and computing units 117 permanently associated with cluster 116. Clusters 114, 116 are configured to perform the processing associated with one or more of the workloads associated with each queue 111 when the queues are allocated to the respective clusters by the command streamer 112.
As used herein, "compute unit" refers to a cluster's processing resources. Each computing unit 115, 117 may include one processing core (otherwise referred to as a "single core processing unit") or multiple processing cores (otherwise referred to as "multi-core processing units") for executing a workload, as presented to scheduler 109. The core may be a physical core or a virtual core. The physical core includes hardware (e.g., processing circuitry) that forms a core that physically handles the assigned workload. However, virtual cores may also be presented to scheduler 109 for processing of workloads, where each virtual core is implemented using an underlying physical core.
The processor 104 and the coprocessor 106 generally include a combination of processors, microprocessors, digital signal processors, application specific integrated circuits, field programmable gate arrays, and/or other similar variations thereof. The processor 104 and coprocessor 106 may also include or function with software programs, firmware, or other computer-readable instructions for performing the various process tasks, computations, and control functions used in the methods described herein. These instructions are typically tangibly embodied on any storage medium (or computer-readable medium) for storing computer-readable instructions or data structures.
Data from workload execution and other information may be stored in memory (not shown in FIG. 1A). Memory can include any available storage medium (or computer-readable medium) that can be accessed by a general purpose or special purpose computer or processor, or any programmable logic device. Suitable computer-readable media may include storage media or memory media such as semiconductor, magnetic, and/or optical media, and may be embodied as instructions stored in a non-transitory computer-readable medium such as cache, Random Access Memory (RAM), Read-Only Memory (ROM), non-volatile RAM, electrically erasable programmable ROM, flash memory, or other storage media. For example, where a workload currently executing on clusters 114, 116 is preempted by a higher priority workload, the progress of execution prior to preemption is stored in memory and is accessible for rescheduling by the command streamer 112 during a later period of time (e.g., after the preempting workload completes execution). In addition, the memory is configured to store application programs including tasks to be performed by the processor 104.
FIG. 1B depicts an exemplary implementation of a particular example of the system depicted in FIG. 1A. The same reference numerals used in FIG. 1B refer to specific examples of the components used in FIG. 1A. In FIG. 1B, system 100 includes a CPU 104 coupled to a GPU driver 108 in communication with a GPU 106. The CPU 104 is an implementation of the processor 104, the GPU driver 108 is an implementation of the driver 108, and the GPU 106 is an implementation of the coprocessor 106 described in FIG. 1A. The CPU 104, GPU driver 108, and all other components of the GPU 106 function similarly to those described in FIG. 1A.
FIGS. 2A-2B illustrate allocating queues to clusters of a coprocessor according to the scheduling and/or allocation techniques described herein and in FIGS. 1A-1B. FIG. 2A illustrates workload allocation in a temporally isolated (time-partitioned) configuration. In contrast, FIG. 2B illustrates workload allocation in a spatially isolated configuration.
Referring to FIG. 2A, one or more contexts 201-A include a queue 202 and a queue 203, where each queue includes one or more workloads (jobs) to be processed by the coprocessor. More specifically, queue 202 includes workload 202-A and workload 202-B, while queue 203 includes workload 203-A and workload 203-B. In the temporal allocation example shown in FIG. 2A, only one queue may execute at a given time to achieve temporal isolation between queue 202 and queue 203. Thus, queue 202 and queue 203 execute in order. In some examples, queues are allocated based on a priority order, with queue 202 being given higher priority (and thus executed first on the coprocessor), followed by queue 203. However, the workloads associated with a single queue may execute simultaneously on one or more available clusters 204 of the coprocessor.
In the example shown in FIG. 2A, at a first time interval (time interval 1 shown in FIG. 2A), workload 202-A and workload 202-B are assigned to an available cluster 204 for processing. While cluster 204 executes workload 202-A and workload 202-B of queue 202, queue 203 is queued next in context 201-A. Once all of the workloads associated with queue 202 have completed execution (or, in some examples, are preempted), at a subsequent time interval (time interval 2 shown in FIG. 2A), the workloads 203-A and 203-B associated with queue 203 are assigned to the available cluster 204 for execution. This workload allocation repeats iteratively as context 201-A receives additional queues from the driver 108.
In contrast, FIG. 2B illustrates an example of a spatially isolated execution system in which multiple different queues 202 and 203 may execute simultaneously while maintaining sufficient isolation between the queues. Specifically, queues 202 and 203 are queued in isolated context 201-B. This configuration enables workload 202-A of queue 202 and workload 203-A of queue 203 to execute simultaneously on cluster 204 in time interval 1. Similarly, at time interval 2, while workload 202-A and workload 203-A are being processed, workload 202-B of queue 202 and workload 203-B of queue 203 may be loaded on cluster 205. In some examples, cluster 204 (or a coprocessor partition including cluster 204) is associated with queue 202, and cluster 205 (or a coprocessor partition including cluster 205) is associated with queue 203.
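A minimal sketch contrasting the two configurations follows. It assumes each queue is a plain dictionary with a name, priority, and ordered workload list, and it pins each queue to its own cluster in the spatially isolated case, as in the example where cluster 204 serves queue 202 and cluster 205 serves queue 203; these simplifications are assumptions for illustration only.

```python
def run_time_isolated(queues, clusters):
    """FIG. 2A style: one queue at a time, its workloads spread over all clusters."""
    schedule = []
    for interval, queue in enumerate(sorted(queues, key=lambda q: -q["priority"]), 1):
        for workload, cluster in zip(queue["workloads"], clusters):
            schedule.append((interval, queue["name"], workload, cluster))
    return schedule

def run_space_isolated(queues, clusters):
    """FIG. 2B style: queues run concurrently, each pinned to its own cluster."""
    schedule = []
    for queue, cluster in zip(queues, clusters):
        for interval, workload in enumerate(queue["workloads"], 1):
            schedule.append((interval, queue["name"], workload, cluster))
    return schedule

queues = [{"name": "queue202", "priority": 2, "workloads": ["202-A", "202-B"]},
          {"name": "queue203", "priority": 1, "workloads": ["203-A", "203-B"]}]
clusters = ["cluster204", "cluster205"]
print(run_time_isolated(queues, clusters))   # queue202 fully in interval 1, queue203 in interval 2
print(run_space_isolated(queues, clusters))  # both queues progress in every interval
```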
As described in further detail below, in some examples, driver 108 is configured to schedule the workload according to a coprocessor scheduling policy such that the coprocessor is at least partially synchronized with the processor. In other examples, driver 108 is configured to allocate a workload to processing resources of the coprocessor according to a coprocessor allocation policy to optimize the use of processing resources on the coprocessor. Both the scheduling policy and the allocation policy may include rules that manage preemption of workloads based on workload priority. Although described separately for purposes of explanation, the workload scheduling, workload allocation, and workload preemption techniques may be used in combination.
Coprocessor scheduling policy
As previously described with respect to FIGS. 1A-1B, in some examples, the scheduler 109 of the driver 108 is configured to implement a coprocessor scheduling policy that schedules workloads associated with tasks 105 for execution by clusters 114, 116 of the coprocessor 106 in accordance with at least one launch request. The scheduling policy defines when a workload will be scheduled for execution on one or more specified clusters of the coprocessor 106. The examples described with respect to FIGS. 1A-7 include various illustrative examples of scheduling workloads on a coprocessor, such as a GPU.
Still referring to FIG. 1A, in one example, the scheduling policy performed by the scheduler 109 of the driver 108 optionally first selects a queue 111 containing the associated workloads to be scheduled to the clusters 114, 116 for processing. The selection may be triggered by an external event that initiates the prioritization of queues for the coprocessor 106. Such events include common timing events between the processor 104 and the coprocessor 106, such as updates from the processor 104, interrupt messages, or other external events. Selecting the queue 111 before scheduling the workloads received by the coprocessor 106 helps ensure adequate isolation between the multiple queues 111 during execution. In some examples, the queues 111 have associated processing parameters that the scheduler 109 considers when determining which queue 111 to select. Exemplary parameters include identifying parameters (e.g., partition ID and/or context ID), the minimum and/or maximum number of clusters required for the queue and/or the workloads within the queue, a priori assigned queue or workload priorities, budgets, preemption parameters, and other parameters. In one example, the budget corresponding to a workload refers to the longest expected time that it will take to fully execute the workload, plus a safety margin.
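As a concrete illustration of this selection stage, the sketch below filters queues by their cluster requirements and then orders them by priority. The dictionary keys mirror the parameters listed above; the tie-break on the smaller budget is an assumed heuristic, not something the disclosure specifies.

```python
def select_queue(queues, free_clusters):
    """Pick the next queue to dispatch, honoring cluster requirements and priority."""
    eligible = [q for q in queues if q["min_clusters"] <= free_clusters]
    if not eligible:
        return None
    # Highest priority wins; among equal priorities, the smaller budget goes
    # first (assumed tie-break) so short workloads are not starved.
    return min(eligible, key=lambda q: (-q["priority"], q["budget_ms"]))

queues = [
    {"context_id": 1, "priority": 5, "min_clusters": 2, "budget_ms": 8.0},
    {"context_id": 2, "priority": 5, "min_clusters": 1, "budget_ms": 3.0},
    {"context_id": 3, "priority": 1, "min_clusters": 1, "budget_ms": 1.0},
]
print(select_queue(queues, free_clusters=1)["context_id"])   # -> 2
```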
In the next phase of the scheduling policy, the scheduler 109 selects the workloads to be scheduled to the clusters 114, 116 for processing. Similar to the queue parameters described above, a workload may have associated parameters such as workload ID, partition ID, priority, budget, cluster requirements, preemption, number of cores, and other parameters. Once the scheduler 109 selects a queue 111 and the workloads associated with the selected queue 111, the scheduler 109 then generates one or more launch requests associated with the selected task 105 based on the coupling arrangement between the processor 104 and the coprocessor 106. Depending on the example, the coprocessor 106 may have varying degrees of synchronization with the processor 104. In one example, the coprocessor 106 is decoupled from the processor 104 and operates asynchronously to the processor 104. In that case, the launch requests generated by the scheduler 109 are scheduled as clusters 114, 116 become available on the coprocessor 106, according to the priority of the associated workload launch request. In this coupling arrangement, little if any preemption occurs on workloads already executing on the coprocessor 106.
In another example, the coprocessor 106 shares a loosely coupled arrangement with the processor 104. In this example, the coprocessor 106 operates with some degree of synchronization with the processor 104. For example, in a loosely coupled arrangement, the coprocessor 106 is synchronized with the processor 104 at data frame boundaries, and any unserviced workload remaining at the end of a data frame is cleared at the beginning of the subsequent data frame. Thus, the processor 104 and the coprocessor 106 have the same input and output data rates in a loosely coupled arrangement. However, the coprocessor 106 will typically operate asynchronously with the processor 104 during a timing window, meaning that partitions and/or tasks executing on the processor 104 may execute in parallel with unrelated partitions and/or workloads executing on the coprocessor 106. The loosely coupled arrangement may support preemptive and non-preemptive scheduling between the processor 104 and the coprocessor 106.
In yet another example, the coprocessor 106 shares a tightly coupled arrangement with the processor 104. In this example, the coprocessor 106 operates in high synchronization with the processor 104; that is, the coprocessor 106 synchronizes queue and/or workload execution with the corresponding tasks concurrently executed by the processor 104, based on the timing windows of the processor 104. The tightly coupled arrangement may be embodied in various ways. In one implementation, the coprocessor 106 is highly synchronized with the processor 104 during the same timing window; in other words, the coprocessor 106 executes, within that timing window, the workload associated with the one or more tasks currently executing on the processor 104. When the processor 104 executes another task in a subsequent timing window, the coprocessor 106 then executes the workload associated with that next task. In another implementation, the coprocessor 106 is synchronized with the processor 104 for subsequent timing windows, but the coprocessor 106 retains a degree of freedom to execute workloads associated with different tasks, consistent with other priority rules or processing availability.
The coupling arrangements may also be combined. For example, the coprocessor 106 may be loosely coupled with the processor 104 relative to one timing window, but tightly coupled with the processor 104 relative to another timing window. Thus, the scheduling policy may schedule the launch requests based on a combination of coupling arrangements between the processor 104 and the coprocessor 106, and may be dynamically updated as system scheduling parameters change.
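The practical difference between the coupling arrangements can be sketched as a small policy function that decides when a generated launch request may start and when unfinished work is cleared. The function below is an illustrative simplification: the names and the exact flush semantics are assumptions drawn from the frame-boundary (loosely coupled) and timing-window (tightly coupled) behavior described above.

```python
from enum import Enum, auto

class Coupling(Enum):
    DECOUPLED = auto()        # coprocessor runs asynchronously to the processor
    LOOSELY_COUPLED = auto()  # synchronized at data frame boundaries
    TIGHTLY_COUPLED = auto()  # synchronized with the processor's timing windows

def schedule_launch(coupling, now_ms, window_end_ms, frame_end_ms):
    """Return (earliest_start_ms, cleared_at_ms) for a generated launch request."""
    if coupling is Coupling.TIGHTLY_COUPLED:
        # Execute within the processor's current timing window; anything still
        # running at the window boundary is preempted (compare FIG. 4A).
        return now_ms, window_end_ms
    if coupling is Coupling.LOOSELY_COUPLED:
        # Execute asynchronously during the frame; unserviced work is cleared
        # at the start of the next data frame.
        return now_ms, frame_end_ms
    # DECOUPLED: start whenever a cluster frees up; nothing is flushed.
    return now_ms, float("inf")

print(schedule_launch(Coupling.LOOSELY_COUPLED, now_ms=3.0,
                      window_end_ms=5.0, frame_end_ms=20.0))   # -> (3.0, 20.0)
```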
FIG. 3 depicts a diagram 300 of scheduling a workload to multiple clusters of a coprocessor (illustratively a GPU) in a loosely coupled coprocessor scheme. During a first timing window (TW 1), the CPU 104 receives a new data frame (frame 1) and begins execution of a first CPU task at time 302. The CPU 104 then requests an "initialization" workload from cluster 1 of the GPU 106 at time 303 to confirm whether to initiate processing associated with the first CPU task. In some examples, GPU driver 108 determines that processing is not required for the associated task. For example, if the data frames correspond to different frames of a camera image, GPU driver 108 may determine that processing is not needed when the previous image frame is sufficiently similar to the current image frame. GPU 106 executes the "initialization" workload on cluster 1 at time 303 and confirms that the workload received at time 302 needs to be processed within timing window 1. Thus, cluster 1 of GPU 106 begins executing the workload associated with the first CPU task at time 304.
While cluster 1 of GPU 106 continues to process the workload from the task executed at time 302, timing window 2 (TW 2) begins at time 306 on the CPU 104, and the CPU 104 begins processing a second task at time 306. While cluster 1 executes the workload associated with the first CPU task, cluster 2 begins executing the workload associated with the next CPU task. At time 308, cluster 1 completes processing the workload associated with the first CPU task and begins processing the workload associated with the second CPU task. Thus, at time 308, clusters 1 and 2 both devote processing resources to executing the workload associated with the second CPU task. In this example, work previously performed only on cluster 2 has been extended and is now performed on both clusters 1 and 2. Clusters 1 and 2 then complete processing the workload associated with the second CPU task within timing window 2, at time 310. Because CPU 104 has no additional tasks to be scheduled within timing window 2, clusters 1 and 2 may be assigned to process low priority workloads at time 310 (if such workloads are available). For the highest priority workloads, the driver 108 is configured to prioritize scheduling so that these workloads begin executing within the earliest available timing window. In contrast, the driver 108 is configured to schedule the lowest priority workloads whenever processing resources become available. That is, the driver 108 makes a "best effort" to schedule low priority workloads within the earliest available timing window, but such low priority workloads may not be able to begin or complete execution once scheduled, due to insufficient processing resources or preemption by higher priority workloads. In avionics applications, high priority workloads are associated with high Design Assurance Levels (DALs) (e.g., A-C), while low priority workloads are associated with low DALs (e.g., D-E).
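A minimal sketch of this best-effort handling of low priority work, and of its preemption by higher priority work, is shown below. The numeric DAL-to-priority mapping and the class names are illustrative assumptions, not part of the disclosure.

```python
DAL_PRIORITY = {"A": 3, "B": 3, "C": 3, "D": 1, "E": 1}   # assumed example mapping

class Cluster:
    def __init__(self, name):
        self.name = name
        self.running = None            # (workload_name, priority) or None

    def submit(self, workload, priority):
        """Start the workload now if it outranks the current one; otherwise reject."""
        if self.running is None or priority > self.running[1]:
            preempted = self.running   # caller re-queues this later, if not None
            self.running = (workload, priority)
            return preempted
        return "rejected"              # best effort: retry when the cluster frees up

gpu_cluster = Cluster("cluster1")
gpu_cluster.submit("background_filter", DAL_PRIORITY["E"])
print(gpu_cluster.submit("flight_control_dnn", DAL_PRIORITY["A"]))   # preempts the DAL E job
```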
At time 312, the timing window changes to timing window 1 and the CPU 104 begins executing the third task. In some examples, the timing windows are scheduled sequentially via time division multiplexing. At time 314, GPU driver 108 receives an instruction from CPU 104 to begin executing a workload associated with a third CPU task. Because the third CPU task has a higher priority than the low priority workloads executed by clusters 1 and 2 after time 310, GPU driver 108 stops (or preempts) execution of the low priority workloads at time 314 and schedules the workloads associated with the third CPU task to cluster 1 for execution. Cluster 2 optionally remains idle at time 314 while cluster 1 executes the workload. At time 316, cluster 1 completes executing the workload associated with the third CPU task and both clusters 1 and 2 resume processing the low priority workload. Timing windows 1 and 2 may alternate as needed and may or may not be synchronized with the receipt of new data frames. While processing frame 1, GPU 106 may continue to process low priority workloads until CPU 104 performs another task that requires a workload on GPU 106. The number of timing windows in a data frame may vary and, in some examples, is specified independently of the coprocessor scheduling policy.
At time 318, the CPU 104 receives a new data frame (frame 2). Timing window 1 begins at time 320, shortly after receipt of data frame 2, and CPU 104 begins executing a fourth CPU task. At time 322, GPU 106 then executes the workload to determine whether the fourth CPU task requires processing; GPU driver 108 distributes this workload to cluster 1 as shown in FIG. 3. When cluster 1 executes the workload, it determines that no additional processing is required for the fourth CPU task, and at time 324, both clusters 1 and 2 begin executing low priority workloads for the remaining time of timing window 1.
At time 325, the CPU 104 executes the fifth CPU task and then, at time 326, sends a workload request to the GPU driver 108 for the workload associated with the fifth CPU task. In this case, the GPU 106 and the CPU 104 execute the corresponding workload and task in parallel, with the CPU 104 waiting for the GPU 106 to "catch up." Thus, the CPU 104 and GPU 106 execute in parallel. The workload associated with the fifth CPU task preempts the low priority workload previously executing on GPU 106. At time 328, clusters 1 and 2 complete processing the workload associated with the fifth CPU task and resume the low priority workload. Finally, at time 330, the CPU 104 performs a sixth CPU task in timing window 1 and determines that additional processing is not required to perform the sixth CPU task. Thus, GPU 106 continues to process the low priority workload for the remaining time of timing window 1, until a new frame of data (frame 3) is received.
FIGS. 4A-4B depict diagrams of scheduling a workload to multiple clusters of a GPU in a tightly coupled coprocessor scheme. Diagram 400A illustrates an example in which the GPU executes a workload associated with a CPU task within the same timing window. Diagram 400B illustrates an example in which the GPU executes a workload associated with a CPU task in a subsequent timing window. Although FIG. 4A shows a 1:1 correspondence between the timing windows of the CPU and the timing windows of the GPU, in other examples the CPU timing windows may have a different duration than the GPU timing windows. Moreover, the CPU may have a different number of timing windows than the GPU, and the GPU workloads associated with the CPU tasks may have different execution time requirements.
Referring first to diagram 400A, the CPU 104 executes a first CPU task within timing window 1 at time 401. GPU driver 108 then determines that processing is required for the first CPU task, and at time 402, GPU driver 108 schedules the workload associated with the first CPU task to cluster 1 for execution. Cluster 1 continues to execute the workload for the rest of timing window 1, but does not complete execution of the workload before timing window 2 begins at time 403. When timing window 2 begins, CPU 104 begins executing a second CPU task that requires processing from GPU 106. The workload handled by cluster 1 during timing window 1 is preempted at the timing window 2 boundary. Since in this example GPU 106 is synchronized with CPU 104 at the timing window boundary, cluster 1 stops processing the workload associated with the first CPU task once timing window 2 begins at time 403. At the same time, during timing window 2, cluster 2 begins processing the workload associated with the second CPU task.
At time 404, the timing window returns to timing window 1. At this point, the workload handled by cluster 2 during timing window 2 becomes preempted, and at time 404, cluster 1 resumes processing of the workload associated with the first CPU task that had previously been preempted at the beginning of timing window 2. As cluster 1 resumes processing of the first CPU task workload, CPU 104 also executes a third CPU task for processing. At time 405, cluster 1 completes processing of the workload associated with the first CPU task and begins processing of the workload associated with the third CPU task at time 406. At time 407, when the timing window returns to timing window 2, the processing of the third CPU task workload becomes preempted. At this point, cluster 2 resumes processing of the workload associated with the second CPU task.
At time 408, a new data frame (frame 2) is received. At time 409, the CPU 104 performs a fourth CPU task. At time 410, GPU driver 108 schedules a light workload associated with the fourth CPU task and determines that no additional processing is required for the fourth CPU task. Thus, GPU driver 108 schedules low priority workloads during the remainder of timing window 1. Once timing window 2 begins, the low priority workload is preempted and the CPU 104 performs a fifth CPU task. Then, at time 411, GPU driver 108 schedules the workload associated with the fifth CPU task to both clusters 1 and 2. At time 412, GPU 106 completes execution of the workload associated with the fifth CPU task and resumes processing of the low priority workload for the remainder of timing window 2.
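A small sketch of this window-boundary behavior follows; the WindowScheduler class and its method names are assumptions made for illustration, and the saved execution progress is modeled simply as the suspended workload list.

```python
class WindowScheduler:
    """Tightly coupled mode of diagram 400A: work is preempted at window boundaries."""

    def __init__(self):
        self.suspended = {}    # window_id -> workloads preempted while that window was active
        self.running = []      # workloads currently on the clusters

    def switch_to(self, window_id, new_workloads, old_window_id):
        # Suspend whatever the previous window left running; its progress is kept
        # so it can resume when that window comes around again.
        if self.running:
            self.suspended.setdefault(old_window_id, []).extend(self.running)
        # Resume this window's previously preempted work first, then start new work.
        self.running = self.suspended.pop(window_id, []) + new_workloads
        return self.running

sched = WindowScheduler()
sched.switch_to(1, ["task1_workload"], old_window_id=0)
sched.switch_to(2, ["task2_workload"], old_window_id=1)
print(sched.switch_to(1, ["task3_workload"], old_window_id=2))
# -> ['task1_workload', 'task3_workload']: the workload preempted at time 403 resumes first (time 404)
```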
FIG. 4B depicts an alternative tightly coupled coprocessor scheme that may be used alone or in combination with the scheme of FIG. 4A. Referring next to diagram 400B, at time 413, the CPU 104 performs a first CPU task. At time 414, the CPU 104 registers a corresponding workload associated with the first CPU task to determine whether additional processing is necessary. In one example, the CPU 104 may register the workload to be allocated to the GPU 106 in a list scheduled during a subsequent timing window. As shown in diagram 400B, CPU 104 does not perform additional tasks during the remainder of timing window 1. At time 415, the timing window changes to timing window 2. At this point in time, GPU driver 108 queues the registered workload from the previous timing window for execution. GPU driver 108 then schedules the workload associated with the first CPU task to cluster 1 for processing. GPU driver 108 determines that the first CPU task needs to be processed and begins processing the workload associated with the first CPU task on cluster 1 at time 416. At the same time, the CPU 104 performs a second CPU task at time 415 and registers the workload associated with the second CPU task for the next timing window. At time 418, cluster 1 completes execution of the workload associated with the first CPU task. Since no additional workload was queued during timing window 1, clusters 1 and 2 begin to process low priority workloads for the remainder of timing window 2.
At time 419, the timing window changes to timing window 1, and the workload queued from the previous timing window may now be executed by clusters 1 and/or 2 of GPU 106. However, in some examples, the queued workload is delayed by a specified time within the current timing window. For example, the queued workload optionally includes an estimated time required to complete the workload. If the estimated time is less than the duration of the current timing window, the queued workload may be delayed until the time remaining in the current timing window equals the estimated completion time of the queued workload. This is shown in diagram 400B: the workload associated with the second CPU task is not executed by GPU 106 at the beginning of timing window 1, but instead begins at time 421 after some time has elapsed. In the interim, both clusters 1 and 2 process low priority workloads from time 419 until the remaining time in timing window 1 equals the estimated completion time for the workload associated with the second CPU task.
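The delayed-start rule in this paragraph reduces to a short computation, sketched below with arbitrary example values.

```python
def delayed_start(window_start_ms, window_len_ms, estimated_ms):
    """Hold a queued workload until the time left in the window equals its
    estimated completion time, leaving the earlier part of the window free
    for low priority work (compare time 419 to time 421 in diagram 400B)."""
    if estimated_ms >= window_len_ms:
        return window_start_ms            # needs the whole window (or more), start at once
    return window_start_ms + (window_len_ms - estimated_ms)

# Timing window 1 spans 0-10 ms and the registered workload is expected to take
# 4 ms, so low priority work runs from 0-6 ms and the workload starts at 6 ms.
print(delayed_start(0.0, 10.0, 4.0))      # -> 6.0
```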
Also at time 419, the CPU 104 begins processing a third CPU task. At time 420, the CPU 104 registers a workload associated with the third CPU task, which GPU 106 begins executing on cluster 1 at time 422, during the subsequent timing window.
A new data frame is then received (frame 2). Beginning at time 423, the CPU 104 begins processing a fourth CPU task in timing window 1. Since no workload was registered by CPU 104 during the previous timing window, GPU 106 begins processing low priority workloads for the duration of timing window 1 beginning at time 423. Timing window 2 begins at time 424. At this point, cluster 1 of GPU 106 executes a light workload associated with the fourth CPU task while CPU 104 begins processing a fifth CPU task. At time 425, cluster 1 completes processing the light workload associated with the fourth CPU task and resumes processing the low priority workload (along with cluster 2) for the duration of timing window 2. At time 426, CPU 104 executes a sixth CPU task while clusters 1 and 2 begin processing the workload associated with the fifth CPU task from the previous timing window. Once that workload completes, clusters 1 and 2 resume processing the low priority workload beginning at time 427 for the duration of timing window 1.
In some examples, the priority order of a given workload is based on the timing window in which the workload was originally scheduled. For example, for a given set of three workloads (W1, W2, W3) to be executed on the GPU, W1 may have the highest priority in timing window 1 and thus will not be preempted by W2 or W3 during timing window 1. Once timing window 2 begins, the priorities may change so that W2 has the highest priority, enabling W2 to be scheduled immediately for execution and to preempt W1 if W1 did not complete execution during timing window 1. Similarly, once the timing window switches to timing window 3, W3 has the highest priority, can be immediately scheduled for execution, and can preempt W2. Thus, as the timing window changes between timing windows 1, 2, and 3, the priority order among the workloads assigned to the GPU may also change.
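A minimal sketch of this window-based priority rotation, under the assumption that each of the three workloads "owns" one of the three timing windows, might look as follows; the names and rotation rule are illustrative only.

WORKLOADS = ["W1", "W2", "W3"]

def priority_order(active_window):
    """Return the workloads ordered from highest to lowest priority for the
    given 1-based timing window, rotating ownership on each window change."""
    k = (active_window - 1) % len(WORKLOADS)
    return WORKLOADS[k:] + WORKLOADS[:k]

for window in (1, 2, 3):
    order = priority_order(window)
    print(f"timing window {window}: {order[0]} has the highest priority; order = {order}")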
Figs. 5A-5B depict diagrams of synchronous operation between a CPU and a GPU. In particular, fig. 5A depicts blocking CPU-GPU synchronization, in which the CPU schedules its tasks around processing performed concurrently by the GPU, and fig. 5B depicts launching multiple workloads associated with a single CPU task to multiple clusters of a GPU. In both fig. 5A and fig. 5B, the horizontal axis represents time.
Referring to FIG. 5A, system 500A includes any number of CPU cores (CPU 1, CPU 2, and up to an Nth core CPU N, where N is an integer). In addition, system 500A includes any number of GPU clusters (cluster 1, cluster 2, and up to an Nth cluster, cluster N, where N is an integer). Each CPU core is configured to execute tasks of one or more applications as previously described, and each GPU cluster is configured to execute workloads associated with the tasks executed by the CPU cores as previously described. At the beginning of a given time period (a single timing window or a period spanning multiple timing intervals of a data frame) as shown in fig. 5A, CPU cores CPU 1 and CPU 2 perform different tasks 502 and 504, respectively. Both tasks 502 and 504 require processing by a GPU cluster. Thus, one or more workloads 508 associated with task 504 executing on CPU 2 are scheduled to cluster 2, and one or more workloads 505 associated with task 502 executing on CPU 1 are scheduled to cluster 1.
In various examples, some CPU cores may perform tasks based on whether the corresponding GPU cluster is currently executing a workload. Consider, for example, CPU 1. As shown in fig. 5A, CPU 1 initially executes task 502 for a certain period of time. When CPU 1 executes task 502, cluster 1 begins executing the one or more workloads 505 associated with task 502. While cluster 1 executes the workload associated with this first CPU task, CPU 1 is configured to delay the processing of another task (task 510), which may be the same task as task 502, until cluster 1 completes executing the workload associated with the previous task. In this example, CPU 1 and cluster 1 are configured in a "blocking" synchronous configuration, in that CPU 1 blocks execution of new tasks for a period of time, until sufficient processing resources (i.e., the corresponding cluster on the GPU) are available.
In additional or alternative examples, some CPU-GPU synchronization configurations are not "blocking". In these examples, the CPU is configured to perform a task regardless of whether the corresponding GPU cluster is currently executing a workload associated with another task. As shown in fig. 5A, the CPU 2 initially executes task 504. Once CPU 2 completes executing task 504, one or more workloads associated with task 504 are scheduled to cluster 2 for execution. While cluster 2 executes workload 508, CPU 2 executes another task 506. Unlike CPU 1, CPU 2 is configured to immediately begin executing task 506 while cluster 2 executes the workload 508 associated with the previous task 504. Non-blocking execution may be achieved by: (1) frame buffering, wherein the CPU queues a plurality of frames for processing by the GPU; (2) parallel execution, wherein the CPU launches multiple workloads on the GPU in sequence (either by slicing a workload request into multiple buffered portions of the same sequence, or as multiple unrelated sequences); and (3) batching, wherein the CPU merges multiple similar requests into a single request that executes in parallel.
In some examples, CPU-GPU synchronization is partially "blocked" such that a CPU is free to perform tasks concurrently with a corresponding GPU until the GPU becomes overly backlogged with workload from the CPU. In this case, the CPU may wait until the GPU completes a certain amount of workload to "catch up" with the CPU. For example, CPU 2 may wait a certain period of time until cluster 2 completes execution of workload 508 before executing task 512.
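The blocking, non-blocking, and partially blocked synchronization styles described above can be sketched with a simple counter of workloads outstanding on the GPU cluster; the class name, threshold values, and counter model are assumptions made for illustration rather than a description of the actual CPU-GPU interface.

class CpuGpuLink:
    """Track how many workloads a CPU core has outstanding on its GPU cluster."""

    def __init__(self, max_outstanding):
        # 0 -> fully blocking (CPU 1 in FIG. 5A); a large value -> non-blocking;
        # a small positive value -> partially blocked ("catch up") behavior.
        self.max_outstanding = max_outstanding
        self.outstanding = 0

    def launch_workload(self):
        self.outstanding += 1

    def workload_completed(self):
        self.outstanding -= 1

    def cpu_may_start_next_task(self):
        return self.outstanding <= self.max_outstanding

blocking = CpuGpuLink(max_outstanding=0)
blocking.launch_workload()
print(blocking.cpu_may_start_next_task())            # False: the CPU waits on the cluster

partially_blocked = CpuGpuLink(max_outstanding=2)
partially_blocked.launch_workload()
print(partially_blocked.cpu_may_start_next_task())   # True: the backlog is still tolerated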
Referring now to fig. 5B, in some examples, workloads associated with one or more CPU tasks are scheduled to multiple GPU clusters. As shown in diagram 500B, the workloads associated with the tasks performed by CPU 1 are assigned to cluster 1. In particular, as CPU 1 executes task 514, task 518, task 526, and task 534 in order, one or more workloads 520, 528 associated with these tasks are scheduled to cluster 1 for execution. Stated otherwise, CPU 1 and cluster 1 share a 1:1 synchronization configuration, wherein the workloads associated with the tasks performed by CPU 1 are scheduled only to cluster 1 for execution. Conversely, when CPU 2 executes any of tasks 516, 530, and 532, one or more workloads associated with any of these tasks may be scheduled to cluster 2 and/or cluster N. Thus, CPU 2 shares a 1:N synchronization configuration, in which the workloads associated with tasks performed by CPU 2 are scheduled to multiple GPU clusters for execution. As shown in diagram 500B, when CPU 2 executes task 516, workload 522 associated with task 516 is scheduled to cluster N, and workload 524 associated with task 516 is additionally scheduled to cluster 2. The GPU clusters shown in diagram 500B may also implement the blocking synchronization techniques described for system 500A.
Examples of coprocessor scheduling policies (and coprocessor allocation policies described further herein) optionally include a preemption policy that manages preemption of workloads configured for execution by a coprocessor such as a GPU. Examples of preemption scheduling are shown in figs. 6A-6B (and depicted in other figures as well), where fig. 6A depicts a non-preemptive policy and fig. 6B depicts an example of a preemptive policy between a CPU and a GPU. Referring to fig. 6A, cluster 1 is configured to execute workloads associated with tasks performed by both CPU 1 and CPU 2. CPU 1 initially executes low priority task 602 while CPU 2 executes high priority task 604. Because both tasks require processing by the GPU, cluster 1 is configured to execute a workload 606 associated with the low priority task 602 (e.g., a best effort task) and also to execute a workload 608 associated with the high priority task 604 (e.g., a safety critical task). As shown in fig. 6A, CPU 1 completes low priority task 602 before CPU 2 completes high priority task 604. Thus, once CPU 1 completes executing low priority task 602, cluster 1 begins executing workload 606, while CPU 2 continues executing high priority task 604.
After the CPU 2 completes the high priority task 604, cluster 1 may execute the workload 608 associated with the high priority task 604. However, by the time CPU 2 completes the high priority task 604, cluster 1 is already executing the workload 606 associated with the low priority task 602 that executed on CPU 1. In the non-preemptive example shown in fig. 6A, the workload 606 associated with the low priority task 602 is not preempted by workload launch requests that include higher priority workloads. Thus, only once cluster 1 completes executing workload 606 does cluster 1 execute the workload 608 associated with the high priority task 604, during a subsequent timing period. Stated otherwise, a low priority workload executed by a GPU cluster cannot be preempted by a subsequent workload launch request having a higher priority workload configured for execution by the same GPU cluster. In these examples, subsequent workload launch requests are instead serviced on a first-come, first-served basis.
In contrast, FIG. 6B shows an example of preemptive scheduling in which lower priority workloads are preempted by higher priority workloads. As previously described, task 604 is higher in priority than task 602; however, cluster 1 initially executes the workload 606 associated with the low priority task 602 because CPU 1 completes executing the low priority task 602 before CPU 2 completes executing the high priority task 604. After CPU 2 completes executing high priority task 604, cluster 1 may execute workload 608 associated with high priority task 604. Workload 608 is higher in priority than workload 606 currently being executed by cluster 1, so cluster 1 is configured to stop execution of the lower priority workload 606 and begin execution of the high priority workload 608. That is, the high priority workload 608 preempts the low priority workload 606. In some examples, cluster 1 is configured to resume execution of low priority workload 606 upon completion of high priority workload 608, and optionally only when there is no pending workload launch request with a higher priority than low priority workload 606. These examples are useful when the progress of the lower priority workload can be stored and accessed for processing in a subsequent time period. In other examples, preempting the lower priority workload resets progress on the lower priority workload and thereby requires the GPU to restart the lower priority workload from the beginning. These examples require less processing bandwidth than examples in which workload progress is stored.
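By way of illustration only, the following sketch captures the arbitration difference between the non-preemptive policy of fig. 6A and the preemptive policy of fig. 6B for a single cluster; the dictionary fields, priority scale, and returned strings are assumptions made for the example and are not part of the disclosed embodiments.

def handle_launch_request(current, incoming, preemption_enabled, resume_supported):
    """current / incoming are dicts with 'name' and 'priority' (higher number =
    higher priority); return a description of the scheduling decision."""
    if current is None:
        return f"run {incoming['name']} immediately"
    if not preemption_enabled or incoming["priority"] <= current["priority"]:
        return f"queue {incoming['name']} behind {current['name']} (first-come, first-served)"
    action = "save progress of" if resume_supported else "reset"
    return f"{action} {current['name']} and run {incoming['name']} now"

low = {"name": "workload 606 (best effort)", "priority": 1}
high = {"name": "workload 608 (safety critical)", "priority": 9}
print(handle_launch_request(low, high, preemption_enabled=False, resume_supported=True))
print(handle_launch_request(low, high, preemption_enabled=True, resume_supported=True))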
In one example of implementing workload preemption on a coprocessor (such as a GPU), the coprocessor receives a request from a driver specifying, for example, when preemption of a lower priority workload is to occur. The preemption policy may be implemented in hardware and/or software. In one hardware example, preemption occurs at command boundaries, such that a lower priority workload (or a context including a set of lower priority workloads) is preempted once a command is completed, or at the earliest preemptible command (i.e., the earliest point at which the GPU can take up the next command). In another hardware example, preemption occurs at thread boundaries, where a lower priority context stops issuing additional lower priority workloads and becomes preempted once all of its currently executing workloads are completed. In yet another hardware example, workload execution is preempted by saving the workload state into memory, from which it can be restored once execution resumes. In another hardware example, preemption occurs during execution of a thread, where the GPU may immediately stop execution of the lower priority thread and store the previously executed portion of the thread into memory for later execution.
The coprocessor may also implement preemption in software. In one software example, preemption occurs at thread boundaries, as previously described for the hardware implementation. In another software example, preemption occurs immediately upon receipt of a request, and any currently or previously executing workload within the same context must restart in a later time period, similar to resetting the lower priority workload as described with respect to fig. 6B. In another software example, preemption occurs at a defined checkpoint during execution of a workload. Checkpoints may be set at workload boundaries or at any point during execution of a workload, and the progress up to the checkpoint is saved for future execution. Once the coprocessor can continue executing the workload, it resumes execution at the saved checkpoint. Additionally or alternatively, the workload is sliced into multiple sub-portions (e.g., sub-kernels) before the coprocessor executes the workload launch request, and preemption may occur at any sliced sub-kernel boundary.
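A minimal sketch of the checkpoint/sub-kernel variant, assuming a workload that has already been sliced into sub-kernels and a preemption request delivered as an index, is shown below; the slicing granularity and function signature are illustrative assumptions, not the claimed implementation.

def execute_sliced(workload, start_index, preempt_requested_at):
    """Run sub-kernels from start_index; return (completed, resume_index)."""
    for i in range(start_index, len(workload["sub_kernels"])):
        workload["sub_kernels"][i]()           # run one sub-kernel to completion
        if preempt_requested_at is not None and i >= preempt_requested_at:
            return False, i + 1                # checkpoint: resume here later
    return True, len(workload["sub_kernels"])

work = {"sub_kernels": [lambda: None] * 8}
done, checkpoint = execute_sliced(work, 0, preempt_requested_at=2)
print(done, checkpoint)                         # False 3 -> preempted after slice 2
done, checkpoint = execute_sliced(work, checkpoint, None)
print(done, checkpoint)                         # True 8 -> resumed and finished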
FIG. 7 depicts a flowchart illustrating an exemplary method for scheduling a workload to a coprocessor for execution. The method 700 may be implemented via the techniques described with respect to fig. 1-6, but may also be implemented via other techniques. For ease of explanation, the blocks of the flow diagrams are arranged in a generally sequential manner; however, it is to be understood that such an arrangement is merely exemplary, and it is to be appreciated that the processes associated with the methods described herein (and the blocks shown in the figure) may occur in different orders (e.g., with at least some of the processes associated with the blocks performed in a parallel manner and/or in an event-driven manner).
The method 700 includes block 702 of receiving a workload launch request from one or more tasks executing on a processor, such as by a driver implemented on a coprocessor or other processing element. The workload launch request includes a list of workloads associated with tasks performed by the processor and may include other parameters, such as the priority of the workloads in the list and the processing resources required to execute the corresponding workloads on the coprocessor. At block 704, the method 700 proceeds by generating at least one launch request according to the workload launch request and based on a coprocessor scheduling policy. The driver or other processing unit may then schedule the workload for execution on the coprocessor based on the launch request and the coprocessor scheduling policy (block 705).
Depending on the example, method 700 proceeds based on the terms of the coprocessor scheduling policy. Optionally, the method 700 proceeds to block 706 and schedules the workload for execution independent of a time period (e.g., a timing window and/or data frame boundary) of the processor or other external event. In this loosely coupled configuration, the coprocessor may schedule the workload asynchronously with the timing of the processor. Such loosely coupled configurations optionally enable workload scheduling based on a priority order among the workloads received by the coprocessor. For example, even though the coprocessor may schedule workloads asynchronously with the processor timing windows, the coprocessor scheduling policy may include a preemption policy under which lower priority workloads currently executing or queued on the coprocessor are preempted by higher priority workloads.
Additionally or alternatively, the method 700 optionally proceeds to block 708 and schedules a workload for execution based on a timing window of the processor. In one implementation, the method 700 schedules the workload for execution on the coprocessor during the same timing window of the processor. In another implementation, the method 700 schedules the workload for execution on the coprocessor in the same timing window, wherein the coprocessor 106 is synchronized with the processor 104 in that timing window but retains the freedom to execute, on the coprocessor, workloads associated with different queues and/or tasks, consistent with other priority rules or processing availability. That is, the coprocessor scheduling policy optionally includes a preemption policy that applies to the close-coupled configuration and schedules workloads for execution based on the priority order of the workloads. When the workload launch request includes a workload with a higher priority than the workload currently executing on the coprocessor, the coprocessor scheduling policy configures the coprocessor to preempt the lower priority workload and synchronize the higher priority workload to a subsequent timing window of the processor, or even to another common timing window between the coprocessor and the processor.
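By way of illustration only, the following sketch shows how a driver-side scheduler might pick an execution slot for the workloads of a launch request under loosely coupled, same-window, or deferred close-coupled scheduling; the policy names, fields, and window labels are assumptions made for the example and do not define the claimed policies.

def schedule(launch_request, policy, current_window, next_window):
    """Return the execution slot and the priority-ordered workload names."""
    ordered = sorted(launch_request["workloads"],
                     key=lambda w: w["priority"], reverse=True)
    if policy == "loosely_coupled":
        slot = "whenever coprocessor resources allow (asynchronous to CPU windows)"
    elif policy == "tightly_coupled_same_window":
        slot = f"within {current_window}"
    else:  # deferred close coupling
        slot = f"deferred to {next_window} (or to a common external event)"
    return slot, [w["name"] for w in ordered]

request = {"workloads": [{"name": "rendering", "priority": 2},
                         {"name": "sensor fusion", "priority": 7}]}
print(schedule(request, "loosely_coupled", "timing window 1", "timing window 2"))
print(schedule(request, "deferred", "timing window 1", "timing window 2"))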
Coprocessor allocation policy
As previously described with respect to fig. 1A, in some examples, the coprocessor 106 (and in particular the driver 108) is configured to implement a coprocessor allocation policy that allocates workloads associated with the tasks 105 to the clusters 114, 116 for execution in accordance with at least one workload launch request. The coprocessor allocation policy defines where a workload executes, i.e., on which specified cluster or clusters of the coprocessor 106. Figures 8-11 illustrate various examples of distributing workloads across a coprocessor, such as coprocessor 106. The allocation techniques further described herein may be implemented in connection with the previously described scheduling policies or as stand-alone examples. For example, an allocation policy may implement the preemption techniques described in the context of scheduling workloads for execution.
Figs. 8A-8B depict diagrams of data coupling between multiple clusters of a coprocessor. In the example shown in fig. 8A, each cluster (cluster 1, cluster 2, cluster 3) is configured to execute a workload associated with a task performed by the processor. In some examples, such as shown in fig. 8A, the data is coupled at a frame boundary; stated otherwise, the data is coupled between the coprocessor and the processor at the frame rate of the input data provided to the processor. This data coupling is shown in fig. 8A by the dashed line extending from the frame 1 boundary indicated by numeral 802. At frame 1 boundary 802, each cluster begins processing the workload associated with the same processor task and continues to do so even across the timing window boundary indicated by numeral 804. The data is not coupled at the timing window boundaries, which enables the clusters to continue processing the workload independent of timing window 1 boundary 804 and timing window 2 boundary 806. Once the new frame 2 boundary 808 arrives and data is received by the processor (not shown in fig. 8A), the data from the new data frame is again coupled between the clusters.
Fig. 8B depicts a diagram of data coupling at a timing window boundary, as opposed to a data frame boundary. At frame 1 boundary 802, no data is allocated to the clusters. Instead, the workload associated with the processor task is distributed to all three clusters at timing window 2 boundary 804. Between timing window 2 boundary 804 and timing window 1 boundary 806 (i.e., during timing window 2), the clusters may finish processing the workload associated with one processor task and may begin processing the workload associated with another processor task, or may begin processing low priority workloads, until the data between the clusters is again coupled at timing window boundary 806. The data coupling schemes depicted in figs. 8A-8B are not mutually exclusive and may be combined so that data is coupled at both the timing window boundaries and the frame boundaries.
FIG. 9 depicts a diagram of coprocessor allocation policies applied to multiple clusters of coprocessors (such as GPUs). The allocation policy described in the context of fig. 9 may be modified based on the data coupling techniques of fig. 8A-8B. Although four allocation policies are shown in fig. 9, other allocation policies and combinations thereof are possible. For example, at a first timing interval (e.g., within one data frame), the workload is allocated to the clusters according to an interleaving policy, whereas at a different timing interval (e.g., a subsequent data frame), the workload is allocated according to a policy-distributed policy. In addition, an allocation policy may be applied to each cluster of GPUs or to a subset of clusters, where one subset of clusters of GPUs may follow a different allocation policy than another subset of clusters.
In some examples, GPU jobs are assigned to clusters using an exclusive policy, wherein workloads associated with different CPU tasks are assigned exclusively to different clusters within one or more timing intervals. Referring to fig. 9, a job (job A) associated with a first CPU task is assigned to cluster 1, a job (job B) associated with a second CPU task is assigned to cluster 2, and a job (job C) associated with a third CPU task is assigned to cluster 3, wherein the first, second, and third CPU tasks are different from each other. The exclusive policy may be implemented in different ways. For example, in an exclusive access policy, all clusters of the GPU (and thus all computing units) are dedicated to jobs from the same CPU task. In an exclusive slicing policy, the GPU is sliced into multiple GPU partitions, each partition comprising one or more clusters or portions of clusters. In this example, workloads from the same CPU task are assigned to only a single GPU partition or cluster set. In the case of multiple CPU tasks, the workloads from each CPU task are assigned to a different partition of the GPU, respectively. For GPUs that have sufficient isolation to include multiple isolated clusters or partitions, an exclusive policy (or any of the allocation policies further described herein) enables workload allocation at a partition level or cluster level. This allocation policy may be used to achieve spatial isolation as described above.
In contrast to the exclusive allocation policy, the interleaved allocation policy allocates workloads associated with the same CPU task to multiple clusters of the GPU simultaneously. As shown in fig. 9, the job associated with a first CPU task (job A) is assigned to clusters 1-3 for processing within a given time period (e.g., a timing window, a data frame, or a portion thereof), followed by the job associated with a second CPU task (job B) being assigned to those same clusters at the next data coupling boundary. This process may be repeated for any additional CPU tasks that need to be allocated within the given time period. In addition, incomplete workloads from an earlier data coupling boundary may be resumed and distributed to all three clusters simultaneously at the next data coupling boundary. As shown in fig. 9, job A is first assigned to clusters 1-3, then job B, then job C, and the sequence repeats for subsequent timing intervals. This allocation policy may be used to achieve temporal isolation as described above.
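By way of illustration only, the exclusive and interleaved policies can be sketched as simple mappings from jobs to clusters per timing interval; the data layout and function names below are assumptions made for the example, not the disclosed implementation.

def exclusive_allocation(jobs, clusters):
    """Each job is pinned to its own dedicated cluster for every interval."""
    return {cluster: [job] for job, cluster in zip(jobs, clusters)}

def interleaved_allocation(jobs, clusters, intervals):
    """All clusters execute the same job during a given timing interval,
    cycling through the jobs on successive data-coupling boundaries."""
    return {f"interval {t}": {c: jobs[t % len(jobs)] for c in clusters}
            for t in range(intervals)}

jobs = ["job A", "job B", "job C"]
clusters = ["cluster 1", "cluster 2", "cluster 3"]
print(exclusive_allocation(jobs, clusters))
print(interleaved_allocation(jobs, clusters, intervals=3))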
Both the exclusive allocation policy and the interleaved allocation policy are static allocation policies that allocate workloads to clusters/partitions independent of workload priority or computing resources. In contrast, the policy-distributed allocation policy exemplifies a dynamic allocation policy that considers workload priorities and the computing resources of the clusters/partitions. A workload associated with a processor task that is higher in priority than another workload associated with another processor task will typically be assigned before the lower priority workload and will typically be assigned to more of the available clusters than the lower priority workload. The number of clusters or partitions to which a workload is allocated depends on the amount of resources necessary to process the workload and/or the amount of computing resources currently available in the coprocessor.
In the example depicted in fig. 9, job A requires the largest amount of computing resources to process, while job B requires the smallest. To accommodate the larger amount of necessary resources, job A occupies all of the computing resources of cluster 1 for a given timing interval, and occupies the computing resources of clusters 2 and 3 for a portion of that timing interval. In contrast, job B occupies the computing resources of cluster 2 for only a portion of the given timing interval. The policy-distributed policy may be used to evaluate a new set of GPU workloads for subsequent timing windows so as to dynamically adjust the allocation of the workloads, allowing higher priority or larger workloads to begin once computing resources become available. In a policy-distributed allocation policy (such as the allocation policy shown in fig. 9), any processing resources that become available after execution of a workload during a given timing window are then assigned to the highest priority workload in the timing window (even if the highest priority workload was scheduled after a lower priority workload). As shown in FIG. 9, the highest priority workload is job A, and once jobs B and C complete execution, the processing resources of clusters 2 and 3 are allocated to job A.
A workload may sometimes require more processing than the computing resources currently available in the coprocessor can provide. Thus, in some examples, the allocation policies (including any of the previously described allocation policies) include rules that govern the allocation of queued workloads exceeding the currently available computing resources, depending on the hardware and system parameters of the coprocessor. In one example, a workload exceeding the currently available computing resources simply remains queued until enough computing resources become available to meet the processing requirements of the workload, thereby leaving a limited number of computing resources unused until a subsequent time period. In another example, the available computing resources are allocated to the highest priority workload currently executing on the coprocessor; that is, the highest priority workload currently executing receives more processing resources (e.g., clusters, partitions, or computing units) than originally requested. In another example, the heavy workload begins executing even though the currently available computing resources are insufficient. In yet another example, the highest priority workload whose processing requirements can be met by the available computing resources is allocated those resources.
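A minimal sketch of these four over-subscription options, assuming a simple integer count of free compute units, is shown below; the option names and data layout are illustrative assumptions rather than the disclosed implementation.

def handle_oversubscription(queued, free_units, running, option):
    """Decide what to do when 'queued' requests more compute units than are free."""
    need = queued["requested_units"]
    if option == "stay_queued":
        return "launch" if need <= free_units else f"hold until {need} units are free"
    if option == "boost_running":
        # Give the spare units to the highest priority workload already running.
        top = max(running, key=lambda w: w["priority"])
        return f"grant {free_units} extra units to {top['name']}"
    if option == "start_anyway":
        return f"launch {queued['name']} on only {free_units} units"
    # "best_fit": launch the highest priority queued workload that actually fits.
    return "launch" if need <= free_units else "offer the units to a smaller queued workload"

running = [{"name": "job A", "priority": 9}, {"name": "job C", "priority": 3}]
heavy = {"name": "job D", "requested_units": 6}
print(handle_oversubscription(heavy, 4, running, "stay_queued"))    # hold until 6 units are free
print(handle_oversubscription(heavy, 4, running, "boost_running"))  # grant 4 extra units to job A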
FIG. 9 also depicts an example of a shared allocation policy. The shared allocation policy allocates workloads associated with different processor tasks to the same cluster of the coprocessor, which processes the workloads simultaneously within a timing interval. For example, cluster 1 receives portions of job A, job B, and job C, which it processes simultaneously within a timing window. Clusters 2 and 3 also receive portions of job A, job B, and job C within the same timing window. The portions shared between the clusters may or may not be equal and, in some examples, depend on the processing power of the clusters and the processing requirements of the workloads.
Coprocessor allocation policies may include combinations of policies described herein. For example, coprocessor allocation policies may include a hybrid exclusive sharing policy in which one or more clusters are exclusively allocated workloads (i.e., one cluster receives a workload associated with one queue and another cluster receives a workload associated with another queue), while another cluster implements a sharing policy that includes workloads associated with different tasks.
Figs. 10A-10C depict block diagrams illustrating an exemplary system configured to distribute a workload to multiple clusters of a GPU 1000. GPU 1000 includes hardware 1004 configured to execute workloads associated with CPU tasks that need to be processed on GPU 1000 and, in some examples, includes hardware or software isolation between different clusters to reduce or eliminate the effect of processing faults on parallel processing. Referring to FIG. 10A, a shared context 1012 includes a queue 1014 containing one or more workloads (jobs) 1015 corresponding to a first CPU task. Workload 1015 is distributed to one or more of the clusters 1006 or computing units 1007 based on the coprocessor scheduling policy and/or the coprocessor allocation policy. The context 1012 optionally includes one or more additional workloads 1016 associated with different CPU tasks.
The workload 1015 and the optional workload 1016 are sent to the command streamer 1010 for distribution to the clusters 1006 or 1008. For example, if the queue 1014 includes only the workload 1015, the workload 1015 is allocated to at least one of the clusters 1006, each of which includes a plurality of computing units 1007. However, when queue 1014 also contains workload 1016, command streamer 1010 is configured to allocate workload 1016 to at least one computing unit 1009 of cluster 1008. In other examples, the allocation of the workload is managed in software. As shown in fig. 10A, GPU 1000 includes a single command streamer 1010 that corresponds to the single queue 1014.
In some examples, GPU 1000 includes multiple queues and command streamers that distribute workloads to different computing resources on the GPU. For example, fig. 10B depicts a context 1012 that includes a first queue 1014 and a second queue 1013. The first queue 1014 is distinct from the second queue 1013 so as to provide sufficient spatial or temporal isolation, and includes one or more workloads 1015 corresponding to the first CPU task, while the second queue 1013 includes one or more workloads 1016 corresponding to the second CPU task. GPU 1000 also includes a first command streamer 1010 and a second command streamer 1011. In this example, the first command streamer 1010 is configured to receive the workload 1015 from the first queue 1014 and to distribute the request to one or more of the clusters 1006 (or computing units 1007 of the clusters 1006) for processing the workload in the request. Meanwhile, the second command streamer 1011 is configured to receive the workload 1016 from the second queue 1013 and distribute the workload to the cluster 1008 comprising a plurality of computing units 1009. Although fig. 10B shows the first command streamer 1010 coupled to two clusters 1006 and the second command streamer 1011 coupled to a single cluster 1008, the first command streamer 1010 and the second command streamer 1011 may be configured, in software, hardware, or a combination thereof, to distribute workloads to any number of clusters supported by the GPU 1000.
In another example, fig. 10C depicts multiple individual contexts 1012A and 1012B, each including a respective workload 1015 and 1016. Each context is distinct and isolated from the other contexts by hardware or software constraints to provide sufficient spatial or temporal isolation. In addition, each context may be configured to provide a workload to a respective command streamer for distribution to the clusters of GPU 1000. As shown in fig. 10C, context 1012A provides workload 1015 to command streamer 1010 and context 1012B provides workload 1016 to command streamer 1011. The command streamer 1010 then distributes workload 1015 to one or more clusters 1006, and the command streamer 1011 distributes workload 1016 to cluster 1008. Within a given context, the workloads may be executed in parallel or out of order.
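By way of illustration only, the queue-to-streamer-to-cluster routing of figs. 10B-10C can be sketched as a small routing table; the table contents, round-robin rule, and names (including the "1006a"/"1006b" labels) are assumptions made for the example rather than elements of the disclosed system.

ROUTES = {
    "queue 1014 (CPU task 1)": {"streamer": "command streamer 1010",
                                "clusters": ["cluster 1006a", "cluster 1006b"]},
    "queue 1013 (CPU task 2)": {"streamer": "command streamer 1011",
                                "clusters": ["cluster 1008"]},
}

def dispatch(queue_name, workloads):
    """Route a queue's workloads through its streamer to the clusters it owns."""
    route = ROUTES[queue_name]
    # Round-robin the queue's workloads over the clusters wired to its streamer.
    return [(w, route["streamer"], route["clusters"][i % len(route["clusters"])])
            for i, w in enumerate(workloads)]

for entry in dispatch("queue 1014 (CPU task 1)", ["workload 1015a", "workload 1015b"]):
    print(entry)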
FIG. 11 depicts a flowchart illustrating an exemplary method for allocating workloads to processing resources of a coprocessor. The method 1100 may be implemented via the techniques described with respect to fig. 1-10, but may also be implemented via other techniques. For ease of explanation, the blocks of the flow diagrams are arranged in a generally sequential manner; however, it is to be understood that such an arrangement is merely exemplary, and it is to be appreciated that the processes associated with the methods described herein (and the blocks shown in the figure) may occur in different orders (e.g., with at least some of the processes associated with the blocks performed in a parallel manner and/or in an event-driven manner).
Method 1100 includes receiving one or more workload launch requests from one or more tasks executing on a processor, as indicated in block 1102. The method 1100 then proceeds to block 1104 by generating at least one launch request including one or more workloads based on the coprocessor allocation policy. The method 1100 then proceeds to block 1105 by distributing the workloads identified in the launch request to processing resources on the coprocessor based on the coprocessor allocation policy. For example, the method 1100 optionally proceeds to block 1106 to allocate each workload of the launch request to a dedicated cluster of computing units according to an exclusive policy.
Additionally or alternatively, the method 1100 proceeds to block 1108 and distributes each workload of the launch request across a plurality of different clusters according to an interleaving policy. In one example of this policy, a first workload in a launch request (e.g., the workload with the highest priority) is assigned to all clusters during a first timing interval, then a second workload is assigned to all clusters during a second timing interval, and so on, such that each workload is assigned to each of the clusters sequentially.
Additionally or alternatively, the method 1100 proceeds to block 1110 and each workload of the launch request is distributed to at least one cluster, based on computing parameters and/or the priority of the workload, according to a policy-distributed policy. For example, each workload is individually assigned to at least one cluster for the duration of its execution on the cluster. A workload associated with a processor task that is higher in priority than another workload will typically be assigned before the lower priority workload and will typically be assigned to more of the available clusters than the lower priority workload. The number of clusters or partitions to which a workload is allocated depends on the amount of resources necessary to process the workload and/or the amount of computing resources currently available in the coprocessor.
In some examples, coprocessor allocation policies include rules that govern the allocation of queued workloads exceeding the currently available computing resources, depending on the hardware and system parameters of the coprocessor. In one example, a workload exceeding the currently available computing resources simply remains queued until enough computing resources become available to meet the processing requirements of the workload, thereby leaving a limited number of computing resources unused until a subsequent time period. In another example, the available computing resources are allocated to the highest priority workload currently executing on the coprocessor; that is, the highest priority workload currently executing receives more processing resources (e.g., clusters, partitions, or computing units) than originally requested. In another example, the heavy workload begins executing even though the currently available computing resources are insufficient. And in yet another example, the highest priority workload whose processing requirements can be met by the available computing resources is allocated those resources.
Additionally or alternatively, the method 1100 proceeds to block 1112 and allocates the plurality of workloads of the launch request among a plurality of clusters during the same timing interval, such that portions of the workloads are shared among the plurality of clusters according to a shared allocation policy. In one example, each workload in the launch request is shared across all clusters during the same timing interval such that each cluster processes each workload simultaneously. Other coprocessor allocation policies are possible.
FIG. 12 depicts a flowchart illustrating an exemplary method for managing processing resources while workloads execute on a coprocessor, such as when a currently executing workload exhausts its execution budget on the coprocessor. The method 1200 may be implemented via the techniques described with respect to fig. 1-11, but may also be implemented via other techniques. For example, method 1200 may be implemented in sequence as workloads are scheduled and/or distributed to coprocessors for execution, but may also be implemented during scheduling events to maintain proper synchronization between the processor and coprocessor. For ease of explanation, the blocks of the flow diagrams are arranged in a generally sequential manner; however, it is to be understood that such an arrangement is merely exemplary, and it is to be appreciated that the processes associated with the methods described herein (and the blocks shown in the figure) may occur in different orders (e.g., with at least some of the processes associated with the blocks performed in a parallel manner and/or in an event-driven manner).
Method 1200 includes block 1202, receiving information describing workload budget constraints, e.g., from a workload launch request received by the driver. When the currently executing workload completes or exhausts its budget, the method 1200 proceeds to block 1203 and determines whether any additional processing budget remains after processing the workload on the coprocessor. If additional processing budget remains, the method 1200 proceeds to block 1204 and obtains the corresponding task budget from the completed workload, and additionally receives the priority of the completed workload. From there, the additional budget and priority information can be used to handle queued workloads during subsequent timing intervals.
If no budget is available, the method 1200 proceeds to block 1206 to preempt and/or stop the currently executing workload. Optionally, the method 1200 may then proceed to block 1208 and reschedule the workload to the coprocessor for execution. This example may be implemented when the workload is scheduled according to a coprocessor scheduling policy as previously described. Additionally or alternatively, the method 1200 proceeds to block 1210 (directly from block 1208 or from block 1206) to reallocate workload priorities and optionally reschedule the workload for execution based on the updated workload priorities.
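By way of illustration only, the budget handling of blocks 1202-1210 can be sketched as a single decision function; the bookkeeping fields and the particular re-prioritization rule (lowering the priority by one step) are assumptions made for the example and are not specified by the disclosed embodiments.

def on_budget_event(workload, used, budget):
    """Decide what to do when a workload completes or its budget expires."""
    remaining = budget - used
    if remaining > 0:
        # Block 1204: carry the leftover budget and the workload's priority
        # forward for use by queued workloads in later timing intervals.
        return {"action": "harvest", "carryover": remaining,
                "priority": workload["priority"]}
    # Blocks 1206-1210: budget exhausted -> preempt/stop the workload, then
    # optionally reschedule it and/or adjust its priority (assumed rule here).
    return {"action": "preempt", "reschedule": True,
            "new_priority": max(workload["priority"] - 1, 0)}

print(on_budget_event({"name": "W1", "priority": 5}, used=3.0, budget=4.0))
print(on_budget_event({"name": "W2", "priority": 5}, used=4.0, budget=4.0))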
Fig. 13 depicts a flowchart illustrating an exemplary method for prioritizing workloads within contexts. The method 1300 may be implemented via the techniques described with respect to fig. 1-12, but may also be implemented via other techniques. For ease of explanation, the blocks of the flow diagrams are arranged in a generally sequential manner; however, it is to be understood that such an arrangement is merely exemplary, and it is to be appreciated that the processes associated with the methods described herein (and the blocks shown in the figure) may occur in different orders (e.g., with at least some of the processes associated with the blocks performed in a parallel manner and/or in an event-driven manner).
Method 1300 includes block 1302, sorting workloads from one or more workload launch requests into one or more contexts. In some examples, the coprocessor includes a plurality of queues, each queue independently configured with different workloads that are isolated from the workloads associated with the other queues. In these examples, the workloads are sorted into each of a plurality of contexts. Alternatively, for a coprocessor with only one context, all workloads are sorted into a single context.
The method 1300 then proceeds to block 1304 by sorting the workloads within a given context based on the priority of each workload in the context. This step is repeated, or performed in parallel, for each context supported by the coprocessor. In some examples, the number of contexts depends on the number of queues on the coprocessor. For example, a coprocessor may have two queues that each correspond to one of the contexts. For coprocessors implementing multiple contexts, method 1300 optionally proceeds to block 1306 to sort the contexts based on the priority of the queue associated with each context. The queue with the highest priority is scheduled and allocated first in the resulting list of queued contexts. For a single-queue coprocessor, block 1306 is not needed because the coprocessor computing resources receive a single queue containing the list of workloads scheduled for execution. Once the contexts are selected based on the priority of the queues, the computing resources begin executing the respective queues in the selected contexts based on the priorities of the workloads within those contexts. The priority ordering of each queue and/or context is determined for a point in time and may be further updated or adjusted as additional workload requests become available.
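A minimal sketch of blocks 1302-1306, assuming a higher number means a higher priority for both workloads and queues, is shown below; the data layout is an illustrative assumption.

def build_schedule(workloads, queue_priority):
    """Bin workloads by queue/context, sort within each, then order the contexts."""
    contexts = {}
    for w in workloads:                                   # block 1302
        contexts.setdefault(w["queue"], []).append(w)
    for queue in contexts:                                # block 1304
        contexts[queue].sort(key=lambda w: w["priority"], reverse=True)
    ordered = sorted(contexts, key=queue_priority.get, reverse=True)  # block 1306
    return [(q, [w["name"] for w in contexts[q]]) for q in ordered]

workloads = [{"name": "W1", "queue": "Q1", "priority": 2},
             {"name": "W2", "queue": "Q2", "priority": 9},
             {"name": "W3", "queue": "Q1", "priority": 7}]
print(build_schedule(workloads, {"Q1": 1, "Q2": 5}))
# -> [('Q2', ['W2']), ('Q1', ['W3', 'W1'])]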
Fig. 14A-14C depict a flowchart illustrating an exemplary method for scheduling and distributing workloads. Methods 1400A-1400C may be implemented via the techniques described with respect to fig. 1-13, but may also be implemented via other techniques. In particular, the methods 1400A-1400C may be performed sequentially as described further below. Furthermore, some of the steps in methods 1400A-1400C may be optional, depending on the architecture and coupling between the processor and the coprocessor. For ease of explanation, the blocks of the flow diagrams are arranged in a generally sequential manner; however, it is to be understood that such an arrangement is merely exemplary, and it is to be appreciated that the processes associated with the methods described herein (and the blocks shown in the figure) may occur in different orders (e.g., with at least some of the processes associated with the blocks performed in a parallel manner and/or in an event-driven manner).
Beginning at block 1402, the method 1400A selects the highest priority context of a plurality of contexts, each of the plurality of contexts including a plurality of workloads to be scheduled and allocated to processing resources of the coprocessor. Method 1400A then proceeds to block 1403 and determines, for the workloads in the given context, whether a higher priority workload remains in the context. If there is no higher priority workload in the context, method 1400A optionally proceeds to block 1404 to dispatch one or more clusters to perform work as defined by a coprocessor allocation policy, examples of which are described above. Additionally or alternatively, method 1400A terminates at block 1408.
When a higher priority workload remains in the context, the method 1400A instead proceeds to block 1406 and prepares the highest priority workload in the context for execution. Method 1400A may then proceed to indicator block A (block 1410) and continue to method 1400B.
From indicator block A (block 1410), method 1400B proceeds to block 1411 and determines whether there is sufficient space on the coprocessor to execute the higher priority workload prepared at block 1406. If so, the method 1400B proceeds to block 1418 and launches the higher priority workload on the coprocessor. In examples where multiple queues are supported, the workloads may be distributed among the queues before the associated workloads of the queues are executed on the coprocessor.
If there is insufficient space available on the coprocessor, method 1400B optionally proceeds to block 1412 by determining whether any workload currently executing or scheduled on the coprocessor is using additional clusters, which may be determined based on scheduling parameters associated with the task, partition, or timing window (including parameters such as budget, priority rules, and the requested number of clusters). If neither an executing nor a scheduled workload utilizes additional clusters, the method 1400B proceeds to block 1416 and preempts lower priority workloads based on the preemption policy until there are enough clusters for the higher priority workload to execute. From there, method 1400B may proceed back to block 1411 to determine whether there is sufficient space on the coprocessor. Otherwise, if there is a workload that utilizes additional clusters on the coprocessor, the method 1400B optionally proceeds to block 1413 by instead preempting the workload that is using the additional clusters. The method 1400B then optionally determines again, at block 1414, whether there is sufficient space on the coprocessor to launch the higher priority workload after the additional clusters are preempted. If not, the method 1400B proceeds to block 1416 and preempts lower priority workloads based on the preemption policy until there are enough clusters for the higher priority workload to execute. However, if sufficient space is available at block 1414, the method 1400B proceeds to block 1418 and launches the higher priority workload on the coprocessor. Method 1400B may then proceed to indicator block B (block 1420) and continue to method 1400C.
Beginning at indicator block B (block 1420), method 1400C proceeds to block 1421 and determines whether there are any free or available clusters on the coprocessor. If there are no free clusters (all clusters are currently processing workloads), then the method 1400C ends at block 1428. If there are idle clusters on the coprocessor, method 1400C then optionally proceeds to block 1422 to determine whether there is sufficient space available for the next highest priority workload. If there is enough space available on the free clusters to process the next workload, the method 1400C proceeds to block 1426 and prepares that highest priority job for execution on at least one of the free clusters. However, if there is a free cluster but insufficient space to execute the next highest priority workload at block 1422, then method 1400C optionally proceeds to block 1424 to dispatch the free cluster based on the coprocessor allocation policy. For example, rather than executing the next highest priority workload, the idle cluster may be dispatched, by the policy-distributed coprocessor policy or any of the other coprocessor allocation policies described herein, to help process a workload currently executing on the other clusters. Method 1400C may then end at block 1428.
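By way of illustration only, the room-making portion of method 1400B (blocks 1411-1416) can be sketched as follows, preempting extra-cluster users first and then the lowest priority workloads until enough clusters are free; the list-based cluster model and field names are assumptions made for the example.

def make_room(running, free_clusters, needed):
    """running: list of dicts with 'name', 'priority', 'clusters' held, and an
    'extra' flag marking workloads granted more clusters than they requested."""
    # Preempt extra-cluster users first (block 1413), then the lowest priority
    # workloads (block 1416), until enough clusters are free for the new workload.
    for w in sorted(running, key=lambda w: (not w["extra"], w["priority"])):
        if free_clusters >= needed:
            break
        running.remove(w)
        free_clusters += w["clusters"]
    return free_clusters >= needed, free_clusters

running = [{"name": "low", "priority": 1, "clusters": 2, "extra": True},
           {"name": "mid", "priority": 4, "clusters": 1, "extra": False}]
ok, free = make_room(running, free_clusters=1, needed=3)
print(ok, free, [w["name"] for w in running])   # -> True 3 ['mid']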
The methods and techniques described here may be implemented in digital electronic circuitry, or with a programmable processor (e.g., a special-purpose processor or a general-purpose processor such as a computer), firmware, software, or various combinations thereof. An apparatus embodying these techniques may include appropriate input and output devices, a programmable processor, and a storage medium tangibly embodying program instructions for execution by the programmable processor. A process embodying these techniques may be performed by a programmable processor executing a program of instructions to perform desired functions by operating on input data and generating output. The techniques may advantageously be implemented in one or more programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and Digital Video Discs (DVDs). Any of the foregoing may be supplemented by, or incorporated in, specially designed application-specific integrated circuits (ASICs).
Exemplary embodiments
Embodiment 1 includes a processing system comprising: a processor; a coprocessor configured to implement a processing engine; a processing engine scheduler configured to schedule a workload for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor and, in response, generate at least one launch request for submission to the coprocessor based on the coprocessor scheduling policy; wherein based on the coprocessor scheduling policy, the processing engine scheduler selects which coprocessor clusters to activate to execute the workload identified by the queue based on the at least one launch request; and wherein the coprocessor scheduling policy defines at least one of: a tightly coupled coprocessor schedule, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within a time of a timing window in which the one or more tasks execute on the processor; or a close-coupled coprocessor schedule, wherein the workload identified by the at least one launch request is scheduled to be executed on the coprocessor based on a priority order and by any one of: during a subsequent timing window subsequent to the time of the timing window in which the one or more tasks are executing on the processor, or with respect to an external event common to both the processor and the coprocessor.
Embodiment 2 includes the processing system of embodiment 1 wherein the coprocessor scheduling policy defines loosely coupled coprocessor scheduling, wherein the workload identified by the at least one launch request is scheduled to be executed independent of a timing window of the one or more tasks executing on the processor and based on a priority order of the workload.
Embodiment 3 includes the processing system of any of embodiments 1-2, wherein the processor includes a Central Processing Unit (CPU) including at least one processing core, and the coprocessor includes a Graphics Processing Unit (GPU), a processing accelerator, a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
Embodiment 4 includes the processing system of any of embodiments 1-3, wherein the processor includes a plurality of processor cores, and wherein the processing engine scheduler is configured to generate at least one launch request that schedules a workload associated with one of the plurality of processor cores to a plurality of clusters of the coprocessor for execution.
Embodiment 5 includes the processing system of any of embodiments 1-4, wherein the workload identified by the at least one launch request is scheduled to be executed during a subsequent timing window after the time of the timing window in which the one or more tasks are executed on the processor and/or during a subsequent data frame boundary of the processor.
Embodiment 6 includes the processing system of any of embodiments 1-5, wherein the coprocessor scheduling policy includes a preemption policy defining a coupled coprocessor schedule in which one or more workloads scheduled for execution or currently executing on the coprocessor are configured to be preempted by one or more workloads queued for execution based on the priority order.
Embodiment 7 includes the processing system of embodiment 6, wherein the one or more workloads currently executing on the coprocessor are configured to be preempted by one or more higher priority workloads queued for execution, and wherein the coprocessor is configured to: store the one or more workloads currently executing on the coprocessor; and reschedule the stored one or more workloads for execution during a subsequent timing window after the higher priority workload has been executed.
Embodiment 8 includes the processing system of any of embodiments 6-7, wherein the preemption policy defines at least one of: a coupled coprocessor schedule in which one or more workloads currently executing on the coprocessor are configured to complete execution and subsequently queued workloads are configured to be preempted by higher priority workloads; a coupled coprocessor schedule wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads; a coupled coprocessor schedule wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads, wherein the one or more workloads include an indicator identifying a portion of the respective workloads that has been executed, and wherein the one or more workloads are configured to be stored and re-executed starting from the indicator; or a coupled coprocessor schedule wherein the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions, and each of the plurality of sub-portions is configured to be preempted by higher priority workloads.
Embodiment 9 includes the processing system of any of embodiments 1-8, wherein the processing engine comprises a computing engine, a rendering engine, or an Artificial Intelligence (AI) inference engine, and wherein the processing engine scheduler comprises a computing engine scheduler, a rendering engine scheduler, or an inference engine scheduler.
Embodiment 10 includes a coprocessor configured to be coupled to a processor and configured to implement a processing engine, the coprocessor comprising: at least one cluster configured to execute a workload; a processing engine scheduler configured to schedule a workload for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks executing on the processor and, in response, generate at least one launch request for submission based on a coprocessor scheduling policy; wherein based on the coprocessor scheduling policy, the processing engine scheduler selects which of the at least one cluster to activate to execute the workload identified by the queue comprising the at least one launch request; and wherein the coprocessor scheduling policy defines at least one of: a tightly coupled coprocessor schedule, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within a time of a timing window in which the one or more tasks execute on the processor; or a close-coupled coprocessor schedule, wherein the workload identified by the at least one launch request is scheduled to be executed on the coprocessor based on a priority order and by any one of: during a subsequent timing window subsequent to the time of the timing window in which the one or more tasks are executing on the processor, or with respect to an external event common to both the processor and the coprocessor.
Embodiment 11 includes the co-processor of embodiment 10, wherein the processing engine comprises a computing engine, a rendering engine, or an Artificial Intelligence (AI) inference engine, and wherein the processing engine scheduler comprises a computing engine scheduler, a rendering engine scheduler, or an inference engine scheduler.
Embodiment 12 includes the coprocessor of any of embodiments 10-11 wherein the coprocessor scheduling policy defines a loosely coupled coprocessor schedule in which the workload identified by the at least one launch request is scheduled to be executed independent of a timing window of the one or more tasks executing on the processor and based on a priority order of the workload.
Embodiment 13 includes the coprocessor of any of embodiments 10-12 wherein the processing engine scheduler is configured to generate at least one launch request that schedules a workload associated with one of a plurality of processing cores of the processor to a plurality of clusters of the coprocessor for execution.
Embodiment 14 includes the coprocessor of any of embodiments 10-13 wherein the workload identified by the at least one launch request is scheduled to be executed during a subsequent timing window after the time of the timing window in which the one or more tasks are executed on the processor and/or during a subsequent data frame boundary of the processor.
Embodiment 15 includes the coprocessor of any of embodiments 10-14, wherein the coprocessor scheduling policy includes a preemption policy defining a coupled coprocessor schedule in which one or more workloads scheduled for execution or currently executing on the coprocessor are configured to be preempted by one or more workloads queued for execution based on the priority order.
Embodiment 16 includes the coprocessor of embodiment 15, wherein the one or more workloads currently executing on the coprocessor are configured to be preempted by one or more higher priority workloads queued for execution, and wherein the coprocessor is configured to: store the one or more workloads currently executing on the coprocessor; and reschedule the stored one or more workloads for execution during a subsequent timing window after the higher priority workload has been executed.
Embodiment 17 includes the coprocessor of any of embodiments 15-16, wherein the preemption policy defines at least one of: a coupled coprocessor schedule in which one or more workloads currently executing on the coprocessor are configured to complete execution and subsequently queued workloads are configured to be preempted by higher priority workloads; a coupled coprocessor schedule wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads; a coupled coprocessor schedule wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads, wherein the one or more workloads include an indicator identifying a portion of the respective workloads that has been executed, and wherein the one or more workloads are configured to be stored and re-executed starting from the indicator; or a coupled coprocessor schedule wherein the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions, and each of the plurality of sub-portions is configured to be preempted by higher priority workloads.
Embodiment 18 includes a method comprising: receiving one or more workload launch requests from one or more tasks executing on a processor, wherein the one or more workload launch requests include one or more workloads configured for execution on a coprocessor; generating at least one launch request in response to the one or more workload launch requests based on a coprocessor scheduling policy; and scheduling one or more workloads identified in the at least one launch request for execution on the coprocessor based on the coprocessor scheduling policy by at least one of: tightly-coupled coprocessor scheduling, wherein a workload identified by the at least one launch request is scheduled to execute immediately on the coprocessor within the timing window in which the one or more tasks execute on the processor; or coupled coprocessor scheduling, wherein the workload identified by the at least one launch request is scheduled to be executed on the coprocessor based on a priority order and either: during a subsequent timing window after the timing window in which the one or more tasks execute on the processor, or with respect to an external event common to both the processor and the coprocessor.
Embodiment 19 includes the method of embodiment 18, comprising preempting, by one or more workloads queued for execution based on the priority order, at least one workload scheduled for execution or currently executing on the coprocessor.
Embodiment 20 includes the method of embodiment 19, wherein preempting at least one workload scheduled for execution or currently executing on the coprocessor comprises: coupled coprocessor scheduling, wherein one or more workloads currently executing on the coprocessor are configured to run to completion and subsequently queued workloads are preempted by higher priority workloads; coupled coprocessor scheduling, wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads; coupled coprocessor scheduling, wherein one or more workloads currently executing on the coprocessor are configured to be preempted by higher priority workloads, wherein the one or more workloads include an indicator identifying the portion of the respective workload that has already been executed, and wherein the one or more workloads are configured to be stored and re-executed starting from the indicator; or coupled coprocessor scheduling, wherein the one or more workloads scheduled for execution are partitioned into a plurality of sub-portions, and each of the plurality of sub-portions is configured to be preempted by higher priority workloads.
Embodiment 21 includes a processing system comprising: a processor; a coprocessor configured to implement a processing engine; and a processing engine scheduler configured to schedule a workload for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks being executed or already executed on the processor and, in response, generate at least one launch request for submission to the coprocessor; wherein the coprocessor comprises a plurality of computing units and at least one command streamer associated with one or more of the plurality of computing units; wherein, based on a coprocessor allocation policy, the processing engine scheduler is configured to allocate, via the at least one command streamer, clusters of computing units of the coprocessor for a given execution partition to execute one or more workloads identified by the one or more workload launch requests according to workload priorities; wherein the coprocessor allocation policy defines at least: an exclusive allocation policy, wherein each workload is executed by a dedicated cluster of computing units; an interleaved allocation policy, wherein each workload is exclusively executed across all computing units of the clusters of computing units; a policy-distributed allocation policy, wherein each workload is individually allocated to at least one of the clusters of computing units and to an execution duration during the given execution partition; or a shared allocation policy, wherein each workload is not exclusively executed by the clusters of computing units, which each execute multiple workloads simultaneously.
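A toy illustration of the four allocation policies follows; `allocate`, the policy strings, and the cluster numbering are invented for the sketch, and the time-multiplexing that distinguishes the interleaved policy (workloads run one after another on all compute units) from the shared policy (workloads run concurrently) is only indicated in comments.

```python
from typing import Dict, List

Cluster = int
Workload = str

def allocate(policy: str, workloads: List[Workload],
             clusters: List[Cluster]) -> Dict[Workload, List[Cluster]]:
    """Toy allocation for one execution partition; names and shapes are illustrative only."""
    if policy == "exclusive":
        # One dedicated cluster per workload (assumes len(workloads) <= len(clusters)).
        return {w: [clusters[i]] for i, w in enumerate(workloads)}
    if policy == "interleaved":
        # Every workload gets all compute units, one workload at a time.
        return {w: list(clusters) for w in workloads}
    if policy == "distributed":
        # Policy-distributed: each workload gets a cluster subset plus (implicitly)
        # an execution duration within the partition.
        width = max(1, len(clusters) // max(1, len(workloads)))
        return {w: clusters[i * width:(i + 1) * width] for i, w in enumerate(workloads)}
    if policy == "shared":
        # No exclusivity: clusters run several workloads concurrently.
        return {w: list(clusters) for w in workloads}
    raise ValueError(f"unknown policy {policy!r}")
```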
Embodiment 22 includes the processing system of embodiment 21, wherein the processor includes a Central Processing Unit (CPU) including at least one processing core, and the coprocessor includes a Graphics Processing Unit (GPU), a processing accelerator, a Field Programmable Gate Array (FPGA), or an Application Specific Integrated Circuit (ASIC).
Embodiment 23 includes the processing system of any of embodiments 21-22, wherein the at least one command streamer is configured to receive workloads from a shared context that includes workloads associated with a plurality of tasks being executed or already executed on the processor, and wherein the at least one command streamer is configured to assign workloads associated with a first task of the plurality of tasks to a first set of clusters and to assign workloads associated with a second task of the plurality of tasks to a second set of clusters different from the first set of clusters.
Embodiment 24 includes the processing system of any of embodiments 21-23, wherein the at least one command streamer comprises a plurality of command streamers configured to receive workloads from a shared context, wherein the shared context comprises a plurality of queues; wherein a first command streamer of the plurality of command streamers is configured to: receive, from a first queue of the plurality of queues, a first workload associated with a first task being executed or already executed on the processor, and assign the first workload to a first set of clusters of computing units; and wherein a second command streamer of the plurality of command streamers is configured to: receive, from a second queue of the plurality of queues different from the first queue, a second workload associated with a second task, different from the first task, being executed or already executed on the processor; and assign the second workload to a second set of clusters of computing units different from the first set of clusters of computing units.
Embodiment 25 includes the processing system of any of embodiments 21-24, wherein the at least one command streamer comprises a plurality of command streamers; wherein a first command streamer of the plurality of command streamers is configured to: receive, from a first queue of a first context, a first workload associated with a first task being executed or already executed on the processor, and assign the first workload to a first set of clusters of computing units; and wherein a second command streamer of the plurality of command streamers is configured to: receive, from a second queue of a second context different from the first context, a second workload associated with a second task, different from the first task, being executed or already executed on the processor; and assign the second workload to a second set of clusters of computing units different from the first set of clusters of computing units.
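The command streamer arrangement of Embodiments 24-25 can be pictured as independent front-ends, each draining its own queue into a disjoint cluster set. The sketch below is illustrative only; `CommandStreamer`, `drain`, and the queue contents are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

@dataclass
class CommandStreamer:
    """Hypothetical front-end that drains one queue and feeds a fixed set of clusters."""
    name: str
    clusters: Tuple[int, ...]   # intended to be disjoint from every other streamer's set

    def drain(self, queue: Deque[str]) -> List[str]:
        dispatched = []
        while queue:
            workload = queue.popleft()
            dispatched.append(f"{self.name}: {workload} -> clusters {self.clusters}")
        return dispatched

# Two queues (from one shared context or from two separate contexts) feed two streamers
# that own non-overlapping cluster sets, as in Embodiments 24-25.
queue_a: Deque[str] = deque(["task1.kernelA", "task1.kernelB"])
queue_b: Deque[str] = deque(["task2.kernelC"])
cs0 = CommandStreamer("CS0", clusters=(0, 1))
cs1 = CommandStreamer("CS1", clusters=(2, 3))
for line in cs0.drain(queue_a) + cs1.drain(queue_b):
    print(line)
```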
Embodiment 26 includes the processing system of any of embodiments 21-25, wherein the coprocessor allocation policy includes a preemption policy defining that one or more workloads that have been allocated to, or are being allocated to, one or more clusters of the clusters of computing units are configured to be preempted by one or more workloads queued for allocation for execution on the coprocessor based on the workload priority.
Embodiment 27 includes the processing system of any of embodiments 21-26, wherein: the processing engine comprises a compute engine and the processing engine scheduler comprises a compute engine scheduler; the processing engine comprises a rendering engine and the processing engine scheduler comprises a rendering engine scheduler; or the processing engine comprises an Artificial Intelligence (AI) inference engine and the processing engine scheduler comprises an inference engine scheduler.
Embodiment 28 includes the processing system of any of embodiments 21-27, wherein the processing engine scheduler is configured to allocate one or more of the clusters of computing units to execute the workload based on an amount of processing required to complete the workload.
Embodiment 29 includes the processing system of embodiment 28 wherein the processing engine scheduler is configured to allocate one or more additional ones of the clusters of computing units to execute the workload to compensate for a situation in which the amount of processing required to complete the workload exceeds currently available processing resources on the coprocessor.
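One possible sizing rule consistent with Embodiments 28-29 is sketched below, assuming work is measured in abstract "work units" per timing window; the function name and parameters are hypothetical, not part of the claimed system.

```python
def clusters_needed(work_units: int, units_per_cluster_per_window: int,
                    base_clusters: int, free_clusters: int) -> int:
    """Illustrative sizing rule: grow the allocation only when the base grant cannot
    finish the workload in one window and spare clusters exist (Embodiments 28-29)."""
    # Ceiling division: how many clusters this workload needs in the current window.
    required = -(-work_units // units_per_cluster_per_window)
    if required <= base_clusters:
        return base_clusters
    extra = min(required - base_clusters, free_clusters)
    return base_clusters + extra

# e.g. 10 work units, 2 units/cluster/window, 3 base clusters, 4 spare -> 5 clusters
assert clusters_needed(10, 2, 3, 4) == 5
```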
Embodiment 30 includes a coprocessor configured to be coupled to a processor and configured to implement a processing engine, the coprocessor comprising: a plurality of computing units each configured to execute a workload; and a processing engine scheduler configured to allocate a workload for execution on the coprocessor; wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks being executed or already executed on the processor and, in response, generate at least one launch request for submission to the coprocessor; wherein the coprocessor includes at least one command streamer associated with one or more of the plurality of computing units; wherein, based on a coprocessor allocation policy, the processing engine scheduler is configured to allocate, via the at least one command streamer, clusters of computing units of the coprocessor for a given execution partition to execute one or more workloads identified by the one or more workload launch requests according to workload priorities; wherein the coprocessor allocation policy defines at least: an exclusive policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units; an interleaved policy, wherein each workload is exclusively executed across all computing units of at least one of the clusters of computing units; a policy-distributed policy, wherein each workload is individually allocated to at least one of the clusters of computing units and to an execution duration during the given execution partition; or a shared policy, wherein each workload is not exclusively executed by the clusters of computing units, which each execute multiple workloads simultaneously.
Embodiment 31 includes the coprocessor of embodiment 30, wherein the coprocessor allocation policy defines two or more of the exclusive policy, the interleaved policy, the policy-distributed policy, or the shared policy, and wherein the processing engine scheduler is configured to adjust the coprocessor allocation policy from a first policy to a second policy at a subsequent timing boundary associated with the processor.
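Deferring a policy change to the next timing boundary, as in Embodiment 31, can be sketched as a two-phase request/apply step; the class below is illustrative only, not the claimed mechanism, and its names are invented.

```python
class PolicySwitcher:
    """Sketch of deferring an allocation-policy change to the next timing boundary."""
    def __init__(self, active_policy: str):
        self.active_policy = active_policy
        self._pending = None

    def request(self, new_policy: str) -> None:
        # The change is only recorded; it must not take effect mid-partition.
        self._pending = new_policy

    def on_timing_boundary(self) -> str:
        # Applied atomically at the boundary associated with the processor's schedule.
        if self._pending is not None:
            self.active_policy, self._pending = self._pending, None
        return self.active_policy

switcher = PolicySwitcher("exclusive")
switcher.request("shared")
assert switcher.active_policy == "exclusive"      # unchanged until the boundary
assert switcher.on_timing_boundary() == "shared"  # applied at the boundary
```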
Embodiment 32 includes the coprocessor of any of embodiments 30-31, wherein the processing engine scheduler is configured to determine unused processing resources that were allocated to completed workloads and to allocate subsequently queued workloads for execution on the coprocessor based on the unused processing resources and the processing resources allocated to the subsequently queued workloads.
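A simple backfill rule consistent with Embodiment 32 might hand clusters freed by completed workloads to whichever queued workloads fit, in priority order; the sketch below uses invented parameter names and cluster counts.

```python
from typing import List, Tuple

def backfill(total_clusters: int, busy_clusters: int,
             queued_needs: List[Tuple[str, int]]) -> List[str]:
    """Offer resources left idle by completed workloads to subsequently queued workloads
    whose cluster requirement fits; purely illustrative of Embodiment 32."""
    available = total_clusters - busy_clusters   # clusters not (or no longer) in use
    launched = []
    for workload, need in queued_needs:          # queue assumed already priority-ordered
        if need <= available:
            launched.append(workload)
            available -= need
    return launched

# 8 clusters, 5 still busy after two workloads completed -> 3 available.
assert backfill(8, 5, [("w1", 2), ("w2", 2), ("w3", 1)]) == ["w1", "w3"]
```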
Embodiment 33 includes the coprocessor of any of embodiments 30-32, wherein the exclusive policy defines at least one of an exclusive access policy and an exclusive slice policy; wherein the exclusive access policy defines an allocation policy wherein each workload is allocated to all of the clusters of computing units; and wherein the exclusive slice policy defines an allocation policy: wherein a workload associated with a first task being executed or already executed on the processor is assigned to a first plurality of clusters, and a workload associated with a second task being executed or already executed on the processor is assigned to a second plurality of clusters; and/or wherein a workload associated with a first task being executed or already executed on the processor is assigned to a first portion of a cluster, and a workload associated with a second task being executed or already executed on the processor is assigned to a second portion of the cluster.
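The difference between the exclusive access and exclusive slice policies of Embodiment 33 is easiest to see as two mapping functions; the function names and the even slicing rule below are assumptions made for the sketch, not the claimed allocation.

```python
from typing import Dict, List

def exclusive_access(tasks: List[str], clusters: List[int]) -> Dict[str, List[int]]:
    """Exclusive access: whichever task's workload runs, it owns every cluster."""
    return {t: list(clusters) for t in tasks}

def exclusive_slice(tasks: List[str], clusters: List[int]) -> Dict[str, List[int]]:
    """Exclusive slice: each task owns a fixed, non-overlapping slice of the clusters."""
    width = max(1, len(clusters) // max(1, len(tasks)))
    return {t: clusters[i * width:(i + 1) * width] for i, t in enumerate(tasks)}

# Two tasks on a four-cluster coprocessor (numbers are illustrative only):
print(exclusive_access(["taskA", "taskB"], [0, 1, 2, 3]))  # both map to [0, 1, 2, 3]
print(exclusive_slice(["taskA", "taskB"], [0, 1, 2, 3]))   # taskA -> [0, 1], taskB -> [2, 3]
```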
Embodiment 34 includes the coprocessor of any of embodiments 30-33, wherein the coprocessor allocation policy includes a preemption policy defining that one or more workloads that have been allocated to, or are being allocated to, one or more clusters of the clusters of computing units are configured to be preempted by one or more workloads queued for allocation for execution based on the workload priority.
Embodiment 35 includes the coprocessor of any of embodiments 30-34, wherein: the processing engine comprises a compute engine and the processing engine scheduler comprises a compute engine scheduler; the processing engine comprises a rendering engine and the processing engine scheduler comprises a rendering engine scheduler; or the processing engine comprises an Artificial Intelligence (AI) inference engine and the processing engine scheduler comprises an inference engine scheduler.
Embodiment 36 includes the coprocessor of any of embodiments 30-35 wherein the processing engine scheduler is configured to allocate one or more of the clusters of computing units to execute the workload based on an amount of processing required to complete the workload.
Embodiment 37 includes the coprocessor of embodiment 36 wherein the processing engine scheduler is configured to allocate one or more additional ones of the clusters of computing units to execute the workload to compensate for the situation in which the amount of processing required to complete the workload exceeds the currently available processing resources on the coprocessor.
Embodiment 38 includes a method comprising: receiving one or more workload launch requests from one or more tasks being executed or already executed on a processor, wherein the one or more workload launch requests include one or more workloads configured for execution on a coprocessor; generating at least one launch request in response to the one or more workload launch requests; and assigning clusters of computing units of the coprocessor to execute the one or more workloads identified in the one or more workload launch requests according to workload priorities based on a coprocessor allocation policy, wherein the coprocessor allocation policy defines at least: an exclusive policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units; an interleaved policy, wherein each workload is exclusively executed across all computing units of at least one of the clusters of computing units; a policy-distributed policy, wherein each workload is individually allocated to at least one of the clusters of computing units and to an execution duration during a given execution partition; or a shared policy, wherein each workload is not exclusively executed by the clusters of computing units, which each execute multiple workloads simultaneously.
Embodiment 39 includes the method of embodiment 38, comprising preempting, by one or more workloads queued for execution based on the workload priority, at least one workload scheduled for execution or currently executing on the coprocessor.
Embodiment 40 includes the method of any of embodiments 38-39, wherein assigning the clusters of computing units of the coprocessor to execute the one or more workloads identified in the one or more workload launch requests includes assigning one or more additional clusters of the clusters of computing units to execute the one or more workloads in order to compensate for a situation in which the amount of processing required to complete the one or more workloads exceeds currently available processing resources on the coprocessor.
Although specific embodiments have been illustrated and described herein, those of ordinary skill in the art will appreciate that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. It is manifestly therefore intended that this invention be limited only by the claims and the equivalents thereof.

Claims (3)

1. A processing system, comprising:
a processor;
a coprocessor configured to implement a processing engine;
a processing engine scheduler configured to schedule a workload for execution on the coprocessor;
wherein the processing engine scheduler is configured to receive one or more workload launch requests from one or more tasks being executed or already executed on the processor and, in response, generate at least one launch request for submission to the coprocessor;
wherein the coprocessor comprises a plurality of computing units and at least one command streamer associated with one or more of the plurality of computing units;
wherein, based on a coprocessor allocation policy, the processing engine scheduler is configured to allocate, via the at least one command streamer, clusters of computing units of the coprocessor for a given execution partition to execute one or more workloads identified by the one or more workload launch requests according to workload priorities;
wherein the coprocessor allocation policy defines at least:
an exclusive allocation policy, wherein each workload is executed by a dedicated cluster of computing units;
an interleaved allocation policy, wherein each workload is exclusively executed across all computing units of the clusters of computing units;
a policy-distributed allocation policy, wherein each workload is individually allocated to at least one of the clusters of computing units and to an execution duration during the given execution partition; or
a shared allocation policy, wherein each workload is not exclusively executed by the clusters of computing units, which each execute multiple workloads simultaneously.
2. The processing system of claim 1, wherein the coprocessor allocation policy defines two or more of the exclusive allocation policy, the interleaved allocation policy, the policy-distributed allocation policy, or the shared allocation policy, and wherein the processing engine scheduler is configured to adjust the coprocessor allocation policy from a first policy to a second policy at a subsequent timing boundary associated with the processor.
3. A method, comprising:
receiving one or more workload launch requests from one or more tasks executing on a processor, wherein the one or more workload launch requests include one or more workloads configured for execution on a coprocessor;
generating at least one launch request in response to the one or more workload launch requests;
assigning clusters of computing units of the coprocessor to execute the one or more workloads identified in the one or more workload launch requests according to workload priorities based on a coprocessor allocation policy,
wherein the coprocessor allocation policy defines at least:
an exclusive policy, wherein each workload is executed by a dedicated cluster of the clusters of computing units;
an interleaved policy, wherein each workload is exclusively executed across all computing units of at least one of the clusters of computing units;
a policy-distributed policy, wherein each workload is individually allocated to at least one of the clusters of computing units and to an execution duration during a given execution partition; or
a shared policy, wherein each workload is not exclusively executed by the clusters of computing units, which each execute multiple workloads simultaneously.
CN202310106695.0A 2022-03-28 2023-02-13 Processing engine mapping for a time-space partitioned processing system Pending CN116820574A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/705,959 2022-03-28
US17/707,164 2022-03-29
US17/707,164 US20230305888A1 (en) 2022-03-28 2022-03-29 Processing engine mapping for time-space partitioned processing systems

Publications (1)

Publication Number Publication Date
CN116820574A true CN116820574A (en) 2023-09-29

Family

ID=88128042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310106695.0A Pending CN116820574A (en) 2022-03-28 2023-02-13 Processing engine mapping for a time-space partitioned processing system

Country Status (1)

Country Link
CN (1) CN116820574A (en)

Similar Documents

Publication Publication Date Title
US9442760B2 (en) Job scheduling using expected server performance information
US7428485B2 (en) System for yielding to a processor
US8914805B2 (en) Rescheduling workload in a hybrid computing environment
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
KR101686010B1 (en) Apparatus for fair scheduling of synchronization in realtime multi-core systems and method of the same
EP2799990A2 (en) Dynamic virtual machine sizing
US10109030B1 (en) Queue-based GPU virtualization and management system
KR101626378B1 (en) Apparatus and Method for parallel processing in consideration of degree of parallelism
US20140208072A1 (en) User-level manager to handle multi-processing on many-core coprocessor-based systems
US8032884B2 (en) Thread hand off
CN109766180B (en) Load balancing method and device, storage medium, computing equipment and computing system
WO2019217573A1 (en) Task assignment in virtual gpu enabled systems
JP2006244479A (en) System and method for scheduling executable program
EP4254184A1 (en) Processing engine mapping for time-space partitioned processing systems
Sodan Loosely coordinated coscheduling in the context of other approaches for dynamic job scheduling: a survey
US8245229B2 (en) Temporal batching of I/O jobs
Majumder et al. Energy-aware real-time tasks processing for fpga-based heterogeneous cloud
CN116820574A (en) Processing engine mapping for a time-space partitioned processing system
EP4254185A1 (en) Processing engine scheduling for time-space partitioned processing systems
KR20210007417A (en) Multi-core system and controlling operation of the same
Chiang et al. DynamoML: Dynamic Resource Management Operators for Machine Learning Workloads.
Liu et al. Leveraging dependency in scheduling and preemption for high throughput in data-parallel clusters
Sena et al. Harnessing Low-Cost Virtual Machines on the Spot
Afshar et al. Resource sharing in a hybrid partitioned/global scheduling framework for multiprocessors
US11403143B2 (en) DRR-based two stages IO scheduling algorithm for storage system with dynamic bandwidth regulation

Legal Events

Date Code Title Description
PB01 Publication