CN117492973A - Efficient task allocation - Google Patents

Efficient task allocation

Info

Publication number
CN117492973A
CN117492973A (application CN202310956868.8A)
Authority
CN
China
Prior art keywords
task
command
tasks
processor
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310956868.8A
Other languages
Chinese (zh)
Inventor
Alexander Eugene Chalfin
John Wakefield Brothers III
Rune Holm
Samuel James Edward Martin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
Original Assignee
ARM Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARM Ltd filed Critical ARM Ltd
Publication of CN117492973A publication Critical patent/CN117492973A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)
  • Image Processing (AREA)
  • Image Generation (AREA)

Abstract

The invention discloses a method and a processor. The processor comprises a command processing unit configured to receive a command sequence to be executed from a main processor and to generate a plurality of tasks based on the command sequence. The processor also includes a plurality of computing units, each computing unit having a first processing module for performing tasks of a first task type, a second processing module for performing tasks of a second task type different from the first task type, and a local cache shared by at least the first processing module and the second processing module. The command processing unit issues the plurality of tasks to at least one of the plurality of computing units, and the at least one of the plurality of computing units processes at least one of the plurality of tasks.

Description

Efficient task allocation
Background
Technical Field
The present invention relates to a method, a processor, and a non-transitory computer-readable storage medium for managing different task types, such as neural network processing operations and graphics processing operations.
Description of related Art
Neural networks can be used in processes such as machine learning, computer vision, and natural language processing. A neural network may operate on suitable input data (e.g., image or sound data) to ultimately provide a desired output (e.g., recognition of an object in an image or of speech in a sound clip, or some other useful output inferred from the input data). This process is commonly referred to as "inference" or "classification". In a graphics (image) processing context, neural network processing may also be used for image enhancement ("denoising"), segmentation, anti-aliasing, supersampling, and so on, in which case a suitable input image is processed to provide a desired output image.
A neural network will typically process input data (e.g., image or sound data) according to a network of operators, each operator performing a particular operation. These operations are typically performed sequentially to produce the desired output data (e.g., a classification of the image or sound data). Each operation may be referred to as a "layer" of neural network processing. Neural network processing may thus comprise a sequence of "layers", such that the output from each layer is used as the input to the next layer of processing.
In some data processing systems, a dedicated Neural Processing Unit (NPU) is provided as a hardware accelerator, the dedicated neural processing unit being operable to perform such machine learning processes as and when required, for example in response to an application executing on a host processor (e.g., a Central Processing Unit (CPU)) requiring such machine learning processes. Similarly, a dedicated Graphics Processing Unit (GPU) may be provided as a hardware accelerator operable to perform graphics processing. These dedicated hardware accelerators may be provided along the same interconnect (bus) side-by-side with other components such that the main processor is operable to request the hardware accelerator to perform a set of operations accordingly. Thus, the NPU and GPU are dedicated hardware units for performing operations such as machine learning processing operations and graphics processing operations upon request of the host processor.
Disclosure of Invention
According to a first aspect, there is provided a processor comprising: a command processing unit to receive a command sequence to be executed from a main processor and to generate a plurality of tasks based on the command sequence; and a plurality of computing units, wherein at least one computing unit of the plurality of computing units comprises: a first processing module for executing tasks of a first task type generated by the command processing unit; a second processing module for executing tasks of a second task type, different from the first task type, generated by the command processing unit; and a local cache shared by at least the first processing module and the second processing module; wherein the command processing unit is configured to issue the plurality of tasks to at least one of the plurality of computing units, and wherein the at least one of the plurality of computing units is configured to process at least one of the plurality of tasks. This enables tasks to be issued to different processing modules sharing the local cache. It increases the efficiency and resource usage of the processor and reduces component size, since scheduling and job decomposition are undertaken by the command processing unit. Further, the command processing unit issues tasks based on computing unit availability, so that tasks that need to use the same resources (such as where one task generates output data that is the input data of another task) can be scheduled in a manner that makes use of the shared local cache. This reduces memory read/write operations to higher-level/external memory, thereby reducing memory traffic and hence processing time.
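To make this structure concrete, the following C++ sketch models a command processing unit that decomposes a command sequence into tasks and issues them to computing units, each pairing two processing modules with a shared local cache. It is illustrative only; all type and member names, and the modulo placement rule, are assumptions rather than the patented implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class TaskType { Graphics, Neural };

struct Task {
    uint32_t commandId;  // which command in the sequence produced this task
    uint32_t taskId;     // tasks sharing an identifier go to the same computing unit
    TaskType type;
};

struct Command {
    uint32_t id;
    std::vector<Task> tasks;  // derived from the command's metadata
};

// One computing unit: two processing modules sharing one local (L1-like) cache.
struct ComputeUnit {
    std::vector<Task> firstModuleWork;   // e.g. graphics processing module
    std::vector<Task> secondModuleWork;  // e.g. neural processing module
    std::vector<std::byte> localCache;   // shared by both modules

    void accept(const Task& t) {
        (t.type == TaskType::Graphics ? firstModuleWork : secondModuleWork)
            .push_back(t);
    }
};

class CommandProcessingUnit {
public:
    explicit CommandProcessingUnit(std::vector<ComputeUnit>& units) : units_(units) {}

    // Walk the command sequence, decompose each command into its tasks and
    // issue them; tasks with the same task identifier land on the same
    // computing unit so they can share its local cache.
    void process(const std::vector<Command>& commandSequence) {
        for (const Command& cmd : commandSequence)
            for (const Task& task : cmd.tasks)
                unitFor(task.taskId).accept(task);
    }

private:
    ComputeUnit& unitFor(uint32_t taskId) {
        auto [it, inserted] = placement_.try_emplace(taskId, next_ % units_.size());
        if (inserted) ++next_;
        return units_[it->second];
    }

    std::vector<ComputeUnit>& units_;
    std::unordered_map<uint32_t, std::size_t> placement_;
    std::size_t next_ = 0;
};
```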
The command processing unit may issue tasks of the first task type to the first processing module of a given computing unit and issue tasks of the second task type to the second processing module of the given computing unit. This enables different types of tasks to be issued to different processing modules of the computing unit, which increases efficiency because the scheduling and issuing of tasks to individual processing modules is undertaken by the command processing unit rather than by each computing unit and/or the host processor.
The first task type may be a task for undertaking at least a portion of a graphics processing operation forming one of a set of predefined graphics processing operations that collectively enable a graphics processing pipeline, and the second task type may be a task for undertaking at least a portion of a neural processing operation. The graphics processing operations include at least one of: a graphics compute shader task; a vertex shader task; a fragment shader task; a tessellation task; and a geometry shader task. This enables the tasks of a given command in the command sequence to be assigned to the most appropriate processing module based on the type of processing operation.
Each computing unit may be a shader core of a graphics processing unit. This enables commands that include tasks that require both graphics processing and neural processing to be undertaken using a single piece of hardware, thereby reducing the number of memory transactions and the hardware size.
The first processing module may be a graphics processing module and the second processing module may be a neural processing module. This enables efficient sharing of tasks within a single command requiring use of both the neural processing unit and the graphics processing unit, thereby improving efficiency and resource usage.
The command processing unit may further include at least one dependency tracker to track dependencies between commands in the sequence of commands; and wherein the command processing unit is to wait for completion of processing of a given task of a first command in the command sequence using the at least one dependency tracker and then issue an associated task of a second command in the command sequence for processing, wherein the associated task depends on the given task. This enables the command processing unit to issue tasks of commands to a given computing unit in a given order based on whether the given computing unit uses the output of a previous command. This increases efficiency by enabling the task of a given command to use/reuse data stored in the local cache of a given computing unit.
The output of a given task may be stored in a local cache. This increases efficiency by enabling the task of a given command to use/reuse data stored in the local cache of a given computing unit.
Each command in the sequence of commands may have metadata, where the metadata includes at least an indication of the plurality of tasks in the command and a task type associated with each of those tasks. This ensures that the command processing unit can effectively break down commands into tasks and identify their task types, so that it can issue tasks to the desired computing units in the most efficient manner.
The command processing unit may assign a command identifier to each command in the command sequence, and the dependency tracker may track dependencies between commands in the command sequence based on the command identifiers. Furthermore, when a given task of a first command depends on an associated task of a second command, the command processing unit assigns the same task identifier to the given task and the associated task. In addition, the tasks of commands that have been assigned the same task identifier may be executed on the same one of the plurality of computing units. This enables commands, tasks, and their dependencies to be tracked efficiently, thereby improving the efficiency of assignment to individual computing units.
A task assigned a first task identifier may be performed on a first computing unit of the plurality of computing units, and a task assigned a second, different task identifier may be performed on a second computing unit of the plurality of computing units. This enables tasks with different task identifiers to be allocated to different computing units, because such tasks are independent of one another and do not need a shared local cache in order to be performed efficiently.
A task assigned a first task identifier and having the first task type may be executed on the first processing module of a given one of the plurality of computing units, and a task assigned a second, different task identifier and having the second task type may be executed on the second processing module of the given one of the plurality of computing units. This enables tasks with different identifiers and types to be issued to the same computing unit but executed on different processing modules, thereby improving efficiency and maximizing the use of available resources.
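As a rough illustration of this identifier scheme, the sketch below gives dependent tasks the same task identifier and uses command identifiers to decide when a dependent task may be issued. The names and data layout are assumptions made for illustration.

```cpp
#include <cstdint>
#include <unordered_map>

// A task carries the identifier of the command it came from and a task
// identifier; dependent tasks are given the same task identifier so they are
// later issued to the same computing unit.
struct TrackedTask {
    uint32_t commandId;  // assigned per command, in command-sequence order
    uint32_t taskId;     // shared by tasks that feed one another
};

class DependencyTracker {
public:
    // Record that `dependent` consumes the output of `producer`.
    void link(const TrackedTask& producer, TrackedTask& dependent) {
        dependent.taskId = producer.taskId;                     // same unit later
        producerOf_[dependent.commandId] = producer.commandId;  // ordering info
    }

    // A dependent task may only be issued once its producer command completed.
    bool readyToIssue(const TrackedTask& t,
                      const std::unordered_map<uint32_t, bool>& completed) const {
        auto dep = producerOf_.find(t.commandId);
        if (dep == producerOf_.end()) return true;  // no dependency recorded
        auto done = completed.find(dep->second);
        return done != completed.end() && done->second;
    }

private:
    // dependent command identifier -> command identifier it waits on
    std::unordered_map<uint32_t, uint32_t> producerOf_;
};
```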
Each computing unit of the plurality of computing units may further include at least one task queue, wherein the queued tasks represent at least a portion of the command sequence. This enables the computing unit to queue tasks in the sequence generated by the command processing unit, so that tasks can be scheduled according to that sequence and based on resource availability.
A given queue may be associated with at least one task type. This enables queues to be dedicated to particular task types, allowing the computing unit to process different types of tasks and increasing the efficiency of task scheduling.
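One possible shape for such per-type queues is sketched below; the two-queue split and the names are assumptions for illustration, not a description of the actual hardware queues.

```cpp
#include <cstdint>
#include <deque>

enum class TaskType { Graphics, Neural };

struct Task {
    uint32_t taskId;
    TaskType type;
};

// Per-computing-unit queues, one per task type, kept in the order in which the
// command processing unit issued the tasks; each processing module drains the
// queue matching its task type.
struct ComputeUnitQueues {
    std::deque<Task> graphicsQueue;  // drained by the graphics processing module
    std::deque<Task> neuralQueue;    // drained by the neural processing module

    void enqueue(const Task& t) {
        (t.type == TaskType::Graphics ? graphicsQueue : neuralQueue).push_back(t);
    }
};
```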
According to a second aspect, there is provided a method of assigning tasks associated with commands in a sequence of commands, the method comprising: receiving, at a command processing unit, a sequence of commands to be executed from a host processor; generating, at the command processing unit, a plurality of tasks based on the received command sequence; assigning an identifier to each of a plurality of tasks in a given command of the sequence of commands based on metadata associated with each command; and issuing, by the command processing unit, each task to a computing unit of a plurality of computing units for execution, each computing unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module; wherein a task associated with a first command of the plurality of commands and a task associated with a second command of the plurality of commands are assigned to a given computing unit, the tasks each being assigned the same identifier. This increases the efficiency and resource usage of the processor and reduces component size, since scheduling and job decomposition are undertaken by the command processing unit. Further, the command processing unit issues tasks based on computing unit availability, so that tasks that need to use the same resources (such as where one task generates output data that is the input data of another task) can be scheduled in a manner that makes use of the shared local cache. This reduces memory read/write operations to higher-level/external memory, thereby reducing memory traffic and hence processing time.
When the task associated with the second command depends on the task associated with the first command, the command processing unit may wait for processing of the task associated with the first command to complete and then issue the task associated with the second command to the given computing unit. This enables the command processing unit to issue tasks of commands to a given computing unit in a given order based on whether the given computing unit uses the output of a previous command. This increases efficiency by enabling the task of a given command to use/reuse data stored in the local cache of a given computing unit.
The metadata may include at least an indication of the plurality of tasks in a given command and a task type associated with each of the plurality of tasks. This ensures that the command processing unit can effectively break down commands into tasks and identify their task types, so that it can issue tasks to the desired computing units in the most efficient manner.
According to a third aspect, there is provided a non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor, are arranged to allocate tasks associated with commands in a command sequence, wherein the instructions, when executed, cause the at least one processor to: receive, at a command processing unit, the command sequence to be executed from a host processor; generate, at the command processing unit, a plurality of tasks based on the received command sequence; assign an identifier to each of a plurality of tasks in a given command of the sequence of commands based on metadata associated with each command; and issue, by the command processing unit, each task to a computing unit of a plurality of computing units for execution, each computing unit comprising: a first processing module for executing tasks of a first task type; a second processing module for executing tasks of a second task type; and a local cache shared by at least the first processing module and the second processing module; wherein a task associated with a first command of the plurality of commands and a task associated with a second command of the plurality of commands are assigned to a given computing unit, the tasks each being assigned the same identifier. This increases the efficiency and resource usage of the processor and reduces component size, since scheduling and job decomposition are undertaken by the command processing unit. Further, the command processing unit issues tasks based on computing unit availability, so that tasks that need to use the same resources (such as where one task generates output data that is the input data of another task) can be scheduled in a manner that makes use of the shared local cache. This reduces memory read/write operations to higher-level/external memory, thereby reducing memory traffic and hence processing time.
Drawings
Other features and advantages will become apparent from the following description of preferred embodiments, given by way of example only, which is to be read in connection with the accompanying drawings, wherein like reference numerals are used to designate like features.
FIG. 1 is a schematic diagram of a processor according to an embodiment;
FIG. 2 is a schematic diagram of a command processing unit according to an embodiment;
FIG. 3 is a schematic representation of task allocation to a processing module by a command processing unit according to an embodiment;
FIG. 4 is a flow chart of a method for assigning tasks according to an embodiment; and
FIG. 5 is a schematic diagram of a system including features according to an example.
Detailed Description
In some systems, dedicated hardware units, such as a Neural Processing Unit (NPU) and a Graphics Processing Unit (GPU), are provided as distinct hardware accelerators that are operable to perform related processing operations under separate control of a main processor, such as a Central Processing Unit (CPU). For example, the NPU may be operable to perform machine learning processes as and when required, e.g., in response to an application executing on the host processor requiring machine learning processes and issuing instructions to the NPU for execution. For example, the NPU may be provided along the same interconnect line (bus) as other hardware accelerators, such as graphics processors (graphics processing units, GPUs), such that the host processor is operable to request the NPU to perform a set of machine learning processing operations accordingly, e.g., in a manner similar to the host processor being able to request the graphics processor to perform graphics processing operations. Thus, an NPU is a dedicated hardware unit for performing such machine learning processing operations upon request of a main processor (CPU).
It has been recognized that, while not necessarily designed or optimized for this purpose, a graphics processor (GPU) may also be used (or re-purposed) to perform machine learning processing tasks. For example, convolutional neural network processing typically involves a series of multiply-and-accumulate (MAC) operations for multiplying input feature values with the associated weights of a kernel (filter) to determine output feature values. Graphics processor shader cores may be well suited to performing these types of arithmetic operations, because they are generally similar to the arithmetic operations required when performing graphics processing work (but on different data). In addition, graphics processors typically support a high degree of concurrent processing (e.g., supporting large numbers of execution threads) and are optimized for data-plane (rather than control-plane) processing, all of which means that graphics processors may be well suited to performing machine learning processing.
Thus, the GPU may be operable to perform machine learning processing work and, in this case, may be used for any suitable and desired machine learning processing task. The machine learning processing performed by the GPU may thus include general training and inference jobs (which do not involve graphics processing work itself). However, the GPU may also perform machine learning (e.g., inference) jobs as part of graphics processing operations, such as when performing "supersampling" techniques using deep learning, or when performing denoising during a ray tracing process.
However, using a graphics processor to perform machine learning processing tasks may be a relatively inefficient use of resources of the graphics processor, as compared to using a dedicated machine learning processing unit (e.g., NPU), for example, because graphics processors are typically not designed (or optimized) for such tasks and, thus, may result in lower performance. At least where the machine learning process involves graphics processing (rendering) tasks, reusing some of the functional units of the graphics processor to perform the desired machine learning processing operations also prevents those functional units from performing the graphics processing work for which they were designed, which can further reduce the overall performance of the overall (rendering) process.
However, in some cases, it may still be desirable to perform machine learning processing tasks using a graphics processor (e.g., rather than using an external machine learning processing unit, such as an NPU). For example, this may be desirable, e.g., to reduce silicon area and reduce data movement, etc., especially in mobile devices where area and resources may be limited, and thus it may be particularly desirable to be able to perform the desired work using existing and available resources, thereby potentially completely avoiding the need for NPUs. There are other examples where this may be desirable, especially where the machine learning process itself involves graphics processing tasks, and where it may be particularly desirable to free up execution units and other functional units of the graphics processor to perform actual graphics processing operations.
Fig. 1 is a schematic diagram 100 of a processor 130 that provides dedicated circuitry that may be used to perform operations that would normally be undertaken by dedicated hardware accelerators, such as NPUs and GPUs. It should be appreciated that the type of hardware accelerator for which processor 130 may provide dedicated circuitry is not limited to the type of NPU or GPU, but may be dedicated circuitry for any type of hardware accelerator. As described above, GPU shader cores may be well suited to performing these particular types of arithmetic operations (such as neural processing operations) because these operations are generally similar to the arithmetic operations that may be required when performing graphics processing work (but on different data). Furthermore, graphics processors typically support advanced concurrent processing (e.g., support a large number of threads of execution) and are optimized for data plane (rather than control plane) processing, all of which means that the graphics processor may be well suited to perform other types of operations.
That is, rather than using a completely independent hardware accelerator, such as a machine learning processing unit (e.g., an NPU) that is separate from the graphics processor, or performing machine learning processing operations using only the existing hardware of the GPU, dedicated circuitry may be incorporated into the GPU itself.
This means that the hardware accelerator circuitry incorporated into the GPU is operable to utilize some of the existing resources of the GPU (e.g., so that at least some functional units and resources of the GPU may be effectively shared between different hardware accelerator circuits), while still allowing improved (more optimized) performance compared to performing all of the processing on general-purpose execution units.
Thus, in one embodiment, processor 130 may be a GPU adapted to include a plurality of dedicated hardware resources (such as those hardware resources that will be described below).
In some examples, this may be particularly beneficial when performing machine learning tasks that themselves involve graphics processing work, as in that case all associated processing may be (and preferably is) performed locally on the graphics processor, thus improving data locality and, for example, reducing the need for external communication along the interconnect with other hardware units (e.g., NPUs). In this case, at least some of the machine learning processing work may be offloaded to the machine learning processing circuitry, freeing the execution units as needed to perform the actual graphics processing operations.
In other words, in some examples, the machine learning processing circuitry is provided within the graphics processor, meaning that the machine learning processing circuitry is then preferably operable to perform at least some machine learning processing operations while other functional units of the graphics processor are concurrently performing graphics processing operations. In the case where the machine learning process involves a portion of the overall graphics processing task, this may thus improve the overall efficiency (in terms of energy efficiency, throughput, etc.) of the overall graphics processing task.
The processor 130 is arranged to receive a command stream 120 from a main processor 110, such as a CPU. As will be described in further detail below with reference to Fig. 3, the command stream includes at least one command in a given sequence; each command is to be executed and may be decomposed into a plurality of tasks. These tasks may be self-contained operations, such as a given machine learning operation or a graphics processing operation. It should be appreciated that other types of tasks may exist depending on the command.
The command stream 120 is sent by the main processor 110 and received by a command processing unit 140, which is arranged to schedule commands within the command stream 120 according to a sequence of commands. The command processing unit 140 is arranged to schedule commands and decompose each command in the command stream 120 into at least one task. Once the command processing unit 140 has scheduled the commands in the command stream 120 and generated a plurality of tasks for the commands, the command processing unit issues each of the plurality of tasks to at least one computing unit 150a, 150b, each of which is configured to process at least one of the plurality of tasks.
The processor 130 includes a plurality of computing units 150a, 150b. As described above, each computing unit 150a, 150b may be a shader core of a GPU specifically configured to handle a variety of different types of operations; however, it should be understood that other types of specially configured processors may be used, such as a general-purpose processor configured with individual computing units (such as computing units 150a, 150b). Each computing unit 150a, 150b comprises a plurality of components, including at least a first processing module 152a, 152b for performing tasks of a first task type and a second processing module 154a, 154b for performing tasks of a second task type different from the first task type. In some examples, the first processing module 152a, 152b may be a processing module for processing neural processing operations, such as those typically undertaken by a separate NPU as described above. Similarly, the second processing module 154a, 154b may be a processing module for processing graphics processing operations that form a predefined set of graphics processing operations enabling a graphics processing pipeline. For example, such graphics processing operations include graphics compute shader tasks, vertex shader tasks, fragment shader tasks, tessellation shader tasks, and geometry shader tasks. These graphics processing operations may all form part of a predefined set of operations as defined by an application programming interface (API). Examples of such APIs include Vulkan, Direct3D and Metal. Such tasks would typically be undertaken by a separate/external GPU. It should be appreciated that any number of other graphics processing operations may be capable of being processed by the second processing module.
Thus, the command processing unit 140 issues tasks of a first task type to the first processing module 152a, 152b of a given computing unit 150a, 150b and issues tasks of a second task type to the second processing module 154a, 154b of the given computing unit 150a, 150b. Continuing with the above example, the command processing unit 140 will issue machine learning/neural processing tasks to the first processing module 152a, 152b of a given computing unit 150a, 150b, where the first processing module 152a, 152b is optimized to process neural network processing tasks, for example by efficiently handling large numbers of multiply-accumulate operations. Similarly, the command processing unit 140 will issue graphics processing tasks to the second processing module 154a, 154b of a given computing unit 150a, 150b, where the second processing module 154a, 154b is optimized to process such graphics processing tasks.
In addition to including the first processing module 152a, 152b and the second processing module 154a, 154b, each computing unit 150a, 150b also includes memory in the form of local caches 156a, 156b for use by the respective processing module 152a, 152b, 154a, 154b during processing of tasks. An example of such a local cache 156a, 156b is an L1 cache. The local caches 156a, 156b may be, for example, synchronous Dynamic Random Access Memories (SDRAMs). For example, the local caches 156a, 156b may comprise double data rate synchronous dynamic random access memory (DDR-SDRAM). It should be appreciated that the local caches 156a, 156b may include other types of memory.
The local caches 156a, 156b are used to store data relating to the tasks being processed on a given computing unit 150a, 150b by the first processing module 152a, 152b and the second processing module 154a, 154b. They may also be accessed by other processing modules (not shown) forming part of the computing unit 150a, 150b associated with the local cache 156a, 156b. However, in some examples, it may be necessary to provide access to data associated with a given task executing on a processing module of a given computing unit 150a, 150b to a task executing on a processing module of another computing unit (not shown) of the processor 130. In such examples, the processor 130 may also include a cache 160, such as an L2 cache, for providing access to data used by tasks being performed on different computing units 150a, 150b.
By providing local caches 156a, 156b, tasks that have been issued to the same computing unit 150a, 150b can access data stored in the local caches 156a, 156b, whether or not they form part of the same command in the command stream 120. As will be described in further detail below, the command processing unit 140 is responsible for assigning the task of a command to a given computing unit 150a, 150b such that it can most efficiently use available resources, such as local caches 156a, 156b, thereby reducing the number of read/write transactions required by memory external to the computing unit 150a, 150b, such as cache 160 (L2 cache) or higher level memory. One such example is that the task of one command issued to a first processing module 152a of a given computing unit 150a may store its output in a local cache 156a such that it is accessible by a second task of a different (or the same) command issued to a given processing module 152a, 154a of the same computing unit 150 a.
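The cache-reuse pattern described here could be sketched as follows. The buffer naming and the plain byte-vector caches are assumptions made purely to show the lookup order (local cache first, then the L2 cache), not how the hardware actually stores data.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Buffer = std::vector<uint8_t>;

struct LocalCache  { std::unordered_map<std::string, Buffer> lines; };  // per computing unit (L1-like)
struct SharedCache { std::unordered_map<std::string, Buffer> lines; };  // processor-wide (L2-like)

// Producer task: keep the output in the local cache; mirror it to the shared
// cache only if a consumer on another computing unit will need it.
void writeOutput(LocalCache& l1, SharedCache& l2, const std::string& name,
                 Buffer data, bool neededByOtherUnit) {
    if (neededByOtherUnit) l2.lines[name] = data;
    l1.lines[name] = std::move(data);
}

// Consumer task: hit the local cache first, fall back to the shared cache.
const Buffer* readInput(const LocalCache& l1, const SharedCache& l2,
                        const std::string& name) {
    if (auto it = l1.lines.find(name); it != l1.lines.end()) return &it->second;
    if (auto it = l2.lines.find(name); it != l2.lines.end()) return &it->second;
    return nullptr;  // would need a fetch from higher-level memory
}
```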
One or more of the command processing unit 140, the computing units 150a, 150b, and the cache 160 may be interconnected using a bus. This allows data to be transferred between the various components. The bus may be or include any suitable interface or bus. For example, an Arm Advanced Microcontroller Bus Architecture (AMBA) interface, such as the Advanced eXtensible Interface (AXI), may be used.
Fig. 2 is a schematic diagram 200 of the command processing unit 140 according to an embodiment. As described above, the command processing unit 140 forms part of a processor (such as the processor 130) and receives the command stream 120 from a host processor (such as the host processor 110). The command processing unit 140 includes a host interface module 142 for receiving the command stream 120 from the host processor 110. The received command stream 120 is then parsed by a command stream parser module 144. As described above, the command stream 120 includes a sequence of commands in a given order. The command stream parser 144 parses and breaks down the command stream 120 into individual commands and breaks down each command in the command stream 120 into individual tasks 210, 220. The dependency tracker 146 then schedules the tasks 210, 220 and issues them to related computing units, such as computing units 150a, 150b of the processor 130. Although example 200 in fig. 2 illustrates a command processing unit that includes a single dependency tracker, it should be appreciated that in some examples, there may be more than one dependency tracker, such as including a dependency tracker for each type of task.
In some examples, the dependency tracker 146 tracks dependencies between commands in the command stream 120 and schedules and issues the tasks associated with the commands so that the tasks 210, 220 run in the desired order. That is, where task 220 depends on task 210, the dependency tracker 146 will only issue task 220 once task 210 has completed.
To facilitate decomposing the commands in the command stream 120 into tasks, each command in the command stream 120 may include associated metadata. The metadata may include information such as the number of tasks in a given command and the type of those tasks. In some examples, command stream parser 144 may assign a command identifier to each command in command stream 120. The command identifiers may be used to indicate the order in which commands of the command stream 120 are to be processed, so that the dependency tracker may track dependencies between commands and issue tasks of the commands to the necessary computing units 150a, 150b in the required order. Further, once each command in the command stream 120 has been broken up into multiple tasks (such as tasks 210, 220), the dependency tracker 146 may assign each task a given task identifier.
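A minimal sketch of this parsing step is given below, assuming each command's metadata simply lists its task count and per-task types; the actual metadata format is not specified in the text.

```cpp
#include <cstdint>
#include <vector>

enum class TaskType { Graphics, Neural };

struct CommandMetadata {
    uint32_t numTasks;                // number of tasks in the command
    std::vector<TaskType> taskTypes;  // one entry per task
};

struct Command {
    uint32_t commandId;               // assigned in command-sequence order
    CommandMetadata metadata;
};

struct Task {
    uint32_t commandId;
    uint32_t taskId;
    TaskType type;
};

// Expand each command into its individual tasks, preserving sequence order so
// dependencies can be tracked afterwards. Here every task gets a fresh
// identifier; a later dependency pass would give dependent tasks the same one.
std::vector<Task> parseCommandStream(const std::vector<Command>& stream) {
    std::vector<Task> tasks;
    uint32_t nextTaskId = 0;
    for (const Command& cmd : stream)
        for (uint32_t i = 0; i < cmd.metadata.numTasks; ++i)
            tasks.push_back({cmd.commandId, nextTaskId++, cmd.metadata.taskTypes[i]});
    return tasks;
}
```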
As shown in Fig. 2, task 210 is assigned a task identifier of "0" and task 220 is assigned a task identifier of "1". Since tasks 210 and 220 have different task identifiers, the command processing unit 140 may issue these tasks to different computing units 150a, 150b simultaneously. More specifically, because tasks 210 and 220 have different task types, task 210 having type "X" and task 220 having type "Y", they may be issued to different processing modules 152a, 152b, 154a, 154b, where the processing module to which each task is issued corresponds to the type of the task and the particular configuration of the processing modules. For example, where the first processing module 152a, 152b of a given computing unit 150a, 150b is configured to process machine learning operations, then where a task 210, 220 is a machine learning task it may be issued to that processing module 152a, 152b. Similarly, where the second processing module 154a, 154b of a given computing unit 150a, 150b is configured to process graphics processing operations, then where a task 210, 220 is a graphics processing task it may be issued to that processing module 154a, 154b.
Alternatively, where tasks 210, 220 are assigned the same task identifier, the dependency tracker 146 will issue the tasks to the same computing unit 150a, 150b. This enables the tasks to use the local cache 156a, 156b, improving efficiency and resource usage because data does not need to be written to external memory, such as the cache 160 or other higher-level memory. Even if the task types differ, the tasks may be executed by the corresponding processing modules 152a, 152b, 154a, 154b of the same computing unit 150a, 150b. In still other examples, each computing unit 150a, 150b may include at least one task queue for storing tasks representing at least a portion of the commands of a command sequence. Each queue may be specific to a task type and thus correspond to one of the processing modules 152a, 152b, 154a, 154b.
Fig. 3 is a schematic representation 300 of the assignment of tasks 310a, 310b, 320a, 320b to computing units 150a, 150b by the command processing unit 140 according to the above-described examples. The commands 310c, 320c are part of the command stream 120 received at the processor 130 from the host processor 110. Each command 310c, 320c includes two tasks 310a, 310b, 320a, 320b. The command processing unit 140 decomposes the commands 310c, 320c into these tasks and schedules them as described above. For example, where the command 320c depends on the command 310c, the dependent tasks will be assigned to the same computing unit, with the output of one task 310a, 310b being the input of the other task 320a, 320b. As shown in Fig. 3, tasks 310a, 320a are assigned to computing unit 150a, and tasks 310b, 320b are assigned to computing unit 150b. The tasks need not be of the same type; for example, task 310a may be a machine learning operation, such that it is assigned to a processing module 152a configured to perform machine learning operations, while task 320a may be a graphics processing operation, such that it is assigned to a processing module 154a configured to perform graphics processing operations.
Task 320a depends on task 310a, as indicated in Fig. 3, so the command processing unit 140 issues task 320a to computing unit 150a only once task 310a has completed. By issuing task 320a after task 310a is complete, any data required by task 320a that is generated as an output of task 310a may be stored in the local cache 156a of that computing unit. This enables the dependent task 320a to access the required data quickly and efficiently from the local cache 156a without requesting data from external memory, such as the cache 160 (L2 cache) or higher-level memory.
Tasks 310b and 320b do not depend on each other and thus may be assigned to the same computing unit or to different computing units. In the example 300 of Fig. 3, both tasks 310b and 320b are issued to the same computing unit 150b; however, it should be understood that they could have been issued to different computing units. Furthermore, where tasks 310b and 320b have different task types, they may be issued to the different processing modules 152b, 154b of the same computing unit 150b to run substantially simultaneously. Alternatively, they may be issued to different computing units 150a, 150b to run substantially simultaneously.
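The hold-back behaviour shown in Fig. 3 can be mimicked in software as below; the thread and condition-variable signalling is purely an illustrative assumption, since the hardware would use its own completion mechanism, and the issue call is hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Simple completion flag the command processing unit can wait on.
struct CompletionFlag {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void markDone() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done; });
    }
};

int main() {
    CompletionFlag task310aDone;

    // Producer task (310a): compute, leave its output in the local cache of
    // its computing unit, then signal completion.
    std::thread producer([&] {
        // ... perform task 310a, write result into the unit's local cache ...
        task310aDone.markDone();
    });

    // Command processing unit: hold back the dependent task (320a) until the
    // producer has completed, then issue it to the same computing unit so it
    // can read the producer's output from the local cache.
    task310aDone.wait();
    // issueToSameComputeUnit(task320a);  // hypothetical issue call

    producer.join();
    return 0;
}
```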
Fig. 4 is a flow chart 400 of a method for assigning tasks. At step 410, a sequence of commands (such as the command stream 120 described above) is received from the host processor 110 at the command processing unit 140 of the processor 130. As described above, the command stream 120 includes a plurality of commands, each of which includes a plurality of tasks.
After command stream 120 is received at command processing unit 140, command processing unit 140 generates a plurality of tasks. As described above, the command processing unit 140 may generate a plurality of tasks based on metadata associated with each of the commands of the command stream 120. For example, each command in the command stream 120 may be assigned a command identifier that indicates at least dependencies between commands. Each task generated may also have associated metadata such as a task identifier and a task type.
At step 430, after the plurality of tasks are generated, the tasks are issued by the command processing unit 140 to computing units of the plurality of computing units (such as computing units 150a, 150b described above). Tasks may be allocated based on their task identifiers, and issued to a given processing module 152a, 152b, 154a, 154b based on their task type. For example, as described above, machine learning tasks may be issued to the machine learning processing module and graphics processing tasks to the graphics processing module of a given computing unit.
As described above, where a given task depends on the completion of another task, the given task may be issued to the same computing unit as the other task. This enables data required by another task and generated by a given task or data required by both tasks to be stored in a local cache, such as local caches 156a, 156 b. This reduces the number of external memory transactions that need to be issued, thereby increasing efficiency and improving resource usage.
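Putting the steps of Fig. 4 together, a simplified software analogue might look like the following; the names, the modulo placement rule, and the grouping of the generation and issue steps are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <vector>

enum class TaskType { Graphics, Neural };

struct Task    { uint32_t taskId; TaskType type; };
struct Command { uint32_t commandId; std::vector<Task> tasks; };

struct ComputeUnit {
    std::vector<Task> graphicsModuleWork;  // graphics processing module
    std::vector<Task> neuralModuleWork;    // machine learning processing module
};

// Step 410: the command sequence has been received from the host processor.
// The loop below then walks the generated per-command tasks and issues them
// to computing units (step 430).
void allocateTasks(const std::vector<Command>& commandSequence,
                   std::vector<ComputeUnit>& units) {
    for (const Command& cmd : commandSequence) {
        for (const Task& task : cmd.tasks) {
            // Same task identifier -> same computing unit, so dependent tasks
            // can share that unit's local cache; the task type then selects
            // which processing module of the unit runs the task.
            ComputeUnit& unit = units[task.taskId % units.size()];
            (task.type == TaskType::Graphics ? unit.graphicsModuleWork
                                             : unit.neuralModuleWork).push_back(task);
        }
    }
}
```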
Fig. 5 schematically illustrates a system 500 for assigning tasks associated with commands in a sequence of commands.
The system 500 includes a main processor 110, such as a central processing unit or any other type of general purpose processing unit. The host processor 110 issues a command stream comprising a plurality of commands, each having a plurality of tasks associated therewith.
The system 500 also includes at least one other processor 130 configured to efficiently perform different types of tasks, as described above. The one or more other processors 130 may be any type of processor specifically configured to include at least a plurality of computing units 150a, 150b and a command processing unit 140 as described above. Each computing unit may include a plurality of processing modules, each configured to perform at least one type of operation. The processor 130 and the host processor 110 may be combined on a system on a chip (SoC), or across multiple SoCs, to form one or more application processors.
The system 500 may also include a memory 520, external to the processor 130, for storing data generated by tasks so that tasks running on other processors can easily access the data. However, it should be appreciated that this external memory will be used less often as a result of the task allocation described above, because tasks that need to use data generated by other tasks, or that need the same data as other tasks, will be allocated to the same computing unit 150a, 150b of the processor 130 in order to maximize the use of the local caches 156a, 156b.
In some examples, the system 500 may include a memory controller (not shown), which may be a Dynamic Memory Controller (DMC). The memory controller is coupled to the memory 520 and is configured to manage the flow of data into and out of the memory. The memory may include a main memory, alternatively referred to as "primary memory". The memory may be external memory, in that it is external to the system 500; for example, the memory 520 may include "off-chip" memory. The memory may have a larger storage capacity than the memory caches of the processor 130 and/or the main processor 110. In some examples, the memory 520 is included in the system 500; for example, the memory 520 may include "on-chip" memory. The memory 520 may include, for example, a magnetic or optical disk, a disk drive, or a solid-state drive (SSD). In some examples, the memory 520 includes synchronous dynamic random access memory (SDRAM); for example, the memory 520 may include double data rate synchronous dynamic random access memory (DDR-SDRAM).
One or more of the main processor 110, the processor 130, and the memory 520 may be interconnected using a system bus 510. This allows data to be transferred between the various components. The system bus 510 may be or include any suitable interface or bus. For example, an Arm Advanced Microcontroller Bus Architecture (AMBA) interface, such as the Advanced eXtensible Interface (AXI), may be used.
The above embodiments are to be understood as illustrative examples of the invention. Additional embodiments of the present invention are contemplated. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of features of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims (20)

1. A processor, the processor comprising:
a command processing unit to:
receiving a command sequence to be executed from a main processor; and
generating a plurality of tasks based on the command sequence; and
a plurality of computing units, wherein at least one computing unit of the plurality of computing units comprises:
a first processing module for executing tasks of a first task type generated by the command processing unit;
a second processing module for executing tasks of a second task type different from the first task type generated by the command processing unit;
a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of computing units, and wherein at least one of the plurality of computing units is to process at least one of the plurality of tasks.
2. The processor of claim 1, wherein the command processing unit is to issue tasks of the first task type to the first processing module of a given computing unit and to issue tasks of the second task type to the second processing modules of a plurality of given computing units.
3. The processor of claim 1 or claim 2, wherein the first task type is a task for assuming at least a portion of a graphics processing operation forming one of a set of predefined graphics processing operations, the predefined graphics processing operations collectively enabling a graphics processing pipeline, and wherein the second task type is a task for assuming at least a portion of a neural processing operation.
4. The processor of claim 3, wherein the graphics processing operation comprises at least one of:
a graphics compute shader task;
vertex shader tasks;
a fragment shader task;
a tessellation task; and
geometry shader tasks.
5. The processor of any preceding claim, wherein each computing unit is a shader core in a graphics processing unit.
6. The processor of any preceding claim, wherein the first processing module is a graphics processing module, and wherein the second processing module is a neural processing module.
7. A processor according to any preceding claim, wherein the command processing unit further comprises at least one dependency tracker to track dependencies between commands in the sequence of commands; and wherein the command processing unit is to wait for completion of processing of a given task of a first command in the sequence of commands using the at least one dependency tracker and then issue an associated task of a second command in the sequence of commands for processing, wherein the associated task depends on the given task.
8. The processor of claim 7, wherein the output of the given task is stored in the local cache.
9. The processor of claim 7 or claim 8, wherein each command in the sequence of commands has metadata, wherein the metadata includes an indication of at least a plurality of the tasks in the command, and a task type associated with each of the tasks.
10. The processor of claim 9, wherein the command processing unit assigns a command identifier to each command in the sequence of commands, and the dependency tracker tracks dependencies between commands in the sequence of commands based on the command identifiers.
11. The processor of claim 10, wherein when the given task of the first command depends on the associated task of the second command, the command processing unit assigns the same task identifier for the given task and the associated task.
12. The processor of claim 11, wherein a task of each of the commands to which the same task identifier has been assigned is performed on a same one of the plurality of computing units.
13. The processor of any one of claims 10 to 12, wherein the task assigned a first task identifier is performed on a first computing unit of the plurality of computing units and the task assigned a second, different task identifier is performed on a second computing unit of the plurality of computing units.
14. The processor of claim 11 or claim 12, wherein tasks assigned a first task identifier and having the first type are executed on the first processing module of a given one of the plurality of computing units and tasks assigned a different second task identifier and having the second task type are executed on the second processing module of the given one of the plurality of computing units.
15. The processor of any preceding claim, wherein each of the plurality of computing units further comprises at least one task queue, wherein the queued tasks comprise at least a portion of the command sequence.
16. The processor of claim 15, wherein a given queue is associated with at least one task type.
17. A method of assigning tasks associated with commands in a sequence of commands, the method comprising:
receiving, at a command processing unit, the command sequence to be executed from a host processor;
generating, at the command processing unit, a plurality of tasks based on the received command sequence; and
issuing, by the command processing unit, each task to a computing unit of a plurality of computing units for execution, each computing unit comprising:
the first processing module is used for executing tasks of a first task type;
the second processing module is used for executing tasks of a second task type; and
a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of computing units, and wherein at least one of the plurality of computing units is to process at least one of the plurality of tasks.
18. The method of claim 17, wherein when the task associated with the second command depends on the task associated with the first command, the command processing unit waits for processing of the task associated with the first command to complete and then issues the task associated with the second command to the given computing unit.
19. The method of claim 17 or claim 18, wherein each command has associated metadata including an indication of at least a plurality of tasks in the given command, and a task type associated with each of the plurality of tasks.
20. A non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon, which when executed by at least one processor are arranged to allocate tasks associated with commands in a sequence of commands, wherein the instructions, when executed, cause the at least one processor to:
receiving, at a command processing unit, the command sequence to be executed from a host processor;
generating, at the command processing unit, a plurality of tasks based on the received command sequence; and
issuing, by the command processing unit, each task to a computing unit of a plurality of computing units for execution, each computing unit comprising:
the first processing module is used for executing tasks of a first task type;
the second processing module is used for executing tasks of a second task type; and
a local cache shared by at least the first processing module and the second processing module;
wherein the command processing unit is to issue the plurality of tasks to at least one of the plurality of computing units, and wherein at least one of the plurality of computing units is to process at least one of the plurality of tasks.
CN202310956868.8A 2022-08-01 2023-08-01 Efficient task allocation Pending CN117492973A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202263394053P 2022-08-01 2022-08-01
EP22188053.7 2022-08-01
US63/394,053 2022-08-01
EP22188051.1 2022-08-01
EP22386054.5 2022-08-01
GB2214192.3 2022-09-28

Publications (1)

Publication Number Publication Date
CN117492973A 2024-02-02

Family

ID=89669664

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310956868.8A Pending CN117492973A (en) 2022-08-01 2023-08-01 Efficient task allocation
CN202310956459.8A Pending CN117495655A (en) 2022-08-01 2023-08-01 Graphics processor

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310956459.8A Pending CN117495655A (en) 2022-08-01 2023-08-01 Graphics processor

Country Status (1)

Country Link
CN (2) CN117492973A (en)

Also Published As

Publication number Publication date
CN117495655A (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN106557367B (en) Apparatus, method and device for providing granular quality of service for computing resources
US9804666B2 (en) Warp clustering
US9477526B2 (en) Cache utilization and eviction based on allocated priority tokens
JP2020537784A (en) Machine learning runtime library for neural network acceleration
KR101626378B1 (en) Apparatus and Method for parallel processing in consideration of degree of parallelism
US11609792B2 (en) Maximizing resource utilization of neural network computing system
US9250848B2 (en) Dynamically adjusting the complexity of worker tasks in a multi-threaded application
WO2021000281A1 (en) Instructions for operating accelerator circuit
US20150089202A1 (en) System, method, and computer program product for implementing multi-cycle register file bypass
US10949259B2 (en) System and method of scheduling and computing resource allocation optimization of machine learning flows
WO2023082575A1 (en) Graph execution pipeline parallelism method and apparatus for neural network model computation
CN111708639A (en) Task scheduling system and method, storage medium and electronic device
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
US9477480B2 (en) System and processor for implementing interruptible batches of instructions
Kim et al. Las: locality-aware scheduling for GEMM-accelerated convolutions in GPUs
CN117492973A (en) Efficient task allocation
US20240036919A1 (en) Efficient task allocation
CN114035847B (en) Method and apparatus for parallel execution of kernel programs
US20230024130A1 (en) Workload aware virtual processing units
Han et al. GPU-SAM: Leveraging multi-GPU split-and-merge execution for system-wide real-time support
CN112433847B (en) OpenCL kernel submitting method and device
US11847507B1 (en) DMA synchronization using alternating semaphores
CN112130977B (en) Task scheduling method, device, equipment and medium
CN112685168B (en) Resource management method, device and equipment
US10956210B2 (en) Multi-processor system, multi-core processing device, and method of operating the same

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication