CN117495655A - Graphics processor - Google Patents

Graphics processor

Info

Publication number
CN117495655A
Authority
CN
China
Prior art keywords
machine learning
learning processing
graphics processor
execution unit
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310956459.8A
Other languages
Chinese (zh)
Inventor
达仁·克罗克斯福德
莎尔吉欧·赛义德
伊西多罗斯·希德瑞斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
Original Assignee
ARM Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARM Ltd
Publication of CN117495655A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/60: Memory management

Abstract

Disclosed herein is a graphics processor including a programmable execution unit operable to execute a program to perform graphics processing operations. The graphics processor also includes special-purpose machine learning processing circuitry operable to perform processing operations for machine learning processing tasks. The machine learning processing circuit communicates with the programmable execution unit within the graphics processor. In this way, the graphics processor may be configured such that machine learning processing tasks may be performed by the programmable execution unit, the machine learning processing circuitry, or a combination of both, with different units being able to message each other accordingly to control the processing.

Description

Graphics processor
Background
The technology described herein relates to graphics processors, and in particular to using graphics processors to perform machine learning processes, such as neural network processes.
Neural networks can be used in processes such as machine learning, computer vision, and natural language processing operations. A neural network may operate on suitable input data (e.g., image or sound data) to ultimately provide a desired output (e.g., identification of an object in an image or recognition of speech in a sound clip, or some other useful output inferred from the input data). This process is commonly referred to as "inference" or "classification". In the context of graphics (image) processing, neural network processing may also be used for image enhancement ("denoising"), segmentation, "anti-aliasing", supersampling, etc., in which case an appropriate input image may be processed to provide a desired output image.
The neural network will typically process the input data (e.g., image or sound data) according to a network of operators, each operator performing a particular operation. These operations will typically be performed sequentially to produce the desired output data (e.g., a classification of the image or sound data). Each operation may be referred to as a "layer" of neural network processing.
Thus, neural network processing may include processing a sequence of "layers" such that the output from each layer is used as the input to the next processing layer. Fig. 1 shows an exemplary sequence of neural network processing layers from an initial input layer 101 to a final output layer 107, between which are various convolutional layers (C-layers) 102, 103, 104 and fully-connected layers (FC-layers) 105, 106.
The input layer 101 may be configured to receive input data (e.g., image or sound data) and provide the input data in a suitable form (e.g., as an array of data elements, also referred to as a "feature map") for use by a subsequent neural network layer. The feature map will typically comprise a three-dimensional array of data elements, each data element having data associated therewith. The feature map may have a width (W), a height (H), and a depth (C), wherein the width (W) and the height (H) may be defined as the number of data elements in the width and height directions, respectively, and the depth (C) may correspond to the number of data channels. For example, for input data comprising an image, the width and height of the array provided by the input layer may correspond to data positions (e.g., pixels) along the width and height directions of the image, respectively, while the channels may comprise the RGB channels of the image.
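By way of illustration only, the following Python sketch (using NumPy; the function name and normalisation scheme are assumptions, not part of the described arrangement) shows how an input layer might present an RGB image as an H x W x C feature map of the kind described above.

```python
import numpy as np

# Hypothetical input layer: wrap an RGB image as an initial feature map.
# Height (H) and width (W) index data positions (pixels); the depth (C)
# dimension holds the data channels (here the R, G and B channels).
def to_input_feature_map(image_rgb: np.ndarray) -> np.ndarray:
    assert image_rgb.ndim == 3 and image_rgb.shape[2] == 3  # H x W x 3
    return image_rgb.astype(np.float32) / 255.0             # normalise values

image = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
ifm = to_input_feature_map(image)
print(ifm.shape)  # (H, W, C) = (1080, 1920, 3)
```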
After the input layer, there may be one or more other layers of neural network processing (e.g., including convolutional layers, fully-connected layers, pooling layers, deconvolution layers, or any other layers of neural network processing that may be present).
Typically, the neural network processing layer will process the Input Feature Map (IFM) in order to generate a corresponding Output Feature Map (OFM) (e.g., in the case of a convolutional layer, a deconvolution layer, or a pooling layer), or an output value (e.g., probability in the case of a fully connected layer). The output generated by the neural network processing layer will be used as input to the next neural network processing layer in the sequence, and so on. This is shown in fig. 2.
As used herein, the term "feature map" may refer to an input feature map or an output feature map.
The operations performed by each neural network processing layer may comprise any suitable operations that manipulate an input (feature map) to provide an output (feature map). The operations may require processing parameters (e.g., weights, such as filter or "kernel" weights), which may be specific to a particular layer of neural network processing. Thus, as shown in fig. 2, the appropriate processing parameters (e.g., weights and biases) may be read from working memory (e.g., a buffer) in order to execute each neural network processing layer.
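As a minimal, hedged sketch of the above (a toy 1x1 convolution in Python; the buffer layout and names are illustrative assumptions), a layer reads its layer-specific weights and bias from working memory and maps an input feature map to an output feature map, which then serves as the next layer's input:

```python
import numpy as np

# Minimal sketch of one processing layer: a 1x1 convolution that reads its
# layer-specific parameters (weights, bias) from a working-memory "buffer"
# (here just a dict) and maps an input feature map to an output feature map.
def run_layer(ifm: np.ndarray, buffer: dict, layer_name: str) -> np.ndarray:
    weights = buffer[layer_name]["weights"]   # shape (C_in, C_out)
    bias = buffer[layer_name]["bias"]         # shape (C_out,)
    ofm = ifm @ weights + bias                # applied at every (h, w) position
    return np.maximum(ofm, 0.0)               # ReLU activation

ifm = np.random.rand(16, 16, 3).astype(np.float32)
buffer = {"layer1": {"weights": np.random.rand(3, 8).astype(np.float32),
                     "bias": np.zeros(8, dtype=np.float32)}}
ofm = run_layer(ifm, buffer, "layer1")        # OFM becomes the next layer's IFM
print(ofm.shape)  # (16, 16, 8)
```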
Referring to fig. 1, the final neural network processing layer in the sequence may comprise an output layer 107. The output layer may process the input feature map to generate useful output data (e.g., inference or classification results, or an output image in the case of image processing).
Although fig. 1 illustrates an example of a particular convolutional neural network, it should be understood that the neural network may have various other layer types and/or network architectures (e.g., recurrent neural network architecture).
Typically, in existing arrangements, data corresponding to the output feature map generated by a neural network processing layer may be written to a suitable working memory (e.g., a buffer), as shown in fig. 2. The next neural network processing layer may then read the data from the buffer for use as its input feature map.
In some data processing systems, a dedicated Neural Processing Unit (NPU) is provided as a hardware accelerator, the dedicated neural processing unit being operable to perform such machine learning processes as and when required, for example in response to an application program being executed on a host processor (e.g., a Central Processing Unit (CPU)) requiring such machine learning processes. For example, the NPU may be provided along the same interconnect (bus) as other hardware accelerators, such as graphics processors (graphics processing units, GPUs), such that the host processor (CPU) is operable to request the NPU to perform a set of machine learning processing operations accordingly, e.g., in a manner similar to the host processor being able to request the graphics processor to perform graphics processing operations. Thus, an NPU is a specialized hardware unit for performing such machine learning processing operations upon request of a host processor (CPU).
It has been recognized that, while not necessarily designed or optimized for this purpose, a Graphics Processor (GPU) may also be used (or re-used) to perform machine learning processing tasks. For example, neural network processing typically involves a series of Multiply and Accumulate (MAC) operations for multiplying input feature values by the associated weights of a kernel (filter) to determine output feature values. Graphics processor shader cores may be well suited to performing these types of arithmetic operations, because these operations are generally similar to the arithmetic operations that may be required when performing graphics processing work (but on different data). Moreover, graphics processors typically support a high degree of concurrent processing (e.g., supporting a large number of execution threads) and are optimized for data plane (rather than control plane) processing, all of which means that a graphics processor may be well suited to performing machine learning processing.
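A minimal sketch of the MAC operations referred to above (plain Python, illustrative only): each output feature value is accumulated from products of input feature values and the corresponding kernel weights.

```python
# Illustrative multiply-accumulate (MAC) loop of the kind referred to above:
# each output feature value is the accumulated product of input feature values
# and the corresponding kernel (filter) weights. Pure-Python sketch only.
def mac(input_values, kernel_weights, bias=0.0):
    acc = bias
    for x, w in zip(input_values, kernel_weights):
        acc += x * w      # one multiply-accumulate per weight
    return acc

print(mac([0.5, 1.0, -2.0], [0.1, 0.2, 0.3]))  # 0.05 + 0.2 - 0.6 = -0.35
```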
Thus, the graphics processor is operable to perform machine learning processing work. In this case, a Graphics Processor (GPU) may be used to perform any suitable and desired machine learning processing tasks. Thus, the machine learning process performed by a Graphics Processor (GPU) may include general training and inference jobs (which do not involve the graphics processing work itself). However, a Graphics Processor (GPU) may also perform machine learning (e.g., inference) jobs for graphics processing operations, such as when deep learning is used to perform "supersampling" techniques, or when denoising is performed, for example, during a ray tracing process.
Accordingly, applicants believe that there is room for an improved (e.g., more efficient) method for performing a machine learning process using a graphics processor.
Drawings
Embodiments of the technology described herein will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 shows an exemplary sequence of neural network processing layers including an input layer and an output layer, with various convolutional layers (C-layers) and fully-connected layers (FC-layers) between them;
FIG. 2 illustrates a series of neural network processing layers, wherein the output feature map from a neural network processing layer may be written to an appropriate buffer and then used as the input feature map for the next layer in the series, and wherein each neural network processing layer may use processing parameters (e.g., weights) read from an appropriate buffer;
FIG. 3 schematically illustrates an exemplary graphics processing system including a graphics processor, in accordance with one embodiment;
FIG. 4 schematically illustrates an embodiment of a graphics processor capable of operating in the manner of the techniques described herein;
FIG. 5 schematically illustrates an embodiment of another graphics processor capable of operating in the manner of the techniques described herein;
FIG. 6 schematically illustrates an example of how the processing of a convolutional neural network may be performed, in accordance with one embodiment; and
FIGS. 7 and 8 illustrate ray tracing denoising operations that may be performed using a graphics processor, according to one embodiment.
Like reference numerals are used for like features in the drawings (where appropriate).
Detailed Description
A first implementation of the technology described herein includes a graphics processor comprising:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform processing operations for machine learning processing tasks and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.
A second implementation of the technology described herein includes a method of operating a graphics processor, the graphics processor comprising:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
the method comprises the following steps:
the graphics processor performs machine learning tasks using a combination of the programmable execution unit and the machine learning processing circuit.
The technology described herein relates to a graphics processor (graphics processing unit, GPU) that includes a programmable execution unit operable to execute programs to perform graphics processing operations. In one embodiment, the graphics processor acts as an accelerator, e.g., and in one embodiment, it is under the control of a host processor (e.g., a Central Processing Unit (CPU)). Thus, when an application executing on the host processor requires graphics processing work, the host processor is operable to issue an appropriate request for graphics processing work to be performed by the graphics processor.
However, graphics processors may also be used to perform other more general purpose processing tasks. The technology described herein relates specifically to the case where a graphics processor is operating to perform processing for machine learning processing tasks, such as neural network processing.
In this regard, applicants have recognized that using a graphics processor to perform machine learning processing tasks may be a relatively inefficient use of the resources of the graphics processor, as graphics processors are typically not designed (or optimized) for such tasks, and doing so may thus result in lower performance, e.g., compared to using a dedicated machine learning processing unit (e.g., an NPU). At least in cases where the machine learning process involves graphics processing (rendering) tasks, repurposing some of the functional units of the graphics processor to perform the desired machine learning processing operations also prevents those functional units from performing the graphics processing work for which they were designed, which can also reduce the performance of the overall (rendering) process.
However, in some cases, it may still be desirable to perform machine learning processing tasks using a graphics processor (e.g., rather than using an external machine learning processing unit (NPU)). For example, this may be desirable, e.g., to reduce silicon area and reduce data movement, etc., especially in mobile devices where area and resources may be limited, and thus it may be particularly desirable to be able to perform the desired work using existing and available resources, thereby potentially completely avoiding the need for NPUs. There are other examples where this may be desirable, particularly where the machine learning process itself involves graphics processing tasks, and where it may be particularly desirable to free up execution units and other functional units of the graphics processor to perform actual graphics processing operations.
To facilitate this, the techniques described herein provide dedicated machine learning processing circuitry within the graphics processor, which may thus be used to perform machine learning operations as needed. The machine learning processing circuitry is provided (logically) within the graphics processor, for example and in one embodiment alongside the execution unit, and is operable to communicate with the execution unit within the graphics processor. The machine learning processing circuitry and the execution unit may therefore, and in one embodiment do, share at least some of the resources of the graphics processor, which may also improve overall efficiency (e.g., throughput, latency, energy efficiency) and/or reduce area, as will be explained further below.
In this way, by providing machine learning processing circuitry within the graphics processor, at least some machine learning processing operations may be performed more efficiently (e.g., in a more optimized manner) than by using the execution units of the graphics processor for general-purpose compute, while still allowing the machine learning processing to be performed locally on the graphics processor (e.g., rather than using a separate NPU accelerator, independent of the graphics processor, that may also be operated under control of the host processor), which may be beneficial in some circumstances.
That is, rather than using a completely separate machine learning processing unit (such as an NPU) that is independent of the graphics processor, or being able to perform machine learning processing operations using only execution units entirely, the techniques described herein propose to add dedicated machine learning processing circuitry to the graphics processor itself.
This means that the machine learning processing circuitry is operable, for example, to utilize some of the existing resources of the graphics processor (e.g., so that at least some of the functional units and resources of the graphics processor may be effectively shared, for example, between the machine learning processing circuitry and the execution units), while still allowing improved (more optimal) performance than if all of the processing were performed with general execution in the execution units.
Accordingly, in an embodiment, processing work may be divided between execution units and machine learning processing circuitry in order to provide more efficient use of the available processing resources of the graphics processor.
For example, when performing machine learning tasks that themselves involve graphics processing work, methods according to the techniques described herein may be particularly beneficial because in such cases all associated processing may (and in one embodiment is) performed locally to the graphics processor, thereby improving data locality, and reducing the need for external communications along the interconnect with other hardware units (e.g., NPUs), for example. In this case, at least some of the machine learning processing work may be offloaded to the machine learning processing circuitry, freeing the execution units as needed to perform the actual graphics processing operations.
In other words, by providing the machine learning processing circuitry within the graphics processor, it is meant that in one embodiment the machine learning processing circuitry is then operable to perform at least some of the machine learning processing operations while other functional units of the graphics processor concurrently perform the graphics processing operations. In the case where the machine learning process involves a portion of the overall graphics processing task, this may thus increase the overall efficiency (in terms of energy efficiency, throughput, etc.) of the overall graphics processing task.
In this regard, various arrangements are possible, as will be further explained below.
Thus, even when NPUs are also provided, it may still be desirable to be able to use a graphics processor to perform at least some of the machine learning processing, particularly when machine learning involves graphics processing operations.
That is, applicants have recognized that many graphics processing operations themselves involve machine learning processes, and in such cases, executing the required machine learning processes locally at the graphics processor may be particularly beneficial, even where a separate NPU is provided that may otherwise be used to perform the machine learning processing tasks. One example of this is when a deep learning process is used to perform so-called "supersampling" and/or other "antialiasing" techniques. Another example may be a denoising application when performing a ray tracing process. Various other examples are also possible.
Thus, in some embodiments, the machine learning processing operations performed by the graphics processor are part of the overall graphics processing job performed by the graphics processor.
However, the machine learning process work performed by the graphics processor may generally include any suitable and desired machine learning process work, and need not involve the graphics process work itself. In this case, the techniques described herein may still provide various benefits as compared to more conventional approaches, as will be further explained below.
In particular, providing dedicated machine learning processing circuitry within the graphics processor allows a degree of optimization of the machine learning processing operations, while still retaining the benefit of (re)using some of the local resources and area of the graphics processor when the graphics processor is used to perform the machine learning processing. For example, in an embodiment, by utilizing the machine learning processing circuitry, a graphics processor may perform at least some machine learning processing operations in a more efficient manner (e.g., as compared to more conventional graphics processor arrangements in which such computations can only be performed entirely by the execution units), thus reducing the need for a separate NPU (and hence reducing overall area, although a separate NPU may still be provided if desired).
Accordingly, graphics processors in accordance with the techniques described herein may provide various benefits when compared to more conventional graphics processors when performing machine learning processing operations.
The graphics processor may comprise any suitable and desired graphics processor including programmable execution units (circuitry).
The programmable execution unit may be any suitable and desirable programmable execution unit (circuit) that the graphics processor may contain. The programmable execution unit should be operable to execute a graphics shading program to perform graphics processing operations. Thus, the programmable execution unit will receive the graphics threads to be executed and execute the appropriate graphics shading program for those threads to generate the desired graphics output.
There may be a single or multiple programmable execution units. In one embodiment, there are multiple execution units, which in one embodiment are arranged as respective "shader cores". Thus, a "shader core" typically includes an execution unit along with the corresponding interfaces and one or more other functional units with which the execution unit may communicate, as described below. Where there are multiple programmable execution units (shader cores), in one embodiment, each execution unit may operate in the manner of the techniques described herein.
The graphics processor is operable to perform any desired processing work. This may be the graphics processing job itself, or may include more general processing operations. However, the techniques described herein are particularly directed to cases where the processing effort includes a set of machine learning processing operations (such as for neural network processing).
To facilitate this, the graphics processor of the techniques described herein thus also includes machine learning processing circuitry operable (and dedicated) to perform operations directed to machine learning processing tasks.
Thus, machine learning processing tasks issued to a graphics processor may typically be performed entirely using the programmable execution units (e.g., as compute shading), entirely using the machine learning processing circuitry, or (in one embodiment) using a combination of both. In this regard, various examples will be possible, as will be further explained below.
The machine learning processing circuitry of the graphics processor may be, and in one embodiment is, a (substantially) fixed-function hardware unit (circuit) configured to perform processing operations for machine learning processing tasks. The machine learning processing circuitry should therefore comprise one or more appropriate fixed-function circuits to perform the required operations, but may have some limited form of configurability in use, for example if required.
In an embodiment, the machine learning processing circuit is configured to perform arithmetic operations, such as, and in one embodiment, Multiply and Accumulate (MAC) operations. Thus, in one embodiment, the machine learning processing circuit includes one or more MAC circuits configured to perform such operations. The machine learning processing circuitry may thus load an input feature map from a respective buffer (in general, "storage" that may be integrated with the machine learning processing circuitry, or that may be located elsewhere in the graphics processor (shader core) and accessed by the machine learning processing circuitry; various arrangements are possible), together with a set of weights, biases, etc., perform the required arithmetic (e.g., MAC) operations to generate the corresponding output feature map, and then write the output feature map into an appropriate buffer. In this regard, various arrangements will be possible.
Thus, in one embodiment, the machine learning processing circuit may also have access to one or more buffers for storing data that may be required for machine learning processing operations. These buffers may be integrated into the machine learning processing circuitry or may be located within the graphics processor (shader core) but accessible by the machine learning processing circuitry and may be used to store data for machine learning processing operations. For example, machine learning processes typically involve input data in the form of input feature maps, output data in the form of output feature maps, weights to be applied, and any other control information (data structures, programs, etc.) that determines the processing operations to be performed, and thus when performing machine learning tasks, the data needs to be loaded into a graphics processor and at least temporarily stored for use by it.
Thus, in an embodiment, the machine learning processing circuit has an interface to a memory system of a graphics processor in which such data resides. For example, in an embodiment, the graphics processor communicates with an external (e.g., main) memory.
In an embodiment, the graphics processor has one or more external memory access interfaces that are common to all types of data that may need to be transferred between the graphics processor and external memory. That is, in one embodiment, all memory requests (whether for graphics processing work or machine learning processing work) are made via the same shared memory interface, in one embodiment via a shared cache system. For example, in one embodiment, the graphics processor includes a cache (or arrangement of caches) local to the graphics processor, such as one or more level 2 (L2) caches, via which data may be transferred to/from external memory, and which may also be (and in one embodiment is) utilized by the machine learning processing circuitry when fetching machine learning data from external memory. In other words, in one embodiment, a cache system (e.g., one or more L2 caches) is shared between the execution units and the machine learning processing circuitry.
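The following sketch (Python; the class and key names are hypothetical) illustrates the shared memory path described above, with graphics and machine learning requests both passing through the same L2 cache on their way to and from external memory.

```python
# Minimal sketch (hypothetical names) of the shared memory path described
# above: both graphics data requests and machine learning data requests go
# through the same L2 cache on their way to/from external memory.
class SharedL2Cache:
    def __init__(self, external_memory: dict):
        self.external_memory = external_memory
        self.lines = {}            # cached address -> data
        self.hits = self.misses = 0

    def read(self, address):
        if address in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines[address] = self.external_memory[address]  # line fill
        return self.lines[address]

memory = {"ifm_block_0": [0.1, 0.2], "texture_7": [255, 128]}
l2 = SharedL2Cache(memory)
l2.read("texture_7")       # request from graphics processing work
l2.read("ifm_block_0")     # request from machine learning processing work
l2.read("ifm_block_0")     # second ML request hits in the shared cache
print(l2.hits, l2.misses)  # 1 2
```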
In one embodiment, the machine learning processing circuit has at least some dedicated local storage (e.g., buffers). This may be used, for example, to store the machine learning algorithm (e.g., neural network) itself.
Feature maps, weights, biases, etc., or portions thereof, may also be stored locally to the machine learning processing circuitry in dedicated corresponding buffers for the data. For example, portions of the weights may be stored locally to the machine learning processing circuitry, or at least to the graphics processor (shader core). The feature map may be more typically streamed from the cache and/or memory system, but in this regard various arrangements will be possible.
However, it should be appreciated that the machine learning process may generate a large amount of data. For example, when processing a neural network, the feature map may generally be a relatively large data structure. Also, when processing different layers, the kernel weights need to be stored/retrieved accordingly.
Thus, in embodiments, rather than adding dedicated storage for this purpose, the graphics processor is configured to allow other storage (buffers) that is already available to the graphics processor to be reused to store data for the machine learning process when needed.
Indeed, a benefit of the techniques described herein is that graphics processors typically (and in an embodiment) already have relatively large (e.g., tile) buffers on-chip, and a cache system for external memory accesses, which, as described above, are available to handle large amounts of graphics data and thus may also be (and in one embodiment are) utilized by the machine learning processing circuitry to store machine learning data.
Thus, in addition to any dedicated storage devices (buffers) that the machine learning processing circuitry may have, in one embodiment, the machine learning processing circuitry also has access to various other storage devices (buffers) within the graphics processor that may be reused to store data that may be required for machine learning processing operations.
In an embodiment, this includes at least a set of one or more tile buffers that are also used when performing normal tile-based rendering, but reused to store machine learning data. Thus, in an embodiment, a graphics processor is configured to perform tile-based rendering, wherein graphics data is stored in one or more tile buffers, and wherein when performing machine learning processing tasks, the tile buffers are used to store at least some data for the machine learning processing tasks.
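As an illustration of this reuse (Python; the buffer model is an assumption, not the actual tile buffer design), the same on-chip tile buffer that holds a colour tile during tile-based rendering can be repurposed to hold feature-map data during a machine learning processing task:

```python
import numpy as np

# Hedged sketch of the tile-buffer reuse described above: the same on-chip
# buffer that holds a colour tile during tile-based rendering is reused to
# hold feature-map data while a machine learning processing task runs.
class TileBuffer:
    def __init__(self, size_bytes: int):
        self.size_bytes = size_bytes
        self.contents = None
        self.usage = "idle"

    def store(self, data: np.ndarray, usage: str) -> None:
        assert data.nbytes <= self.size_bytes, "data does not fit in tile buffer"
        self.contents, self.usage = data, usage

tb = TileBuffer(size_bytes=64 * 1024)
tb.store(np.zeros((32, 32, 4), dtype=np.uint8), usage="colour tile")      # rendering
tb.store(np.zeros((16, 16, 32), dtype=np.float16), usage="feature map")   # ML task
print(tb.usage)
```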
In embodiments, the storage available to the machine learning processing circuit may also include, for example, a load/store unit (cache) associated with the execution unit, and/or any other suitable storage (buffer) that may be reused to store machine learning data.
Thus, the machine learning processing circuit may, and in one embodiment does, have a (direct) interface to at least some of these buffers (e.g., to the tile buffers). As described above, in one embodiment, the machine learning processing circuit also has access to an external memory interface of the graphics processor, for example and in one embodiment, via an L2 cache.
Thus, when the graphics processor is performing machine learning processing tasks, the graphics processor may request the required data (e.g., feature maps, weights, etc.) from memory. These data may then be loaded via the cache system and then provided accordingly to the graphics processor for use in performing machine learning processing operations.
For example, data may be transferred from the L2 cache appropriately into various buffers (e.g., tile buffers) available to the machine learning processing circuitry.
For example, it may be appropriate, and thus in embodiments, to store weights in a load/store cache and/or within a tile buffer. Thus, the shader core may request weight data from memory (e.g., via a cache system). The weight data may then be read in via a cache and then transferred to an appropriate (e.g., tile) buffer associated with the shader core under consideration.
Feature maps are typically relatively large data structures. In an embodiment, the feature map currently being used for machine learning processing operations is stored in one or more buffers (e.g., tile buffers) within the graphics processor. For example, an input feature map is transferred from the L2 cache into one of the tile buffers of the graphics processor ready for processing, and the output feature map resulting from the processing is then written into another one of the tile buffers.
Where there are multiple execution units arranged as respective shader cores, each shader core may have its own set of (e.g., tile) buffers. However, in one embodiment, all shader cores share the same cache system. In an embodiment, machine learning processing tasks may be distributed/partitioned among multiple shader cores such that all of the shader cores perform portions of the same processing task. In this case, in one embodiment, machine learning data (feature maps, weights, etc.) is transmitted from the cache to all of the shader cores that need the data simultaneously, e.g., in a broadcast fashion. This helps reduce memory access bandwidth and may improve data locality.
Thus, when feature maps (and potentially kernel weights/biases) are not currently being used, they are in one embodiment retained in the L2 cache, provided that there is enough space in the L2 cache to do so. The feature maps (and potentially also weights, biases, etc.) may then be transmitted to any shader cores that need them for particular processing operations. Of course, if a feature map does not fit in the L2 cache, it may be written out to external memory and read back when needed, e.g., in the normal manner for cache operation.
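The broadcast/partitioning scheme described above might look as follows (Python sketch; the 1x1-convolution arithmetic and the split by output channel are illustrative assumptions): the input feature map is sent once to every shader core, and each core processes only its own subset of the kernels.

```python
import numpy as np

# Hedged sketch of the broadcast/partitioning scheme described above: the
# input feature map is "broadcast" once to every shader core, and each core
# processes its own subset of the kernels (output channels).
def run_on_shader_cores(ifm, kernels, num_cores):
    chunks = np.array_split(np.arange(kernels.shape[1]), num_cores)
    partial_ofms = []
    for core_id, channel_ids in enumerate(chunks):
        # Each shader core receives the same IFM plus only its kernel subset.
        partial_ofms.append(ifm @ kernels[:, channel_ids])
    return np.concatenate(partial_ofms, axis=-1)   # assemble the full OFM

ifm = np.random.rand(8, 8, 4).astype(np.float32)       # broadcast to all cores
kernels = np.random.rand(4, 16).astype(np.float32)     # 16 output channels
ofm = run_on_shader_cores(ifm, kernels, num_cores=4)
print(ofm.shape)  # (8, 8, 16)
```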
Because these are relatively large data structures, in one embodiment, feature maps, weights, etc. are stored in a compressed form in memory and then decompressed for use by the graphics processor.
To this end, the machine learning processing circuitry and/or shader core may have associated compression/decompression circuitry.
However, in an embodiment, compression/decompression of the machine learning data is performed when/while data is transferred to/from the external memory system. For example, a cache system of a graphics processor may include suitable compression/decompression circuitry (which already exists for compressing graphics data), which may thus be used to compress machine learning process data.
Thus, in an embodiment, the graphics processor further includes compression and decompression circuitry for compressing and decompressing data as it is transferred between the graphics processor (shader core) and external memory.
Thus, in one embodiment, the machine learning processing circuitry also has access to these compression and/or decompression circuitry so that the activation layer, weights, etc. may be transferred to/from the memory system in a compressed format. The compression and/or decompression unit may be associated with the machine learning processing circuit itself or, in some embodiments, with the cache system.
For example, when transferring data from a graphics processor (shader core) to a cache system, the data may be compressed/decompressed, e.g., such that the data is stored in a compressed form in the cache. Alternatively, the data may be stored in the cache in uncompressed form and compressed/decompressed as it is transferred from the cache system of the graphics processor to the external memory. Thus, compression/decompression may generally occur at any suitable location between the graphics processor shader core that uses data and external memory. In this regard, various arrangements will be possible.
There may be a combined compression and decompression unit operable to perform both compression and decompression, or separate compression and decompression units may be provided. In one embodiment, the compression and decompression circuitry is configured to be capable of compressing all types of data to be transferred from the graphics processor to memory, including, for example, both graphics data and machine learning process data. However, separate, respective compression and decompression circuits may also be used for different types of data.
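Purely as an illustration of this dataflow (Python; zlib stands in for whatever lossless codec the compression circuitry actually implements, and the memory model is hypothetical), feature maps can be compressed on the way out to external memory and decompressed on the way back in:

```python
import zlib
import numpy as np

# Illustrative only: feature maps and weights are stored compressed in
# external memory and decompressed on the way back in. zlib stands in here
# for whatever codec the compression circuitry actually implements.
def write_to_external_memory(memory: dict, key: str, data: np.ndarray) -> None:
    memory[key] = (zlib.compress(data.tobytes()), data.dtype, data.shape)

def read_from_external_memory(memory: dict, key: str) -> np.ndarray:
    raw, dtype, shape = memory[key]
    return np.frombuffer(zlib.decompress(raw), dtype=dtype).reshape(shape)

external_memory = {}
ofm = np.zeros((64, 64, 8), dtype=np.float16)          # compresses well
write_to_external_memory(external_memory, "ofm_layer3", ofm)
restored = read_from_external_memory(external_memory, "ofm_layer3")
print(np.array_equal(ofm, restored), len(external_memory["ofm_layer3"][0]))
```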
In one embodiment, the machine learning processing circuit further includes a suitable local storage device, such as a queue or cache, for buffering requests/data for the machine learning process. For example, the machine learning processing circuitry may include a translation look-aside buffer to store translations of recently used virtual to physical memory addresses ("VA/PA translations") to speed retrieval of data.
There may be a single or multiple machine learning processing circuits, for example, such that multiple programmable execution units share a given (or single) machine learning processing circuit, and/or such that a given programmable execution unit has access to and may communicate with and use multiple different machine learning processing circuits. Where there are multiple machine learning processing circuits, in one embodiment, each such circuit may operate in the manner of the techniques described herein.
The machine learning processing circuitry may be configured to perform any suitable operations that may be desirable for the machine learning process. For example, in some embodiments, the machine learning processing circuitry may be designed to be capable of performing all of the processing required for a particular machine learning processing task, such as for processing a convolutional neural network. However, in other embodiments, the machine learning processing circuit is configured to perform some, but not all, of the required operations, thus dividing the machine learning processing work between the machine learning processing circuit and the execution unit.
For example, in one embodiment, where the machine learning processing task involves processing a convolutional neural network, the machine learning processing circuitry is in one embodiment configured to perform at least the processing of the convolutional layers. Thus, for a given convolutional layer, the machine learning processing circuitry may read in the relevant input feature map along with the relevant kernel weights, biases, etc., perform the required processing (e.g., using its MAC circuitry), and then write the output feature map out to the appropriate buffer.
In addition to processing convolutional layers as described above, the machine learning processing circuitry may also perform at least some (e.g., relatively simple) pooling operations and/or activation functions. The machine learning processing circuitry may also perform any other desired operations when processing the neural network, but in some embodiments the processing of fully-connected layers, and in one embodiment any other more complex pooling operations, etc., is passed to the execution unit and performed by executing an appropriate (compute) shader program. Some convolution operations may also be passed to the execution unit as desired, for example where the convolution operations correspond to non-standard convolution operations. That is, it may be better, e.g., more efficient, to configure the machine learning processing circuitry to perform some, but not all, of the convolutions.
In this regard, various arrangements will be possible, and the benefit of the techniques described herein is that there is flexibility in distributing processing among the various functional units of the graphics processor in this manner.
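One possible division of work of the kind described above is sketched below (Python; the layer names and routing policy are illustrative assumptions, not a fixed rule): standard convolutions, simple pooling and activations go to the machine learning processing circuitry, while fully-connected layers, complex pooling and non-standard convolutions fall back to compute shading on the execution unit.

```python
# Hedged sketch of one possible division of work described above. The layer
# names and the routing policy are illustrative assumptions only.
ML_CIRCUIT_LAYERS = {"conv2d", "max_pool", "relu"}

def assign_layer(layer_type: str) -> str:
    return "ml_circuit" if layer_type in ML_CIRCUIT_LAYERS else "execution_unit"

network = ["conv2d", "relu", "max_pool", "conv2d", "depthwise_conv_nonstd",
           "global_avg_pool", "fully_connected"]
for layer in network:
    print(f"{layer:>22} -> {assign_layer(layer)}")
```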
Thus, in an embodiment, when the graphics processor is performing machine learning processing tasks, at least some (but not all in an embodiment) of the processing operations are offloaded to the machine learning processing circuitry.
Another benefit of providing dedicated machine learning processing circuitry within the graphics processor is that the graphics processor can then be designed to better handle different data formats. For example, the execution units of a graphics processor are typically (and in embodiments) configured to perform (only) floating point and fixed point computations, e.g., and in one embodiment, configured to support only certain standard floating point or fixed point data formats (e.g., standard 32-bit, 16-bit or 8-bit fixed point or floating point data formats), as is typically desired for graphics processing tasks. In that case, the machine learning processing circuitry may be operable and arranged to perform processing on any (or all) of the floating point, fixed point or integer data formats, as required. That is, providing dedicated machine learning processing circuitry within the graphics processor means that the machine learning processing circuitry may be configured to act on whatever data formats are desired for machine learning processing operations, while the execution unit may be configured to perform (only) certain types of floating point and fixed point computations. For example, machine learning processing tasks may use specialized (non-standard) floating point or fixed point data formats, such as 12-bit, 9-bit, etc., that differ from the data formats typically used for graphics processing tasks (and for which the execution units are in one embodiment configured). Thus, the machine learning processing circuitry may be configured to process different data formats from the execution units, for example according to the machine learning processing operations that the machine learning processing circuitry is designed to accelerate. This may also facilitate the distribution of operations between the two circuits. Various arrangements will of course be possible in this respect.
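As a toy illustration of such a non-standard narrow format (Python; the 9-bit signed fixed-point layout and its scaling are arbitrary assumptions), values can be quantised for the machine learning processing circuitry and converted back for the execution unit:

```python
# Illustrative sketch of the non-standard narrow data formats mentioned
# above: quantising float values to a hypothetical signed 9-bit fixed-point
# format (the number of fraction bits is chosen arbitrarily here). A real
# machine learning processing circuit would define its own formats.
def to_fixed9(x: float, frac_bits: int = 4) -> int:
    scaled = round(x * (1 << frac_bits))
    return max(-256, min(255, scaled))          # clamp to 9-bit signed range

def from_fixed9(q: int, frac_bits: int = 4) -> float:
    return q / (1 << frac_bits)

for value in (0.3, -1.75, 20.0):
    q = to_fixed9(value)
    print(value, "->", q, "->", from_fixed9(q))  # 20.0 saturates to 15.9375
```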
The graphics processor in one embodiment also includes an (overall) job controller (interface) operable to schedule processing work for the graphics processor. For example, the job controller may be operable to receive tasks/jobs to be executed by the graphics processor, e.g., via an appropriate command stream provided to the graphics processor by a driver for the graphics processor. The job controller may then, for example, schedule and assign the processing of the corresponding task/job to the graphics processor (and to the appropriate functional units of the graphics processor).
In one embodiment the (overall) job controller is common to all types of processing work and is therefore able to schedule both graphics processing and machine learning processing work as required (although there may then be additional lower level job controllers that break such work down into subtasks etc. to issue to different functional units and these lower level job controllers may be dedicated to a particular functional unit/type of work).
As described above, in an embodiment, there are multiple execution units (which in one embodiment are arranged as respective shader cores). In an embodiment, each shader core has its own respective machine learning processing circuitry. However, the machine learning processing circuitry may also be provided external to the shader cores, and/or multiple shader cores may share one or more machine learning processing circuitry.
Thus, in one embodiment, the job controller is arranged to schedule and allocate jobs to different execution units (shader cores) accordingly. For example, in embodiments where there are multiple shader cores, the machine learning processing tasks may be distributed among the multiple shader cores.
In this case, multiple shader cores may be arranged to process the same region at the same time. In this case, the input feature map may be broadcast from the L2 cache, for example, to each of the plurality of shader cores, to perform a respective portion of the processing operations for the region. For example, each shader core may then process a respective subset of the kernels. This approach works well to increase data locality and/or reduce external memory accesses, as all of the shader cores will typically need the data at the same time, and the graphics processor has the ability to allocate work in this way. Moreover, machine learning processing is typically deterministic, enabling the job controller to accurately allocate an appropriate number of shader cores for performing the processing work, and to schedule the work accordingly.
The machine learning processing work may be distributed between the execution units and the machine learning processing circuits within the respective shader cores in various suitable ways. Various arrangements for controlling the allocation of machine learning processing work between the execution units and the machine learning processing circuits are envisaged.
In an embodiment, the job controller is operable to schedule processing work (only) for the execution units. In this case, the operation of the machine learning processing circuit may be controlled (triggered) by the execution unit. Thus, in an embodiment, the job controller schedules one or more processing tasks for the execution units. The thread generator circuit then generates corresponding execution threads for the execution units accordingly. The execution unit may be caused to execute a program and the program may include one or more instructions that cause the machine learning processing circuit to perform a machine learning process. Thus, when an execution unit encounters and executes such instructions, in one embodiment, the execution unit is then caused to message the machine learning processing circuit to cause the machine learning processing circuit to perform a set of one or more machine learning processing operations as needed. The result of the processing may then be returned to the execution unit accordingly.
The message to the machine learning processing circuit may thus include any suitable and required information relating to the machine learning processing operations to be performed. For example, the message may include an indication of one or more of: the machine learning processing operation to be performed; the location of the input feature map; and the location to which the output feature map should be written. Any other suitable and desired information relating to the machine learning processing may also be indicated in the message.
Thus, in one embodiment, the graphics processor is configured such that (and the method involves correspondingly the steps of), when the execution unit is executing a program comprising instructions related to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instructions, the programmable execution unit is caused to message the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.
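A hedged sketch of such a message is shown below (Python; the field names and the submit interface are assumptions about what such a message might carry, not the actual message format):

```python
from dataclasses import dataclass

# Hedged sketch of the kind of message described above, sent by the
# programmable execution unit to the machine learning processing circuit
# when it executes a machine-learning instruction in a shader program.
@dataclass
class MLWorkMessage:
    operation: str          # machine learning processing operation to perform
    ifm_location: str       # where the input feature map is held
    ofm_location: str       # where the output feature map should be written
    weight_location: str    # where the weights/biases are held

class MLCircuitStub:
    def submit(self, msg: MLWorkMessage) -> None:
        print("ML circuit received:", msg)

def on_ml_instruction(ml_circuit, instruction) -> None:
    # Executed by the execution unit when it encounters the ML instruction.
    msg = MLWorkMessage(operation=instruction["op"],
                        ifm_location=instruction["ifm"],
                        ofm_location=instruction["ofm"],
                        weight_location=instruction["weights"])
    ml_circuit.submit(msg)  # results are later returned to the execution unit

on_ml_instruction(MLCircuitStub(),
                  {"op": "conv2d", "ifm": "tile_buffer_0",
                   "ofm": "tile_buffer_1", "weights": "ls_cache"})
```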
In this case, the machine learning processing task is effectively performed under the control of the execution unit, which offloads at least some (but not all in one embodiment) of the machine learning processing operations to the machine learning processing circuitry, and then returns the results of these processing operations to the execution unit. For example, as described above, the execution unit may offload processing of at least the convolutional layer to the machine learning processing circuit. However, more complex pooling and/or full-connection layer processing may still be performed by the execution units as appropriate. In this regard, various arrangements will be possible.
Alternatively, in some embodiments, the execution unit triggers the machine learning processing circuit to perform the machine learning processing task, but the machine learning processing task is then managed by the machine learning processing circuit.
In this case, the machine learning processing circuit itself may perform all of the processing tasks, or may pass some operations back to the execution unit (e.g., by triggering the generation of a thread, as will be explained below). Thus, the execution unit may perform some processing work for the machine learning processing task being performed by the machine learning processing circuit, and return the result of the processing work accordingly from the execution unit to the machine learning processing circuit.
The overall result of the machine learning processing task (i.e., the completed task) may then be returned to the execution unit accordingly, at least in the event that the execution unit triggered the operation.
As described above, in embodiments, the machine learning processing circuitry may be configured to perform some, but not all, of the processing for a given machine learning processing task. In this case, the machine learning processing circuit is operable to cause the execution unit to perform one or more operations (subtasks) as part of the overall machine learning processing task.
For example, in one embodiment, the machine learning processing circuit is operable to trigger generation of a thread of a (sub) program to be executed by the execution unit, which thread, when executed, causes the execution unit to perform a set of one or more processing operations for a machine learning process. In one embodiment, the machine learning processing circuitry is configured to message thread generation circuitry (e.g., compute shader endpoints) of the execution unit to trigger generation of such threads. That is, in one embodiment, the machine learning processing circuit has an interface to a thread generation circuit that is also used to generate other (e.g., computing) threads. However, the machine learning processing circuit may have its own thread generation circuit dedicated to generating machine learning threads.
In this case, the machine learning processing task is effectively managed by the machine learning processing circuit, where the execution unit acts as an accelerator, to which the machine learning processing circuit can offload some of the processing as needed, for example by generating appropriate threads.
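The reverse direction, in which the machine learning processing circuit hands sub-tasks back to the execution unit via the thread generation circuitry, might be sketched as follows (Python; all names are illustrative assumptions):

```python
# Hedged sketch of the arrangement described above: the machine learning
# processing circuit manages the task and hands selected sub-tasks back to
# the execution unit by asking the thread generation circuitry (compute
# shader endpoint) to spawn threads that run a sub-program.
class ThreadGenerator:
    def __init__(self, execution_unit):
        self.execution_unit = execution_unit

    def spawn(self, subprogram, work_items):
        # One thread per work item, each executing the given sub-program.
        return [self.execution_unit(subprogram, item) for item in work_items]

def execution_unit(subprogram, item):
    return subprogram(item)

thread_gen = ThreadGenerator(execution_unit)
# The ML circuit offloads a sub-task: apply a (toy) activation per element.
results = thread_gen.spawn(lambda x: max(x, 0.0), [-1.0, 0.5, 2.0])
print(results)  # [0.0, 0.5, 2.0]
```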
In other embodiments, the job controller may be configured to schedule processing work directly for both the execution unit (e.g., in the normal manner) and the machine learning processing circuitry, i.e., such that the job controller may issue work to the machine learning processing circuitry independently of the execution unit. In this case, when the graphics processor is to perform machine learning processing, the job controller may schedule one or more tasks to be performed by the machine learning processing circuit accordingly, to directly trigger the machine learning processing circuit to perform the machine learning processing task (i.e., without the execution unit having to trigger the operation).
Various other arrangements are also possible. Thus, when a machine learning processing task is to be performed, the machine learning processing may be divided between the machine learning processing circuitry and compute shading performed by the execution unit in various suitable ways, with the internal communication between the two circuits facilitating this approach.
Communication between the machine learning processing circuitry (and other such units) and the programmable execution unit may be facilitated as desired. In one embodiment, there is a suitable communication (messaging) network for passing messages between the various units. The communication (messaging) network may use any desired communication protocol and standard, such as a suitable interconnect/messaging protocol.
The graphics processor may otherwise have any suitable and desired form or configuration of graphics processor, and include and execute any other suitable and desired processing elements, circuits, units, and stages that the graphics processor may contain, and execute any suitable and desired form of graphics processing pipeline, as required to operate in the manner of the techniques described herein.
For example, in addition to machine learning processing circuitry, there may be other accelerators (specialized units) within the graphics processor that can communicate with the programmable execution units, such as load/store units (circuitry), one or more arithmetic units (circuitry), texture mappers, and the like, if desired. In principle, any of these units may also be utilized by the machine learning processing circuit when performing the machine learning processing task.
The graphics processor may also have any other suitable elements that the graphics processor may have. For example, in some implementations, the graphics processor may be arranged to perform tile-based graphics processing, in which case the graphics processor may include a tile divider circuit, one or more (and in one implementation, a plurality of) tile buffers, and so forth. Graphics processors may also include, for example, graphics processing pipelines including primitive setup circuitry, rasterizers, etc., as well as any other such functional units that a graphics processor typically or desirably has.
The graphics processor may be arranged to perform any desired processing work. However, as noted above, the techniques described herein relate specifically to situations in which a graphics processor is being used to perform machine learning processing. The machine learning processing may be any suitable and desired machine learning processing job. For example, in embodiments, it may comprise neural network processing, e.g., for "inference" or "classification" purposes. As other examples, the machine learning processing may include image processing, such as denoising, segmentation, and the like. The machine learning processing may also involve training tasks.
Thus, the machine learning process itself may be performed for any purpose. That is, in some implementations, the machine learning process may involve a general machine learning process task (i.e., not involve the graphics process itself).
However, in some embodiments, the machine learning process involves portions of the overall graphics processing task. Examples of machine learning processes involving graphics processing may include deep learning "supersampling" or denoising for ray tracing processes. Other examples are also possible.
In these cases, the image to be processed may be an image that has been previously generated by the graphics processor itself. For example, the image to be subjected to the machine learning process may currently be stored in a suitable buffer (e.g., tile buffer) of the graphics processor. The machine learning processing circuit may then process the images in the (tile) buffer and output the results of the machine learning processing accordingly into another (tile) buffer.
For example, in the case of ray traced denoising processing, the graphics processor is first operable to perform a ray traced rendering (or hybrid ray traced rendering) process to generate an initial output frame, e.g., in the normal manner for a ray traced (or hybrid ray traced) process. That is, the graphics processor may first perform some actual graphics processing (ray traced rendering) work to generate (render) an initial version of the output frame.
As part of the ray traced (or hybrid ray traced) rendering process, it may also be desirable to perform "denoising" of the initial output frame, e.g., in order to provide a better (e.g., smoother) frame for display. For example, ray tracing calculations are relatively complex, such that projecting a large number of rays would require a large amount of processing resources, which may not be practical for real-time rendering. This means that when generating the initial output frame, only a limited number (relatively few) of light rays will be projected and thus the initially generated output frame may be noisy.
To denoise the initial frame, the initial frame may be processed using a suitable neural network, i.e., a neural network that has been trained to produce smoother images. In one embodiment, the denoising is offloaded (at least partially) to the machine learning processing circuit of the techniques described herein. Thus, the current (noisy) frame is loaded into an appropriate buffer for input to the neural network, and neural network processing is then performed accordingly to generate a denoised output frame, which is then stored in another buffer. In an embodiment, one or more other (previous) frames, or in one embodiment an accumulation buffer storing one or more previous frames, is also provided as input to the denoising algorithm, along with information about frame motion (e.g., per-pixel motion vectors), to facilitate the denoising (although this is not required).
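Purely by way of illustration, the following sketch shows how the inputs to such a denoising pass might be assembled. It is a minimal NumPy sketch rather than the actual circuit or driver behaviour described herein; the frame size, the `warp_previous_frame` helper, the simple exponential accumulation, and the channel layout are assumptions made for the example only.

```python
import numpy as np

H, W = 540, 960  # assumed frame size for the example

def warp_previous_frame(prev, motion):
    """Reproject the previous frame using per-pixel motion vectors
    (nearest-neighbour gather; a real denoiser would filter)."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip((ys - motion[..., 1]).round().astype(int), 0, H - 1)
    src_x = np.clip((xs - motion[..., 0]).round().astype(int), 0, W - 1)
    return prev[src_y, src_x]

def build_denoiser_input(noisy, accumulation, motion, alpha=0.2):
    """Combine the current noisy frame with an accumulation buffer of
    previous frames to form the tensor fed to the denoising network."""
    history = warp_previous_frame(accumulation, motion)
    accumulation = alpha * noisy + (1.0 - alpha) * history  # temporal blend
    # Stack current frame, history and motion vectors as input channels.
    net_input = np.concatenate([noisy, history, motion], axis=-1)
    return net_input, accumulation

noisy = np.random.rand(H, W, 3).astype(np.float32)    # low-ray-count render
accum = np.zeros((H, W, 3), dtype=np.float32)         # accumulation buffer
motion = np.zeros((H, W, 2), dtype=np.float32)        # per-pixel motion vectors
net_input, accum = build_denoiser_input(noisy, accum, motion)
print(net_input.shape)  # (540, 960, 8): 3 + 3 + 2 channels
```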
To facilitate this operation, in one embodiment, the graphics processor (shader core) has multiple tile buffers. Furthermore, in one embodiment, the tile buffer is oversized to allow data (pixels) from adjacent tiles to be fetched and used simultaneously, e.g., because machine learning algorithms will typically require overlap from adjacent tiles. Thus, in one embodiment, adjacent tiles are processed as part of a quad to allow more efficient data access.
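As an illustration of why such overlap is needed, the sketch below fetches a tile together with a one-pixel border ("halo") taken from its neighbours, so that a convolution applied at the tile edge still sees valid input. The 16×16 tile size, the halo width, and the edge-clamping behaviour are assumptions for the example, not properties of any particular hardware.

```python
import numpy as np

TILE = 16   # assumed tile size
HALO = 1    # extra border needed by, e.g., a 3x3 convolution

def fetch_tile_with_halo(frame, tile_x, tile_y):
    """Return a (TILE + 2*HALO) square region around the requested tile,
    clamping at the frame edges (mimicking an oversized tile buffer that
    also holds pixels from adjacent tiles)."""
    h, w = frame.shape[:2]
    y0 = tile_y * TILE - HALO
    x0 = tile_x * TILE - HALO
    ys = np.clip(np.arange(y0, y0 + TILE + 2 * HALO), 0, h - 1)
    xs = np.clip(np.arange(x0, x0 + TILE + 2 * HALO), 0, w - 1)
    return frame[np.ix_(ys, xs)]

frame = np.random.rand(64, 64).astype(np.float32)
region = fetch_tile_with_halo(frame, tile_x=1, tile_y=2)
print(region.shape)  # (18, 18): a 16x16 tile plus a 1-pixel halo on each side
```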
In a similar manner, when performing rasterization-based rendering techniques, various supersampling/antialiasing techniques may be performed in an attempt to improve image quality. These may involve a deep learning process. For example, when performing rasterization-based rendering, it may again be desirable to reduce the amount of processing performed (which will result in lower quality images), and then apply additional supersampling/antialiasing techniques to improve the image quality for output.
By performing the machine learning process in the machine learning processing circuit, other functional units in the shader core are then freed to perform the process for which they are optimized. That is, while the machine learning processing circuitry is performing one or more machine learning processing tasks (deep learning supersampling, denoising, etc.) on the image data currently in the tile buffer, the rest of the graphics processor may perform actual graphics processing in parallel to continue graphics processing throughput. For example, in the ray tracing example given above, the graphics processor is free to project additional rays, such as continuing the ray tracing rendering process in parallel with the denoising operation performed for the current frame.
Thus, a particular benefit of the techniques described herein is that when machine learning involves graphics processing jobs, the execution unit, texture mapper, etc. are free to perform the graphics processing for which they are optimized, while machine learning processing is performed in the machine learning processing circuitry. Thus, overall throughput and energy efficiency may be improved. This energy efficiency may be particularly important for mobile devices (such as smartphones or tablets, etc.) that are limited in their battery life and in which a maximum power budget may exist. Thus, in various embodiments, the techniques described herein are employed within a data processing system within a mobile device. However, the techniques described herein may find utility within any suitable data processing system that may include a graphics processor and that may be used to perform machine learning processing.
The techniques described herein may be used for all forms of output that a graphics processor is capable of producing. Thus, it may be used when generating frames for display, for render-to-texture outputs, and so on. In one embodiment, the output from the graphics processor is exported to external (e.g., main) memory for storage and use.
In one embodiment, the graphics processor is part of an overall graphics (data) processing system that (e.g., and in one embodiment) includes a host processor (CPU) that, for example, executes applications that require processing by the graphics processor. The host processor will send appropriate commands and data to the graphics processor to control the graphics processor to perform graphics processing operations and produce the graphics processing output required by the application executing on the host processor. To facilitate this, the host processor should (and in one embodiment does) also execute a driver for the graphics processor and one or more compilers for compiling the programs to be executed by the programmable execution unit of the graphics processor.
The overall graphics processing system may, for example, include one or more of the following: a host processor (central processing unit (CPU)), a graphics processor (processing unit), a display processor, a video processor (codec), a system bus, and a memory controller.
The data processing system may also include a stand-alone Neural Processing Unit (NPU) that is also operable to perform operations under the control of the host processor. For example, the NPU may be connected along the same interconnect as the graphics processor, but is otherwise independent of the graphics processor. However, an NPU is not required, and a benefit of the techniques described herein is that a separate NPU may be avoided while still using the graphics processor to provide more efficient machine learning processing.
Where the system also includes an NPU, machine learning tasks may be distributed among the host processor (central processing unit (CPU)), the graphics processor (processing unit), and the NPU, if desired.
The graphics processor and/or graphics processing system may also include and/or be in communication with one or more memories and/or memory devices storing data described herein and/or output data generated by the graphics processor and/or storing software (e.g., a (shader) program) for performing the processes described herein. The graphics processor and/or graphics processing system may also be in communication with a display for displaying images based on data generated by the graphics processor. For example, the graphics processor may write its frame buffer out to memory, where the display processor then reads the frame buffer from memory for display. Various arrangements will be possible in this regard.
As will be appreciated from the foregoing, in a graphics processing system that may operate in the manner of the techniques described herein, in at least embodiments of the techniques described herein, a compiler (e.g., a compiler executing on a host processor) will generate and issue to the graphics processor one or more shader programs that, when executed, will perform the required processing operations in accordance with the techniques described herein, wherein the graphics processor (programmable execution unit of the graphics processor) then executes the program to perform the processing, and as part of that program execution, exchange the above-described messages with the machine learning processing circuitry of the graphics processor.
The technology described herein also extends to such overall data processing systems and the operation of such systems.
Another embodiment of the technology described herein includes a data processing system comprising:
a host processor; and
a graphics processor operable to perform operations under control of a host processor, wherein the graphics processor comprises:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.
Another embodiment of the technology described herein includes a method of operating a data processing system, wherein the data processing system includes:
a host processor; and
a graphics processor operable to perform operations under control of a host processor, wherein the graphics processor comprises:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
The method comprises the following steps:
the host processor requesting the graphics processor to perform a machine learning processing task; and
the graphics processor performing the machine learning processing task using a combination of the programmable execution unit and the machine learning processing circuit.
As will be appreciated by those of skill in the art, such embodiments of the techniques described herein may (and in one embodiment do) include any one or more or all of the features of the techniques described herein.
The techniques described herein may be implemented in any suitable system, such as a suitably configured microprocessor-based system. In one embodiment, the techniques described herein are implemented in a computer and/or microprocessor based system. In one embodiment, the techniques described herein are implemented in a portable device (such as a mobile phone or tablet in one embodiment).
The various functions of the techniques described herein may be performed in any desired and suitable manner. For example, the functionality of the techniques described herein may be implemented in hardware or software, as desired. Thus, for example, unless indicated otherwise, the various functional elements, stages, units, and "devices" of the techniques described herein may include suitable processor(s), controller(s), functional unit(s), circuit(s), processing logic, microprocessor arrangement, etc., that are operable to perform various functions, etc., such as appropriate special purpose hardware element(s) and/or programmable hardware element(s), that may be programmed to operate in a desired manner.
It should also be noted herein that various functions of the techniques described herein, etc., may be replicated and/or performed in parallel on a given processor, as will be appreciated by those skilled in the art. Also, the various processing stages, etc. may share one or more processing circuits, etc. if desired.
Methods in accordance with the techniques described herein may be implemented, at least in part, using software, such as a computer program. Thus, it can be seen that the techniques described herein provide, when viewed from a further embodiment: computer software, which is particularly adapted to perform the method described herein when installed on a data processor; a computer program element comprising computer software code portions for performing the method described herein when the program element is run on a data processor; and a computer program comprising code adapted to perform all the steps of one or more of the methods described herein when the program is run on a data processing system. The data processor may be a microprocessor system, a programmable FPGA (field programmable gate array), or the like.
The techniques described herein also extend to a computer software carrier comprising such software, which, when used to operate a display processor or a microprocessor system comprising a data processor, causes, in conjunction with said data processor, said controller or system to carry out the steps of the methods of the techniques described herein. Such a computer software carrier may be a physical storage medium such as a ROM chip, CD ROM, RAM, flash memory or magnetic disk, or may be a signal such as an electronic signal over wires, an optical signal, or a radio signal such as to a satellite, etc.
It should also be understood that not all of the steps of the methods of the techniques described herein need be performed by computer software, and thus, when viewed from a further broad embodiment, the techniques described herein provide computer software and such software installed on a computer software carrier for performing at least one of the steps of the methods described herein.
Accordingly, the techniques described herein may be suitably embodied as a computer program product for use with a computer system. Such an implementation may include a series of computer readable instructions fixed on a tangible, non-transitory medium, such as a computer readable medium, for example, a diskette, CD ROM, RAM, flash memory, or hard disk. It may also include a series of computer readable instructions transmittable to a computer system, via a modem or other interface device, over either a tangible medium (including, but not limited to, optical or analogue communications lines) or intangibly using wireless techniques (including, but not limited to, microwave, infrared or other transmission techniques). The series of computer readable instructions embodies all or part of the functionality previously described herein.
Those skilled in the art will appreciate that such computer readable instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including but not limited to semiconductor, magnetic, or optical technology, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, or microwave technology. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded with a computer system (e.g., on a system ROM or fixed disk), or distributed from a server or electronic bulletin board over a network (e.g., the Internet or World Wide Web).
Fig. 3 illustrates an exemplary system-on-chip (SoC) graphics processing system 8 in which the techniques described herein may be employed. As shown in fig. 3, the graphics processing system 8 in the present embodiment includes a host processor in the form of a Central Processing Unit (CPU) 1, a Graphics Processor (GPU) 2, a display processor 3, and a memory controller 5.
As shown in fig. 3, these units communicate via the interconnect 4 and have access to off-chip memory 6. In this system, the graphics processor 2 will render frames (images) to be displayed, and the display processor 3 then provides these frames to the display panel 7 for display.
In use of the system, an application 13, such as a game, executing on the host processor (CPU) 1 will, for example, require frames to be displayed on the display panel 7. For this purpose, the application will submit appropriate commands and data to the driver 11 for the graphics processor 2 that is executing on the CPU 1. The driver 11 will then generate the appropriate commands and data to cause the graphics processor 2 to render the appropriate frames for display and store those frames in an appropriate frame buffer, for example in the main memory 6. The display processor 3 then reads these frames from the frame buffer into a buffer for the display, from where they are read out and displayed on the display panel 7 of the display.
Of course, other arrangements are possible. For example, rather than displaying the frames on the local display panel 7, the rendered frames may be sent over a network to a remote device for display.
While its primary purpose within graphics processing system 8 is to perform such graphics processing operations, the Graphics Processor (GPU) 2 may also be used to perform more general purpose processing operations. That is, it has been recognized that graphics processors may also find utility for various other types of processing that do not necessarily involve graphics processing itself, but where similar operations are performed, on different data, to those that may be performed during graphics processing.
The present embodiments are particularly directed to the operation of a graphics processor (e.g., in a graphics processing system such as that shown in fig. 3) when the graphics processor is being used to perform machine learning processing, such as neural network processing. Neural network processing typically includes a plurality of processing layers, each of which performs operations on an input feature map to generate an output feature map, as shown in fig. 1 and 2, and described above. It should be appreciated that while fig. 1 and 2 show examples of specific convolutional neural networks for illustrative purposes, other examples are certainly possible, and the techniques described herein may be applied to any suitable neural network processing and any suitable neural network architecture (e.g., including any suitable arrangement of layers, which may be arranged as a convolutional neural network, but could also be a recurrent neural network, etc., depending on the machine learning processing task under consideration). Moreover, although shown as a series of separate layers in fig. 1 and 2, the processing of multiple layers may be combined together ("layer fusion").
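By way of a simple illustration of such "layer fusion", the sketch below applies a convolution, a bias add and a ReLU activation in a single pass over the data, rather than as three separate layer traversals that each write an intermediate feature map out to memory. It is a plain NumPy sketch under assumed shapes and is not intended to represent the fused kernels of any particular implementation.

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (H, W, Cin), w is (3, 3, Cin, Cout)."""
    H, W, Cin = x.shape
    Cout = w.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, Cout), dtype=x.dtype)
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out

def fused_conv_bias_relu(x, w, b):
    """One fused pass: convolution, bias add and ReLU, without storing the
    intermediate feature maps between 'layers'."""
    return np.maximum(conv3x3(x, w) + b, 0.0)

x = np.random.rand(32, 32, 8).astype(np.float32)
w = np.random.rand(3, 3, 8, 16).astype(np.float32)
b = np.zeros(16, dtype=np.float32)
y = fused_conv_bias_relu(x, w, b)
print(y.shape)  # (32, 32, 16)
```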
Thus, the machine learning process performed may generally involve any suitable machine learning process, such as using any suitable neural network.
The neural network processing operations described above may be performed by a shader core of a graphics processor, e.g., entirely using compute shading. However, this may be inefficient because the Graphics Processor (GPU) 2 is not optimized for this work. Furthermore, this means that the Graphics Processor (GPU) 2 is prevented from performing the actual graphics processing operations for which it is designed.
Thus, in accordance with the techniques described herein, dedicated machine learning processing circuitry is provided within Graphics Processor (GPU) 2, as will be explained further below.
Fig. 4 schematically illustrates the relevant elements and components of a Graphics Processor (GPU) 2 according to an embodiment of the present invention.
As shown in fig. 4, a Graphics Processor (GPU) 2 includes one or more shader (processing) cores 61, 62 along with a shared level 2 cache 64 that is operable to communicate (e.g., via an appropriate interconnect 4 and (dynamic) memory controller 5, as shown in fig. 3) with an off-chip memory system 6. In the configuration shown in fig. 4, a compression unit (compressor) 63 is provided that is operable to compress data as it is written back into the level 2 cache 64 (and conversely decompress data as it is loaded from the level 2 cache 64 for use by the Graphics Processor (GPU) 2). In fig. 4, the compression unit (compressor) 63 is thus a combined compression and decompression unit. However, if desired, there may be separate compression and decompression units. However, other arrangements are also possible. For example, in fig. 4, a compression (decompression) unit 63 is associated with a level 2 cache 64. However, the compression and decompression units may alternatively (or in addition) be provided within the shader (processing) cores 61, 62. As another example, compression/decompression may occur between the level 2 cache 64 and the external memory system 6, e.g., such that data is stored in the cache in uncompressed form, but is compressed as it is written from the cache 64 to the external memory 6.
Fig. 4 schematically shows the relevant configuration of one shader core 61, but any other shader core of the Graphics Processor (GPU) 2 will be configured in a corresponding manner, as will be appreciated by those skilled in the art.
Graphics Processor (GPU) shader cores 61, 62 include programmable processing units (circuits) in the form of execution engines 65 that perform processing operations by running small programs (commonly referred to as "shader" programs) for each "item" in the output to be generated, such as a render target, e.g., a frame. (In this regard, an "item" may be, for example, a vertex, one or more sampling locations, etc.) Each "item" will be processed by the shader core(s) through one or more execution threads that will execute the instructions of the shader program under consideration for the "item" under consideration. Typically, there will be multiple execution threads that each execute simultaneously (in parallel).
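As a purely conceptual model of this execution arrangement (and not a representation of any real shader instruction set), the sketch below runs the same "shader" function once per output item, with a pool of threads executing the items in parallel; the pool size and the trivial shader body are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def shader(item):
    """A trivial 'shader program' executed once per item (e.g. per sampling
    position): here it just derives a colour from the item's coordinates."""
    x, y = item
    return (x & 0xFF, y & 0xFF, (x ^ y) & 0xFF)

items = [(x, y) for y in range(4) for x in range(4)]  # one item per sample

# Many threads execute the same program over different items in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(shader, items))

print(results[:4])
```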
Other elements of Graphics Processor (GPU) 2 not shown in fig. 4 may be present, as will be appreciated by those skilled in the art. It should also be noted here that fig. 4 is only schematic and that, for example, in practice, even though the functional units shown are schematically shown as separate units in fig. 4, these functional units may share important hardware circuits. It will also be appreciated that, unless otherwise indicated, each of the elements and units, etc. of the graphics processor as shown in fig. 4 may be implemented as desired and will accordingly include, for example, appropriate circuitry (processing logic) or the like for performing the necessary operations and functions.
As shown in fig. 4, each shader core of Graphics Processor (GPU) 2 includes a suitable programmable execution unit (execution engine) 65 operable to execute a graphics shader program for execution threads to perform graphics processing operations.
Shader core 61 also includes an instruction cache 66 that stores instructions to be executed by programmable execution units 65 to perform graphics processing operations.
Shader core 61 also includes an appropriate load/store unit 76 in communication with programmable execution unit 65, which is operable, for example, to load data to be processed by the programmable execution unit 65 into an appropriate cache, and to write data back to the memory system (via the level 2 cache 64) (for the data loads and stores of programs executing in the programmable execution unit).
As shown in fig. 4, shader core 61 also includes a texture mapper unit in the form of texture mapping device 74, which is in communication with programmable execution unit 65 and is operable to perform texturing operations. Texture mapping device 74 includes suitable processing circuitry to follow the texturing instructions. In this embodiment, the processing circuitry is in the form of one or more dedicated hardware elements that are suitably configured. In one embodiment, texture mapping device 74 is also operable to fetch data from the memory system (although this is not shown in FIG. 4).
The graphics processor also includes local storage in the form of one or more tile buffers 75. For example, when performing (normal) tile-based graphics processing, the graphics processor may be operable to write data into these tile buffers 75. The tile buffer 75 may also be reused to store machine learning data while the graphics processor is performing machine learning processing tasks.
To perform graphics processing operations, programmable execution unit 65 will execute a graphics shader program (sequence of instructions) for the respective execution thread (e.g., corresponding to the respective sampling location of the frame to be rendered). Thus, as shown in FIG. 4, shader core 61 also includes a fragment thread creator (generator) 72 operable to generate execution threads for execution by programmable execution unit 65 as needed.
A job controller (job control interface) 77 is also provided which receives a request from the host processor (CPU) 1 for processing work to be performed by the Graphics Processor (GPU) 2 and issues corresponding processing tasks towards the shader cores 61, 62 accordingly. The job controller (job control interface) 77 is generally capable of scheduling any desired processing work for the Graphics Processor (GPU) 2, including normal graphics processing work as well as computing and machine learning processing work.
To facilitate performing machine learning processing work using Graphics Processor (GPU) 2, the shader cores of Graphics Processor (GPU) 2 are each provided with a respective machine learning processing circuit (neural processing accelerator, "NPA") 78 operable to communicate with an execution engine internal to the graphics processor. In this way, processing work can be distributed among the functional units as needed. In this regard, various options are contemplated, and generally, work may be distributed between the machine learning processing circuit (NPA) 78 and the execution engine 65 in various suitable manners.
For example, machine learning processing work may be initially triggered by the job controller (job control interface) 77 issuing appropriate processing tasks to the Graphics Processor (GPU) 2. The execution engine 65 may then execute an appropriate program to perform the processing tasks, the program including one or more instructions related to machine learning processing operations to be performed by the machine learning processing circuitry (NPA) 78.
When the execution engine 65 encounters and executes such instructions, the execution engine 65 may then appropriately message the machine learning processing circuit (NPA) 78 to cause the machine learning processing circuit (NPA) 78 to perform the desired processing operations.
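The following sketch models, in a deliberately abstract way, the message exchange described above: when the execution engine reaches an instruction relating to a machine learning operation, it messages the machine learning processing circuit and, in this simple model, waits for the result. The class names, the message format and the synchronous wait are all assumptions for the sketch; they do not correspond to an actual instruction set or hardware interface.

```python
from queue import Queue

class MachineLearningAccelerator:
    """Stand-in for the machine learning processing circuit (NPA)."""
    def __init__(self):
        self.inbox = Queue()

    def run_pending(self):
        # Process one queued request and return a result descriptor.
        op, payload = self.inbox.get()
        return {"op": op, "status": "done", "output_buffer": payload["dst"]}

class ExecutionEngine:
    """Stand-in for the programmable execution unit."""
    def __init__(self, npa):
        self.npa = npa

    def execute_program(self, program):
        for instr in program:
            if instr["kind"] == "ml_op":
                # Encountering the ML instruction: message the accelerator,
                # then (in this simple model) wait for it to finish.
                self.npa.inbox.put((instr["op"], instr["args"]))
                result = self.npa.run_pending()
                print("NPA completed:", result)
            else:
                print("executing graphics instruction:", instr["kind"])

npa = MachineLearningAccelerator()
engine = ExecutionEngine(npa)
engine.execute_program([
    {"kind": "load"},
    {"kind": "ml_op", "op": "conv2d",
     "args": {"src": "tile_buffer_0", "dst": "tile_buffer_1"}},
    {"kind": "store"},
])
```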
As shown in fig. 4, a machine learning processing circuit (NPA) 78 has an interface to the tile buffer 75 and also to the shader core interconnect and thus to the level 2 cache 64. The machine learning processing circuit (NPA) 78 is thus operable to take machine learning data from memory via the level 2 cache 64 using the resources of the graphics processor when performing the machine learning process and temporarily store it in, for example, the tile buffer 75 and/or the level 2 cache 64.
In the example shown in fig. 4, the machine learning processing circuit (NPA) 78 is unable to perform all of the machine learning processing work required for the current machine learning processing task. The machine learning processing circuit (NPA) 78 is thus capable of messaging the Compute Shader Endpoint (CSE) 73 of the shader core to generate threads for the execution engine 65 to perform work.
In this example, the machine learning process may thus be triggered by the execution engine 65, but then managed by machine learning processing circuitry (NPA) 78, where the machine learning processing circuitry (NPA) 78 causes the execution engine 65 to perform some of the processing tasks as needed. However, other arrangements are also possible.
For example, fig. 5 shows another example in which the machine learning processing circuit (NPA) 78 is capable of performing more (e.g., all) machine learning processing. Thus, the machine learning processing circuit (NPA) 78 in this example may not need to create threads to the Compute Shader Endpoint (CSE) 73. The machine learning processing circuit (NPA) 78 may still be requested by the execution engine 65 to perform work, as described above, or may be requested directly by the job controller (job control interface) 77 to perform work, as shown in fig. 5.
In other arrangements, the processing may be performed under control of the execution engine 65, where the job controller 77 requests work to be performed by the execution engine 65, and the execution engine 65 is able to message machine learning processing circuitry (NPA) 78 to perform the processing work, where the results are then written out and/or otherwise returned for use by the execution engine 65 accordingly.
Thus, in this embodiment, the machine learning processing circuit (NPA) 78 is operable to communicate with the execution engine 65 internal to the graphics processor in order to distribute processing work between the machine learning processing circuit (NPA) 78 and the execution engine 65 as needed.
In this regard, various options will be possible, and in general, the graphics processor of the techniques described herein may be operated in any of the above-described ways, or according to some combination of these methods, depending on the processing task under consideration.
For example, the machine learning processing tasks performed by the graphics processor may generally include any suitable and desired machine processing tasks. In embodiments, this involves the processing of convolutional neural networks, as shown in fig. 1 and 2.
Fig. 6 schematically illustrates one method for partitioning the processing of a convolutional neural network between a machine learning processing circuit (NPA) 78 and an execution engine 65.
In fig. 6, the processing of the convolution layers is performed by the machine learning processing circuit (NPA) 78, the machine learning processing circuit (NPA) 78 being configured and optimized accordingly for performing such convolutions. However, in this example, the pooling operations and any fully connected layer processing are still performed by the execution engine 65.
Of course, other examples are possible. For example, the machine learning processing circuit (NPA) 78 may also be configured to perform at least some of the pooling operations, with pooling operations being offloaded to the execution engine 65 only in particularly complex cases. Also, the machine learning processing circuit (NPA) 78 may be configured to handle only some types of convolutions (e.g., 3×c convolutions), while others (e.g., more complex, non-3×c convolutions) are passed to the execution engine 65. Alternatively, only portions of the convolution operations may be performed by the machine learning processing circuit (NPA) 78, while other portions of the convolution operations are performed by the execution engine 65. For example, the machine learning processing circuit (NPA) 78 may be used to perform the multiply-accumulate (MAC) operations, while the bias and activation functions are applied by the execution engine 65. Various examples will be possible in this regard.
Thus, in general, a given machine learning task may be performed entirely using the machine learning processing circuitry (NPA) 78, entirely using the execution engine 65, or some combination of both.
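To make this kind of split concrete, the sketch below routes the MAC-heavy convolution of a small network to a stand-in "NPA" path, while bias, activation and pooling stay on a stand-in "execution engine" path, in the spirit of fig. 6. The layer choices, the routing, and the NumPy implementations are assumptions for illustration only.

```python
import numpy as np

# Stand-ins for the two processing paths (assumed names, illustration only).
def on_npa(op_name, fn, x):
    print(f"{op_name:<16} -> machine learning circuit (NPA)")
    return fn(x)

def on_execution_engine(op_name, fn, x):
    print(f"{op_name:<16} -> programmable execution unit")
    return fn(x)

rng = np.random.default_rng(0)
x = rng.random((32, 32, 8), dtype=np.float32)
w = rng.random((8, 16), dtype=np.float32)   # 1x1 convolution weights
b = np.zeros(16, dtype=np.float32)

# Multiply-accumulate heavy work goes to the "NPA"...
x = on_npa("1x1 convolution", lambda t: t @ w, x)
# ...while bias, activation and pooling stay on the "execution engine".
x = on_execution_engine("bias + ReLU", lambda t: np.maximum(t + b, 0.0), x)
x = on_execution_engine("2x2 max pool",
                        lambda t: t.reshape(16, 2, 16, 2, 16).max(axis=(1, 3)), x)
print(x.shape)  # (16, 16, 16)
```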
The machine learning task may be any suitable and desired machine learning task. For example, a task may involve general training or inference jobs. However, the machine learning process work itself may involve graphics processing operations. An example of this is ray trace denoising, as schematically shown in fig. 7 and 8.
Ray tracing is a known rendering process that involves tracing paths of rays from a point of view (sometimes referred to as a "camera") into a scene through sampling locations in an image plane, and simulating the interaction between the rays and objects in the scene. Output data values (e.g., sample points in an image) are determined based on objects in the scene that intersect the ray passing through the sample locations and the characteristics of the surfaces of those objects. Ray tracing calculations are complex and involve determining, for each sampling location, a set of objects within the scene that intersect the ray passing through the sampling location.
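As a minimal illustration of the per-sample work involved (deliberately ignoring acceleration structures, shading and secondary rays), the sketch below casts one primary ray per sampling location and tests it against a single sphere. All of the scene and camera parameters are assumptions made for the example.

```python
import numpy as np

def intersect_sphere(origin, direction, centre, radius):
    """Return the distance to the nearest hit with the sphere, or inf."""
    oc = origin - centre
    b = np.dot(oc, direction)
    c = np.dot(oc, oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return np.inf
    t = -b - np.sqrt(disc)
    return t if t > 0.0 else np.inf

W, H = 8, 6                                   # assumed image-plane resolution
camera = np.array([0.0, 0.0, 0.0])
sphere_centre = np.array([0.0, 0.0, -3.0])

for py in range(H):
    for px in range(W):
        # One ray from the viewpoint through this sampling location.
        d = np.array([(px + 0.5) / W - 0.5, (py + 0.5) / H - 0.5, -1.0])
        d /= np.linalg.norm(d)
        t = intersect_sphere(camera, d, sphere_centre, radius=1.0)
        print("hit " if np.isfinite(t) else "miss", end="")
    print()
```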
Thus, after ray tracing is performed using the first set of rays, the initial output frame may be relatively noisy. A neural network may therefore be trained to convert the noisy image into a smoother frame, e.g., for output. This process is shown in fig. 7. In fig. 7, denoising is performed by analyzing (only) the current frame. However, as shown in fig. 8, the denoising may instead analyze the current frame as well as one or more previous (noisy or denoised) frames.
Thus, as shown in FIG. 7, when performing the ray traced rendering process, the graphics processor is operable to generate an initial (noisy) output frame. This processing will typically be performed by, or at least managed by, the execution engine 65 in the normal manner for the ray tracing process. This will involve projecting a certain number of rays to generate an initial output frame (step 80). Because the ray tracing process may be computationally expensive, only a relatively small number of rays can be projected within the desired frame rate. Thus, this may result in a noisy image. It may therefore be desirable to perform "denoising" in an attempt to generate a better frame for output.
Once the initial (noisy) output frame is generated (step 82), then the execution engine 65 may message the machine learning processing circuit (NPA) 78 to perform the desired denoising operation (where the machine learning processing circuit (NPA) 78 itself performs denoising entirely or passes some of this work back to the execution engine 65, as described above) (step 84). Thus, the denoising process generates a final, smoother frame for output (step 86). The final frame may then be written out, e.g. to a frame buffer, ready for output, e.g. in the normal way.
The process in fig. 8 is similar, except that the denoising algorithm (step 84) additionally takes as input information about one or more previous frames. For example, a plurality of previous (denoised) frames may be accumulated in a suitable accumulation buffer and then used along with corresponding per-pixel motion vectors indicating relative movement between the frames as part of a denoising process to generate a final frame (step 86).
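The following sketch strings the steps of figs. 7 and 8 together as plain functions so the flow can be read in one place: render an initial noisy frame, hand it (together with an accumulation buffer and motion vectors) to a denoising step, and feed the denoised result back for the next frame. The function names and the simple temporal blend used in place of a real neural network are assumptions for the sketch only.

```python
import numpy as np

H, W = 120, 160  # assumed frame size

def render_noisy_frame(frame_index):
    """Steps 80/82: stand-in for a ray traced render with few rays per pixel."""
    return np.random.rand(H, W, 3).astype(np.float32)

def denoise(noisy, history, motion):
    """Step 84: stand-in for the neural network denoiser; here just a
    temporal blend of the current frame with the history buffer."""
    del motion  # a real denoiser would reproject history using the motion vectors
    return 0.2 * noisy + 0.8 * history

history = np.zeros((H, W, 3), dtype=np.float32)   # accumulation buffer
motion = np.zeros((H, W, 2), dtype=np.float32)    # per-pixel motion vectors

for frame in range(3):
    noisy = render_noisy_frame(frame)             # steps 80-82
    final = denoise(noisy, history, motion)       # step 84 (offloaded to the NPA)
    history = final                               # feed back, as in fig. 8
    print(f"frame {frame}: output mean = {final.mean():.3f}")  # step 86
```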
By offloading the denoising process at least partially to the machine learning processing circuit (NPA) 78, this means that the execution engine 65 is then free to continue the ray tracing process, for example by projecting additional rays or the like. That is, the machine learning processing circuit (NPA) 78 may perform the denoising process concurrently with other functional units performing graphics processing operations. Thus, this may provide a particularly efficient method for performing such machine learning processes within a graphics processing operation.
From the foregoing, it can be seen that the techniques described herein, in their embodiments, can provide at least a more efficient process for performing machine learning processing using a graphics processor. At least in embodiments of the technology described herein, this is accomplished by performing at least some processing operations for the machine learning processing task to be performed using dedicated machine learning processing circuitry within the graphics processor, but other processing for the task may also be performed in one embodiment by executing one or more appropriate shader programs using programmable execution units of the graphics processor.
The foregoing detailed description has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the technology to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. The described embodiments were chosen in order to best explain the principles of the technology and its practical application, to thereby enable others skilled in the art to best utilize the technology in various embodiments and with various modifications as are suited to the particular use contemplated. The scope of the invention is intended to be defined by the appended claims.

Claims (22)

1. A graphics processor, the graphics processor comprising:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform processing operations for machine learning processing tasks and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both.
2. The graphics processor of claim 1, configured such that, when the execution unit is executing a program comprising instructions related to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instructions, the programmable execution unit is caused to send a message to the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.
3. The graphics processor of claim 1 or 2, wherein when executing a machine learning processing task, the machine learning processing circuit is operable to cause the execution unit to perform one or more processing operations for the machine learning processing task being executed by the machine learning processing circuit.
4. The graphics processor of claim 3, wherein the machine learning processing circuit is operable to trigger generation of a thread for execution by the programmable execution unit to cause the execution unit to perform the one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.
5. The graphics processor of claim 2, wherein the machine learning processing circuit is configured to return results of its processing to the execution unit for further processing.
6. The graphics processor of claim 1 or 2, wherein the machine learning processing circuitry comprises one or more multiply and accumulate circuitry.
7. The graphics processor of claim 1 or 2, wherein the graphics processor includes a cache system for transferring data to and from an external memory, and wherein the machine learning processing circuit has access to the cache system of the graphics processor.
8. The graphics processor of claim 7, wherein when a machine learning processing task is to be performed using the graphics processor, the graphics processor is operable to fetch input data required for the machine learning processing task via the cache system and to write output of the machine learning processing task to memory via the cache system.
9. The graphics processor of claim 7, further comprising compression and decompression circuitry to compress and decompress data as it is transferred between the graphics processor and the external memory.
10. The graphics processor of claim 1 or 2, comprising a plurality of programmable execution units arranged as respective shader cores, wherein each shader core has its own respective machine learning processing circuitry, and wherein an overall job controller of the graphics processor is operable to allocate processing tasks among different shader cores.
11. The graphics processor of claim 1 or 2, wherein the graphics processor is configured to perform tile-based rendering, wherein graphics data is stored in one or more tile buffers, and wherein when performing machine learning processing tasks, the tile buffers are used to store at least some data for the machine learning processing tasks.
12. A method of operating a graphics processor, the graphics processor comprising:
a programmable execution unit operable to execute a program to perform graphics processing operations; and
a machine learning processing circuit operable to perform machine learning processing operations and in communication with the programmable execution unit within the graphics processor,
the graphics processor is configured to enable machine learning processing tasks to be performed by the programmable execution unit, the machine learning processing circuit, or a combination of both;
the method comprises the following steps:
the graphics processor performing a machine learning processing task using a combination of the programmable execution unit and the machine learning processing circuit.
13. The method of claim 12, the method further comprising: when the execution unit is executing a program comprising instructions related to a set of machine learning operations to be performed by the machine learning processing circuit: in response to the execution unit executing the instructions, the programmable execution unit sends a message to the machine learning processing circuit to cause the machine learning processing circuit to perform the set of machine learning processing operations.
14. The method of claim 12 or 13, the method further comprising: when executing a machine learning processing task, the machine learning processing circuit causes the execution unit to perform one or more processing operations for the machine learning processing task being executed by the machine learning processing circuit.
15. The method of claim 14, wherein the machine learning processing circuit causes the execution unit to perform one or more processing operations by triggering generation of an execution thread that, when executed by the execution unit, causes the execution unit to perform the one or more processing operations for the machine learning processing task being performed by the machine learning processing circuit.
16. A method according to claim 12 or 13, comprising the machine learning processing circuit returning the results of its processing to the execution unit for further processing.
17. The method of claim 12 or 13, wherein the machine learning processing task comprises one or more multiply and accumulate operations.
18. The method of claim 12 or 13, wherein the graphics processor includes a cache system for transferring data to and from an external memory, and wherein the machine learning processing circuit has access to the cache system of the graphics processor.
19. The method of claim 18, wherein when a machine learning processing task is to be performed using the graphics processor, the graphics processor fetches input data required for the machine learning processing task via the cache system and writes output of the machine learning processing task to memory via the cache system.
20. The method of claim 18, further comprising compressing data as it is written to memory and/or decompressing data as it is retrieved from memory.
21. A method according to claim 12 or 13, wherein the graphics processor comprises a plurality of programmable execution units arranged as respective shader cores, wherein each shader core has its own respective machine learning processing circuitry, and wherein an overall job controller of the graphics processor is operable to allocate processing tasks among different shader cores.
22. The method of claim 12 or 13, wherein the graphics processor is configured to perform tile-based rendering, wherein graphics data is stored in one or more tile buffers, and wherein the method comprises: the tile buffer is used to store at least some data for a machine learning processing task when the machine learning processing task is performed.
PB01 Publication