CN117971349A - Computing device, method of configuring virtual registers for a computing device, control device, computer-readable storage medium, and computer program product

Info

Publication number: CN117971349A
Application number: CN202410382891.5A
Authority: CN
Inventors: Name withheld on request
Original assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Priority date: 2024-03-29
Filing date: 2024-03-29
Publication date: 2024-05-03

Abstract

The present disclosure provides a computing device, a method of configuring virtual registers for a computing device, a control device, a computer-readable storage medium, and a computer program product. The computing device includes: a plurality of computing units, each configured to run one thread of a thread bundle; a plurality of thread local registers dedicated to each compute unit for registering data associated with threads operated by the compute unit; and a shared buffer for the plurality of computing units, wherein a portion of the shared buffer is configured as a virtual register for the plurality of computing units, and at least one of the plurality of thread local registers of each computing unit is configured as a staging register for threads operated by the computing unit to access the virtual register.

Description

Computing device, method of configuring virtual registers for a computing device, control device, computer-readable storage medium, and computer program product

Technical Field

The present disclosure relates generally to the field of processors, and more particularly, to a computing device, a method of configuring virtual registers for a computing device, a control device, a computer-readable storage medium, and a computer program product.

Background

Currently, with the development of artificial intelligence (AI) technology, AI chips are often required to process large amounts of data quickly, so it is desirable for data to be as close as possible to the computing unit that accesses it. Furthermore, accessing data in registers has a lower overhead than accessing caches or memory and may therefore improve operator performance, so operator development typically prioritizes the use of registers to store data. However, the number of local registers fixedly configured for a computing unit is often limited. When the number of registers an operator needs exceeds the number of configured hardware registers, the compiler stores part of the data in memory or a cache so that the operator can still execute correctly. Access to memory or a cache is costly, which causes a significant loss of operator performance.

Disclosure of Invention

In view of the foregoing, the present disclosure provides a scheme that configures virtual registers, for a plurality of computing units or for each computing unit, in a shared buffer used for data sharing among the plurality of computing units, so as to speed up data access by the computing units.

According to one aspect of the present disclosure, a computing device is provided. The computing device includes: a plurality of computing units, each configured to run one thread of a thread bundle; a plurality of thread local registers dedicated to each compute unit for registering data associated with threads operated by the compute unit; and a shared buffer for the plurality of computing units, wherein a portion of the shared buffer is configured as a virtual register for the plurality of computing units, and at least one of the plurality of thread local registers of each computing unit is configured as a staging register for threads operated by the computing unit to access the virtual register.

In some implementations, the virtual registers include one or more thread virtual registers that are respectively dedicated to threads of each computing unit, and each computing unit is configured to access the one or more thread virtual registers through a staging register of the computing unit when running the thread.

In some implementations, the computing unit is configured to write data to the staging register upon determining that the thread is to write data to the one or more thread virtual registers, and the staging register is configured to write the data to one or more thread virtual registers of the computing unit.

In some implementations, the computing unit is configured to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and the staging register is configured to read the data from the one or more thread virtual registers of the computing unit for reading by a thread run by the computing unit in response to the read request.

In some implementations, the virtual registers include one or more thread bundle virtual registers that are shared by the thread bundles of the plurality of computing units, and each computing unit is configured to access the one or more thread bundle virtual registers through the staging register of the computing unit when running the thread bundle.

In some implementations, upon determining that the thread bundle is to write data to the one or more thread bundle virtual registers, one of the plurality of computing units is configured to write the data to a staging register of the computing unit, and the staging register is configured to write the data to the one or more thread bundle virtual registers.

In some implementations, upon determining that the thread bundle is to read data from the thread bundle virtual register, each computing unit of the plurality of computing units sends a read request to a respective staging register, and each staging register is configured to read the data from the thread bundle virtual register for reading by a thread operated by the corresponding computing unit in response to the read request.

In some implementations, upon determining that the thread bundle is to read data from the thread bundle virtual register, each of the plurality of computing units sends a read request to a staging register of the computing unit, and the staging register of one of the plurality of computing units is configured to read the data from the thread bundle virtual register in response to the read request and broadcast the data to the plurality of computing units.

According to another aspect of the present disclosure, there is provided a method of configuring virtual registers for a computing device, wherein the computing device comprises: a plurality of computing units, each configured to run one thread of a thread bundle; a plurality of thread local registers dedicated to each compute unit for registering data associated with threads operated by the compute unit; and a shared buffer for the plurality of computing units. The method comprises the following steps: configuring a portion of the shared buffer as a virtual register for the plurality of computing units; and configuring at least one of the plurality of thread local registers of each compute unit as a staging register for threads operated by the compute unit to access the virtual register.

In some implementations, configuring a portion of the shared buffer as a virtual register for the plurality of computing units includes: one or more thread virtual registers are configured that are respectively dedicated to threads of each compute unit, and the method further comprises: each compute unit is configured to access the one or more thread virtual registers through the staging registers of the compute unit while running the thread.

In some implementations, configuring each computing unit to access the one or more thread virtual registers through the staging register of the computing unit while running the thread includes: the computing unit is configured to write data to the staging register when the thread is determined to be writing the data to the one or more thread virtual registers, and the staging register is configured to write the data to one or more thread virtual registers of the computing unit.

In some implementations, configuring each computing unit to access the one or more thread virtual registers through the staging register of the computing unit while running the thread includes: the computing unit is configured to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and the staging register is configured to read the data from the one or more thread virtual registers of the computing unit for reading by a thread run by the computing unit in response to the read request.

In some implementations, configuring a portion of the shared buffer as a virtual register for the plurality of computing units includes: configuring one or more thread bundle virtual registers shared by thread bundles of the plurality of compute units, and the method further comprises: configuring each compute unit to access the one or more thread bundle virtual registers through the staging register of the compute unit while running the thread bundle.

In some implementations, configuring each computing unit to access the one or more thread bundle virtual registers through the staging register of the computing unit while running the thread bundle includes: configuring one of the plurality of compute units to write data to the staging register of that compute unit upon determining that the thread bundle is to write the data to the one or more thread bundle virtual registers, and configuring the staging register to write the data to the one or more thread bundle virtual registers.

In some implementations, configuring each computing unit to access the one or more thread bundle virtual registers through the staging register of the computing unit while running the thread bundle includes: each of the plurality of compute units is configured to send a read request to a respective staging register upon determining that the thread bundle is to read data from the thread bundle virtual register, and each staging register is configured to read the data from the thread bundle virtual register for reading by a thread operated by the corresponding compute unit in response to the read request.

In some implementations, configuring each computing unit to access the one or more thread bundle virtual registers through the staging register of the computing unit while running the thread bundle includes: each of the plurality of compute units is configured to send a read request to a staging register of the compute unit upon determining that the thread bundle is to read data from the thread bundle virtual register, and the staging register of one of the plurality of compute units is configured to read the data from the thread bundle virtual register in response to the read request and broadcast to the plurality of compute units.

According to still another aspect of the present disclosure, there is provided a control device including: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, which, when executed by the at least one processor, cause the control device to perform the steps of the method as described above.

According to yet another aspect of the present disclosure, a computer readable storage medium is provided, having stored thereon computer program code which, when executed, performs the method as described above.

According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a machine, performs the method as described above.

Drawings

The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.

FIG. 1 illustrates a schematic diagram of a computing device.

FIG. 2 illustrates a schematic diagram of a computing device, according to an embodiment of the invention.

FIG. 3A illustrates a schematic diagram of a computing device for a thread operation according to an embodiment of the invention.

FIG. 3B illustrates a schematic diagram of a computing device for another thread operation according to an embodiment of the invention.

FIG. 4 sets forth an exemplary flow chart illustrating a method for configuring virtual registers for a computing device according to embodiments of the present invention.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "comprising" and variations thereof as used herein are open ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.

FIG. 1 shows a schematic diagram of a computing device 100. As shown in FIG. 1, computing device 100 may include multiple computing units 110 and multiple thread local registers 120 dedicated to each computing unit 110. Eight computing units 110-1, 110-2, ..., 110-8 are illustratively shown in FIG. 1, and four thread local registers 120-1, 120-2, ..., 120-4 are illustratively shown for each computing unit 110. Note that the number of compute units 110 and thread local registers 120 shown in FIG. 1 is merely exemplary and is not intended to limit the scope of the invention. Furthermore, the thread local registers 120 of the different computing units 110 have been given the same reference numerals for ease of description, but those skilled in the art will appreciate that they are in fact different physical registers. For example, thread local register 120-1 of compute unit 110-1 and thread local register 120-1 of compute unit 110-2 are different physical registers.

While computing device 100 is running a thread bundle (warp), each computing unit 110 may run one thread in the thread bundle. The plurality of thread local registers 120 of each compute unit 110 are used to register data associated with the thread running on that compute unit 110, such as data required by the thread or data generated by the thread. The plurality of thread local registers 120 are located near the computing unit 110, e.g., on the same chip module as the computing unit 110, and are directly accessible by threads running on the computing unit 110.

In one example, computing device 100 may be an AI chip, a GPU (graphics processing unit), or a general-purpose computing unit (CU) of a general-purpose GPU (GPGPU), and computing unit 110 may be an execution unit (EU) or the like included in the general-purpose computing unit CU. When a CU runs one thread bundle, each EU runs one thread of the thread bundle according to the task schedule.

In addition, the computing device 100 may also include a shared buffer 130 for the plurality of computing units 110, the shared buffer 130 being on-chip memory of the computing device 100 for exchanging data between threads of the plurality of computing units 110 of the computing device 100.

Currently, a thread running on a compute unit 110 can access only the thread local registers 120 of that compute unit 110. When the number of registers that a thread needs exceeds the number of thread local registers 120 of the compute unit 110, a register overflow results.

One solution is to save the overflowed data in off-chip memory of the computing device 100, such as thread local memory (TLM), to ensure that the thread executes properly. However, in this implementation, access to the off-chip memory incurs a large overhead, which causes a significant loss of computational performance.

To address the above-described problems, in the scheme of the present disclosure, a portion of the shared buffer 130 of the computing device is logically configured as space dedicated to a thread bundle or to each thread, to be used as virtual registers having functionality similar to that of the thread local registers 120.

FIG. 2 illustrates a schematic diagram of a computing device 200, according to an embodiment of the invention. The computing device 200 shown in fig. 2 differs from the computing device 100 shown in fig. 1 in that a portion of the shared buffer 130 is configured as a virtual register 132 for a plurality of computing units 110 of the computing device 200.

Since the thread running on a compute unit 110 can only directly access the thread local registers 120 of that compute unit 110, in order to access the virtual registers 132, at least one of the plurality of thread local registers 120 of the compute unit 110 is configured as a staging register for use by the thread running on that compute unit 110 to access the virtual registers 132. In this case, the staging register is used only for transferring data between the thread on the computing unit 110 and the virtual register 132 in the shared buffer 130, and is not used for registering the data of the computing unit 110 itself.

The last one or more thread local registers 120 of each compute unit 110 typically serve as the staging register; e.g., thread local register 120-4 may serve as the staging register of each compute unit 110 in FIG. 2. Note that one staging register is described herein as an example, but it will be understood by those skilled in the art that the number of staging registers of each computing unit 110 may be configured as one or more as desired. In addition, the staging register may be configured in a fixed manner in hardware, or may be configured flexibly as needed by software (e.g., software of a control device of the computing device 200).
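
The choice of staging registers described above can be illustrated with a small software model. This is a minimal sketch, not the disclosed hardware: the class `ComputeUnit`, the function `configure_staging`, and the rule that at least one ordinary register must remain are all assumptions introduced for illustration. The register count of four and the unit count of eight follow FIG. 1.

```python
from dataclasses import dataclass, field

NUM_LOCAL_REGS = 4  # thread local registers 120-1 .. 120-4, as drawn in FIG. 1


@dataclass
class ComputeUnit:
    """Hypothetical model of one compute unit 110 and its local registers."""
    local_regs: list = field(default_factory=lambda: [0] * NUM_LOCAL_REGS)
    # Default mirrors the fixed (hardware-style) choice: the last register.
    staging_indices: tuple = (NUM_LOCAL_REGS - 1,)

    def data_registers(self):
        """Indices of local registers still available for ordinary thread data."""
        return [i for i in range(len(self.local_regs))
                if i not in self.staging_indices]


def configure_staging(unit, count):
    """Software-style configuration: reserve the last `count` local registers.

    Keeping at least one ordinary register is an assumption of this sketch.
    """
    total = len(unit.local_regs)
    if not 1 <= count < total:
        raise ValueError("count must be between 1 and total - 1")
    unit.staging_indices = tuple(range(total - count, total))


units = [ComputeUnit() for _ in range(8)]   # compute units 110-1 .. 110-8
configure_staging(units[0], 2)              # two staging registers on one unit
print(units[0].data_registers())            # [0, 1]
print(units[1].data_registers())            # [0, 1, 2]
```

The model makes the trade-off visible: every local register reserved for staging is no longer available for ordinary thread data, in exchange for access to the virtual registers in the shared buffer.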

There may be two different types of thread bundle operations for the computing device 200. In one thread operation, each thread in the thread bundle performs the same operation for different data, in which case a different virtual register needs to be configured for each thread for the operation of the respective thread, as described below in connection with fig. 3A. In another thread operation, each thread in the thread bundle performs the same operation on the same data, in which case virtual registers may be uniformly configured for the entire thread bundle for operation of all threads, as described below in connection with FIG. 3B.

FIG. 3A shows a schematic diagram of a computing device 200 for a thread operation according to an embodiment of the invention. As shown in fig. 3A, virtual registers 132 include one or more thread virtual registers 1322 that are respectively dedicated to threads of each compute unit 110. In this case, each computing unit 110 is configured to access the one or more thread virtual registers 1322 via the staging register (e.g., thread local register 120-4) of the computing unit 110 while running the thread. Here, the number of thread virtual registers 1322 of each computing unit 110 may be implemented by hardware pre-configuration, or flexibly configured as needed by software (e.g., software of a control device of the computing device 200). In some examples, the number of thread virtual registers 1322 of each compute unit 110 may be set to 1/2 to 2 times the number of thread local registers 120 of that compute unit 110.
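
As a rough worked example of the sizing guideline above, the following sketch computes the suggested range of thread virtual registers per compute unit and the number of register-sized slots they would occupy in the shared buffer. The function names are hypothetical, rounding down for odd register counts is an assumption, and the figures of eight compute units and four local registers are taken from FIG. 1 purely for illustration.

```python
def thread_vreg_count_range(num_local_regs):
    """Range suggested in the text: 1/2 to 2 times the local register count."""
    return num_local_regs // 2, num_local_regs * 2


def shared_buffer_slots(num_units, vregs_per_thread):
    """Register-sized slots of the shared buffer taken by all thread virtual registers."""
    return num_units * vregs_per_thread


low, high = thread_vreg_count_range(4)   # 4 local registers per unit
print(low, high)                         # 2 8
print(shared_buffer_slots(8, low))       # 16
print(shared_buffer_slots(8, high))      # 64
```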

The accesses by the threads of the compute unit 110 to the thread virtual registers 1322 include write operations and read operations.

For a write operation to a thread virtual register 1322, when a computing unit 110 determines that its thread is to write data to its one or more thread virtual registers 1322, the computing unit 110 writes the data to a staging register (e.g., thread local register 120-4) of the computing unit 110, and the staging register writes the data to the thread virtual registers 1322 of the computing unit 110.

More specifically, for example, the computing unit 110 may sequentially write the generated data to its plurality of thread local registers 120 while running the thread, and when the data arrives at a staging register (e.g., thread local register 120-4) of the plurality of thread local registers 120, the staging register may directly write the written data to the thread virtual registers 1322 of the computing unit 110, and so on until all of the thread virtual registers 1322 are full.

For a read operation of a thread virtual register 1322, when a computing unit 110 determines that its thread is to read data from the thread virtual register 1322, the computing unit 110 sends a read request to its staging register (e.g., thread local register 120-4), and the staging register reads the data from the thread virtual register 1322 of the computing unit 110 in response to the read request, for reading by the thread run by the computing unit 110.

More specifically, for example, when running the thread, the computing unit 110 sequentially reads the data required by the thread from its plurality of thread local registers 120. When the read reaches a staging register (such as thread local register 120-4) among the plurality of thread local registers 120, the staging register can read data from the thread virtual register 1322 of the computing unit 110, and the computing unit 110 can then read the required data from the staging register. This behavior may be implemented, for example, by software code or hardware configuration.
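
The per-thread write and read paths can be sketched as a simple simulation. This is an illustrative model under stated assumptions, not the disclosed implementation: the classes `SharedBuffer` and `ComputeUnit`, the method names, and the contiguous per-unit slot layout are hypothetical. A single attribute `staging` stands in for thread local register 120-4.

```python
class SharedBuffer:
    """Hypothetical shared buffer 130 holding per-thread virtual registers 1322."""

    def __init__(self, num_units, vregs_per_thread):
        self.vregs_per_thread = vregs_per_thread
        # Assumed layout: one contiguous private slice per compute unit.
        self.slots = [0] * (num_units * vregs_per_thread)

    def slot_index(self, unit_id, vreg):
        if not 0 <= vreg < self.vregs_per_thread:
            raise IndexError("virtual register index out of range")
        return unit_id * self.vregs_per_thread + vreg


class ComputeUnit:
    """Hypothetical compute unit 110 that reaches the buffer only via staging."""

    def __init__(self, unit_id, buffer):
        self.unit_id = unit_id
        self.buffer = buffer
        self.staging = 0  # stands in for thread local register 120-4

    def write_vreg(self, vreg, value):
        # The thread writes to the staging register ...
        self.staging = value
        # ... and the staging register forwards the data to the virtual register.
        self.buffer.slots[self.buffer.slot_index(self.unit_id, vreg)] = self.staging

    def read_vreg(self, vreg):
        # The staging register fetches the data in response to the read request ...
        self.staging = self.buffer.slots[self.buffer.slot_index(self.unit_id, vreg)]
        # ... and the thread reads it from the staging register.
        return self.staging


buffer = SharedBuffer(num_units=8, vregs_per_thread=4)
units = [ComputeUnit(i, buffer) for i in range(8)]
for unit in units:
    unit.write_vreg(0, unit.unit_id * 10)   # same operation, different data
print(units[3].read_vreg(0))                # 30
```

Because each unit indexes only its own slice, a thread never observes another thread's virtual registers, which matches the dedicated-per-thread arrangement of FIG. 3A.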

FIG. 3B illustrates a schematic diagram of a computing device 200 for another thread operation according to an embodiment of the present invention. As shown in fig. 3B, virtual registers 132 include one or more thread bundle virtual registers 1324 that are shared by the thread bundles of multiple compute units 110. In this case, each computing unit 110 is configured to access the one or more thread bundle virtual registers 1324 via the staging register (e.g., thread local register 120-4) of the computing unit 110 while running the thread bundle. Similarly, the number of thread bundle virtual registers 1324 may also be implemented by hardware pre-configuration, or flexibly configured as desired by software (e.g., software of a control device of computing device 200). In some examples, the number of thread bundle virtual registers 1324 may be set to tens of times the number of thread local registers 120 per compute unit 110.

Similarly, accesses to the thread bundle virtual register 1324 by the thread bundles run by the plurality of computing units 110 of the computing device 200 include write operations and read operations.

For a write operation of the thread bundle virtual register 1324, upon determining that a thread bundle run by the plurality of computing units 110 is to write data to one or more thread bundle virtual registers 1324, one computing unit 110 of the plurality of computing units 110 (e.g., computing unit 110-1) writes the data to a staging register (e.g., thread local register 120-4) of the computing unit 110, and the staging register writes the data to the thread bundle virtual register 1324.

More specifically, for example, while the plurality of computing units 110 are running the thread bundle, each computing unit 110 may sequentially write the generated data to its plurality of thread local registers 120, and when the data arrives at a staging register (e.g., thread local register 120-4) in the plurality of thread local registers 120, the staging register may directly write the written data to the thread bundle virtual register 1324 shared by the plurality of computing units 110, and so on until all of the thread bundle virtual registers 1324 are written. Here, since the threads of the plurality of computing units 110 perform the same operation on the same data, the data generated at each computing unit 110 is the same. In this case, the write to the thread bundle virtual register 1324 may be performed only by a designated computing unit 110, or by each of the plurality of computing units 110 separately. In the latter case, data written by a later computing unit overwrites the identical data written earlier.
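
The two write options for a thread bundle virtual register can be compared in a short simulation. This is a sketch under assumptions: `WarpVirtualRegisters`, `warp_write`, and the `write_count` counter are hypothetical names added only to make the difference observable, and the value 42 is arbitrary.

```python
class WarpVirtualRegisters:
    """Hypothetical thread bundle virtual registers 1324 with a write counter."""

    def __init__(self, count):
        self.regs = [0] * count
        self.write_count = 0

    def store(self, index, value):
        self.regs[index] = value
        self.write_count += 1


def warp_write(vregs, index, staged_values, designated=None):
    """Forward staged data to one shared virtual register.

    staged_values[i] is the value held in the staging register of unit i.
    If `designated` is given, only that unit writes; otherwise every unit
    writes in turn and later writes overwrite earlier ones.
    """
    writers = range(len(staged_values)) if designated is None else [designated]
    for unit_id in writers:
        vregs.store(index, staged_values[unit_id])


staged = [42] * 8   # same operation on the same data, so every unit holds 42

single = WarpVirtualRegisters(4)
warp_write(single, 0, staged, designated=0)

everyone = WarpVirtualRegisters(4)
warp_write(everyone, 0, staged)

print(single.regs[0], single.write_count)       # 42 1
print(everyone.regs[0], everyone.write_count)   # 42 8
```

Both options leave the same value in the shared register; in this model the designated-writer option simply issues fewer stores to the shared buffer.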

For a read operation of the thread bundle virtual register 1324, upon determining that a thread bundle being run by the plurality of computing units 110 is to read data from the thread bundle virtual register 1324, each computing unit 110 of the plurality of computing units 110 respectively sends a read request to a respective staging register (e.g., thread local register 120-4), and each staging register reads the data from the thread bundle virtual register 1324 in response to the read request for reading by a thread being run by the corresponding computing unit 110.

More specifically, for example, while running the thread bundle, each computing unit 110 sequentially reads the data needed by the thread bundle from its plurality of thread local registers 120. When the read reaches a staging register (e.g., thread local register 120-4) of the plurality of thread local registers 120, the staging register can read data from the thread bundle virtual register 1324, and the computing unit 110 can read the needed data from its staging register.

Alternatively, in the case where the shared buffer 130 has a broadcast function, upon determining that a thread bundle run by a plurality of computing units 110 is to read data from the thread bundle virtual register 1324, a read request may be sent by each computing unit 110 to the corresponding staging register, and the data may be read from the thread bundle virtual register 1324 by the staging register of only one computing unit 110 and delivered to the plurality of computing units 110 by a broadcast operation.
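
The difference between per-unit reads and a single read followed by a broadcast can be illustrated as follows. This is a minimal model, assuming that the cost of interest is the number of accesses to the shared buffer; `WarpVirtualRegisters`, `read_individually`, `read_with_broadcast`, and the `read_count` counter are hypothetical names, and the stored values are arbitrary.

```python
class WarpVirtualRegisters:
    """Hypothetical thread bundle virtual registers 1324 with a read counter."""

    def __init__(self, values):
        self.regs = list(values)
        self.read_count = 0

    def load(self, index):
        self.read_count += 1
        return self.regs[index]


def read_individually(vregs, index, num_units):
    """Every staging register fetches the value from the shared buffer itself."""
    return [vregs.load(index) for _ in range(num_units)]


def read_with_broadcast(vregs, index, num_units):
    """One staging register fetches the value, which is then broadcast."""
    value = vregs.load(index)
    return [value] * num_units


a = WarpVirtualRegisters([7, 8, 9])
b = WarpVirtualRegisters([7, 8, 9])
staged_a = read_individually(a, 1, 8)
staged_b = read_with_broadcast(b, 1, 8)
print(staged_a == staged_b, a.read_count, b.read_count)   # True 8 1
```

Each list returned here represents the contents of the eight staging registers after the read; both paths deliver the same data to every compute unit.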

Where the staging register is configured by software, one or more thread local registers 120 of the computing unit 110 may be configured as staging registers according to the currently running thread bundle. For example, if the currently running thread bundle reads and writes data at a high rate, a plurality of staging registers (e.g., two) may be configured so that two registers can be read or written at a time.

Further, in the case where the thread virtual registers 1322 are configured by software, a desired number of thread virtual registers 1322 may be configured in the shared buffer 130 by software instructions prior to execution of the thread bundle, depending on the thread bundle to be executed.

Similarly, where the thread bundle virtual registers 1324 are configured by software, a desired number of thread bundle virtual registers 1324 may be configured in the shared buffer 130 by software instructions prior to execution of the thread bundle, depending on the thread bundle to be executed.

Note that while thread virtual registers 1322 and thread bundle virtual registers 1324 are shown and described above with respect to FIG. 3A and FIG. 3B, respectively, in some embodiments the shared buffer 130 may include both thread virtual registers 1322 dedicated to each computing unit 110 and thread bundle virtual registers 1324 shared by multiple computing units 110.

FIG. 4 illustrates an exemplary flow chart of a method 400 for configuring virtual registers 132 for a computing device 200 according to an embodiment of the invention. Computing device 200 is, for example, as shown above in connection with FIG. 2, FIG. 3A, and FIG. 3B. The method 400 may be performed by code fixed in the computing device 200, or may be performed in software under the control of a control device of the computing device 200 (e.g., in the case where the computing device 200 is a general-purpose computing unit CU in a GPGPU, the control device may be a central processing unit CPU for controlling the computing device 200).

As shown in fig. 4, method 400 includes block 410 in which a portion of shared buffer 130 is configured as virtual registers 132 for multiple computing units 110.

At block 420, at least one thread local register 120 of the plurality of thread local registers 120 of each compute unit 110 may be configured as a staging register for use by a thread running on the compute unit 110 to access the virtual registers 132.
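
Blocks 410 and 420 can be sketched as a single configuration routine of the kind a control device might run in software. This is an illustrative sketch only: `VirtualRegisterConfig`, `configure_virtual_registers`, and their parameters are hypothetical, sizes are counted in register-sized slots, and placing the virtual register region at the top of the shared buffer is an arbitrary assumption rather than something the disclosure specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualRegisterConfig:
    """Hypothetical result of method 400."""
    vreg_base: int           # first shared-buffer slot used for virtual registers
    vreg_slots: int          # number of slots configured as virtual registers
    staging_indices: tuple   # local register indices reserved as staging registers


def configure_virtual_registers(buffer_slots, vreg_slots,
                                num_local_regs, num_staging=1):
    if not 0 < vreg_slots <= buffer_slots:
        raise ValueError("virtual registers must fit in the shared buffer")
    if not 1 <= num_staging < num_local_regs:
        raise ValueError("need at least one staging and one ordinary register")

    # Block 410: configure a portion of the shared buffer as virtual registers.
    # Assumption: the region sits at the top of the buffer, and the rest of
    # the buffer keeps its ordinary data-sharing role.
    vreg_base = buffer_slots - vreg_slots

    # Block 420: configure the last local register(s) as staging registers.
    staging = tuple(range(num_local_regs - num_staging, num_local_regs))

    return VirtualRegisterConfig(vreg_base, vreg_slots, staging)


cfg = configure_virtual_registers(buffer_slots=1024, vreg_slots=64,
                                  num_local_regs=4)
print(cfg.vreg_base, cfg.vreg_slots, cfg.staging_indices)   # 960 64 (3,)
```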

In some embodiments, block 410 may include: one or more thread virtual registers 1322 are configured that are respectively dedicated to the threads of each compute unit 110. In this case, the method 400 further includes: each computing unit 110 is configured to access the one or more thread virtual registers 1322 through the staging registers of the computing unit 110 while running the thread.

Specifically, for a write operation of a thread virtual register 1322, the compute unit 110 may be configured to, upon determining that the thread is to write data to one or more thread virtual registers 1322, write the data to the staging register, and configure the staging register to write the data to one or more thread virtual registers 1322 of the compute unit 110.

Specifically, for a read operation of a thread virtual register 1322, the compute unit 110 may be configured to send a read request to the staging register upon determining that the thread is to read data from one or more thread virtual registers 1322, and the staging register may be configured to read the data from the one or more thread virtual registers 1322 of the compute unit in response to the read request, for reading by the thread run by the compute unit 110.

In some embodiments, block 410 may further comprise: one or more thread bundle virtual registers 1324 are configured that are shared by the thread bundles of the plurality of compute units 110. In this case, the method 400 further includes: each computing unit 110 is configured to access the one or more thread bundle virtual registers 1324 through the staging registers of the computing unit 110 while running the thread bundle.

Specifically, for a write operation of a thread bundle virtual register 1324, one of the plurality of computing units 110 may be configured to, upon determining that the thread bundle is to write data to one or more thread bundle virtual registers 1324, write the data to a staging register of the computing unit 110, and configure the staging register to write the data to one or more thread bundle virtual registers 1324.

Specifically, for a read operation of the thread bundle virtual register 1324, each compute unit 110 of the plurality of compute units 110 may be configured to send a read request to its respective staging register upon determining that the thread bundle is to read data from the thread bundle virtual register 1324, and each staging register may be configured to read the data from the thread bundle virtual register 1324 in response to the read request, for reading by the thread run by the corresponding compute unit 110.

Alternatively, for a read operation of the thread bundle virtual register 1324, each of the plurality of compute units 110 may be configured to send a read request to the staging register of that compute unit 110 upon determining that the thread bundle is to read data from the thread bundle virtual register 1324, and the staging register of only one of the compute units 110 may be configured to read the data from the thread bundle virtual register 1324 in response to the read request and broadcast the data to the plurality of compute units 110.

With the scheme of the present disclosure, a portion of the shared buffer of the computing device is logically configured as virtual registers of the thread bundle or of each thread, and one or more of the existing thread local registers are configured as staging registers that transfer data between the thread or thread bundle and the virtual registers. This expands the number of registers available to the computing device when running the thread bundle, so that the performance degradation caused by register overflow is avoided, and the larger number of registers also gives the thread bundle running on the computing device more room to operate.
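The effect on register pressure can be sketched as a simple placement function. This is a hypothetical allocation policy for illustration only: the function `place_registers`, its parameters, and the rule that staging registers are set aside before ordinary values are placed are assumptions, not details given in the disclosure.

```python
# Illustrative placement policy only; names and policy are hypothetical.
def place_registers(num_needed, num_local, num_virtual, num_staging=1):
    """Assign each logical register to 'local', 'virtual', or 'spill'."""
    # Assumption: staging registers are taken out of the thread local registers.
    usable_local = num_local - num_staging
    placement = []
    for r in range(num_needed):
        if r < usable_local:
            placement.append("local")      # fits in a thread local register
        elif r < usable_local + num_virtual:
            placement.append("virtual")    # held in the shared buffer
        else:
            placement.append("spill")      # would fall back to cache or memory
    return placement
```

For example, with four thread local registers and no virtual registers, a fifth value has to spill, whereas configuring two virtual registers (at the cost of one staging register) lets five values avoid spilling in this model.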

Those skilled in the art will appreciate that the computing device 200 shown in the figures is merely illustrative and may contain more or fewer components.

The computing device 200 and the method 400 of configuring virtual registers according to the present disclosure are described above in connection with the accompanying figures. Those skilled in the art will appreciate that the computing device 200 need not include all of the components shown in the figures and may include only some of them, or additional components, as necessary to perform the functions described in this disclosure; that the manner of connection of such components is not limited to the form shown in the figures; and that the method 400 may include further steps not shown in the figures.

The present disclosure may be embodied as methods, computing devices, control devices for such computing devices, computer-readable storage media, and/or computer program products. The computer-readable storage medium has stored thereon computer program code which, when executed, carries out the methods of the present disclosure. The computer program product comprises a computer program which, when executed, performs the methods of the present disclosure. The computing device and/or control device may include at least one processor and at least one memory coupled to the at least one processor, and the memory may store instructions for execution by the at least one processor. When the instructions are executed by the at least one processor, the computing device and/or control device may perform the methods described above.

In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.

The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A computing device, comprising:

a plurality of computing units, each configured to run one thread of a thread bundle;

a plurality of thread local registers dedicated to each computing unit for registering data associated with threads run by the computing unit; and

a shared buffer for the plurality of computing units,

wherein a portion of the shared buffer is configured as a virtual register for the plurality of computing units, and at least one of the plurality of thread local registers of each computing unit is configured as a staging register for threads run by the computing unit to access the virtual register.

2. The computing device of claim 1, wherein the virtual registers comprise one or more thread virtual registers that are respectively dedicated to threads of each computing unit, and each computing unit is configured to access the one or more thread virtual registers through a staging register of the computing unit when running the thread.

3. The computing device of claim 2, wherein

the computing unit is configured to, upon determining that the thread is to write data to the one or more thread virtual registers, write the data to the staging register, and

the staging register is configured to write the data to the one or more thread virtual registers of the computing unit.

4. The computing device of claim 2, wherein

the computing unit is configured to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and

the staging register is configured to read the data from the one or more thread virtual registers of the computing unit in response to the read request, for reading by the thread run by the computing unit.

5. The computing device of claim 1, wherein the virtual registers comprise one or more thread bundle virtual registers that are shared by thread bundles of the plurality of computing units, and each computing unit is configured to access the one or more thread bundle virtual registers through a staging register of the computing unit when running the thread bundle.

6. The computing device of claim 5, wherein

upon determining that the thread bundle is to write data to the one or more thread bundle virtual registers, one of the plurality of computing units is configured to write the data to a staging register of the computing unit, and

the staging register is configured to write the data to the one or more thread bundle virtual registers.

7. The computing device of claim 5, wherein

upon determining that the thread bundle is to read data from the thread bundle virtual register, each of the plurality of computing units sends a read request to a respective staging register, and

each staging register is configured to read the data from the thread bundle virtual register in response to the read request, for reading by the thread run by the corresponding computing unit.

8. The computing device of claim 5, wherein

upon determining that the thread bundle is to read data from the thread bundle virtual register, each of the plurality of computing units sends a read request to a staging register of the computing unit, and

the staging register of one of the plurality of computing units is configured to read the data from the thread bundle virtual register in response to the read request and broadcast the data to the plurality of computing units.

9. A method of configuring virtual registers for a computing device, wherein the computing device comprises:

a plurality of computing units, each configured to run one thread of a thread bundle;

a plurality of thread local registers dedicated to each computing unit for registering data associated with threads run by the computing unit; and

a shared buffer for the plurality of computing units,

the method comprising:

configuring a portion of the shared buffer as a virtual register for the plurality of computing units; and

configuring at least one of the plurality of thread local registers of each computing unit as a staging register for threads run by the computing unit to access the virtual register.

10. The method of claim 9, wherein configuring a portion of the shared buffer as a virtual register for the plurality of computing units comprises:

configuring one or more thread virtual registers that are respectively dedicated to threads of each computing unit, and the method further comprises:

configuring each computing unit to access the one or more thread virtual registers through the staging register of the computing unit while running the thread.

11. The method of claim 10, wherein configuring each computing unit to access the one or more thread virtual registers through a staging register of the computing unit while running the thread comprises:

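configuring the computing unit to write data to the staging register upon determining that the thread is to write the data to the one or more thread virtual registers, and

configuring the staging register to write the data to the one or more thread virtual registers of the computing unit.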

12. The method of claim 10, wherein configuring each computing unit to access the one or more thread virtual registers through a staging register of the computing unit while running the thread comprises:

configuring the computing unit to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and

configuring the staging register to read the data from the one or more thread virtual registers of the computing unit in response to the read request, for reading by the thread run by the computing unit.

13. The method of claim 9, wherein configuring a portion of the shared buffer as a virtual register for the plurality of computing units comprises:

configuring one or more thread bundle virtual registers that are shared by thread bundles of the plurality of computing units, and the method further comprises:

configuring each computing unit to access the one or more thread bundle virtual registers through the staging register of the computing unit while running the thread bundle.

14. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

configuring one of the plurality of computing units to write data to the staging register of the computing unit upon determining that the thread bundle is to write the data to the one or more thread bundle virtual registers, and

configuring the staging register to write the data to the one or more thread bundle virtual registers.

15. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

configuring each of the plurality of computing units to send a read request to a respective staging register upon determining that the thread bundle is to read data from the thread bundle virtual register, and

configuring each staging register to read the data from the thread bundle virtual register in response to the read request, for reading by the thread run by the corresponding computing unit.

16. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

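configuring each of the plurality of computing units to send a read request to a staging register of the computing unit upon determining that the thread bundle is to read data from the thread bundle virtual register, and

configuring the staging register of one of the plurality of computing units to read the data from the thread bundle virtual register in response to the read request and broadcast the data to the plurality of computing units.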

17. A control device, comprising:

at least one processor; and

at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, wherein the instructions, when executed by the at least one processor, cause the control device to perform the steps of the method according to any one of claims 9 to 16.

18. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 9 to 16.

19. A computer program product comprising a computer program which, when executed by a machine, performs the method of any of claims 9 to 16.