CN116069392A - Computing device, operating method of computing device, electronic apparatus, and storage medium - Google Patents
- Publication number
- CN116069392A (application number CN202310071798.8A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- descriptor
- unbound
- chip address
- register
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30101—Special purpose registers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/06—Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
Abstract
The present disclosure provides a computing device, an operating method of the computing device, an electronic apparatus, and a computer-readable storage medium. The computing device includes a plurality of execution units, each comprising instruction scheduler circuitry and execution unit circuitry. The instruction scheduler circuitry comprises a descriptor on-chip address register for holding an on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction. The instruction scheduler circuitry is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send that on-chip address to the execution unit circuitry to execute the store access instruction.
Description
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a computing device, a method of operating a computing device, an electronic apparatus, and a computer-readable storage medium.
Background
Currently, some processors, such as graphics processing units (GPUs), must first bind a resource to a pipeline before performing read and write operations on the resource. Such binding is typically indirect, e.g., via a resource descriptor, which describes information about the memory to be accessed, such as its data format and access address. A resource descriptor may be bound or unbound. A bound resource descriptor describes one of a pre-allocated set of memory resources that all instructions in the entire program must use for memory access. In this case, the number of resource descriptors available to one program is limited, and the types of resources it can access are therefore limited as well. To address both the overhead of binding resources to a pipeline and the limit on the number of bound resource descriptors, unbound resource descriptors have been proposed, which are indexed by virtual addresses. With an unbound resource descriptor, each instruction (which carries the virtual address of the descriptor) must interact with the private cache to obtain the resource pointed to by the descriptor before it can correctly read and write its data. Program execution using such unbound resource descriptors is therefore substantially delayed by this lookup process, severely impacting processor speed.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a computing device, and an operating method thereof, that prestore, in the instruction scheduler circuitry of each execution unit, the on-chip address of an unbound resource descriptor acquired by a pre-allocation instruction, so that the instruction scheduler circuitry can read the on-chip address directly when it receives a store access instruction.
According to one aspect of the present disclosure, a computing device is provided. The computing device includes a plurality of execution units, each comprising instruction scheduler circuitry and execution unit circuitry. The instruction scheduler circuitry comprises a descriptor on-chip address register for holding an on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction. The instruction scheduler circuitry is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send the on-chip address of the corresponding unbound resource descriptor to the execution unit circuitry to execute the store access instruction.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein the instruction scheduler circuitry is configured to: when the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sample descriptor of the texture instruction, and wherein the instruction scheduler circuitry is configured to: when the store access instruction is the texture instruction, read the on-chip address of the texture descriptor directly from the first register and the on-chip address of the sample descriptor directly from the second register.
In some implementations, the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
In some implementations, the computing device further includes an unbound descriptor buffer for storing a plurality of unbound resource descriptors, wherein the descriptor on-chip address register obtains the on-chip address of the at least one unbound resource descriptor from the unbound descriptor buffer via the pre-allocation instruction, and a load store buffer for obtaining the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address of the unbound resource descriptor corresponding to the store access instruction, as provided by the execution unit circuitry.
According to another aspect of the present disclosure, a method of operating a computing device is provided. The computing device includes a plurality of execution units, each of the execution units including instruction scheduler circuitry and execution unit circuitry, wherein the instruction scheduler circuitry includes a descriptor on-chip address register. The method comprises the following steps: acquiring an on-chip address of at least one unbound resource descriptor through a pre-allocation instruction; saving the on-chip address of the at least one unbound resource descriptor in the descriptor on-chip address register; when the instruction scheduler circuitry receives a store access instruction, directly reading the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register; and the instruction scheduler circuitry sending the on-chip address of the corresponding unbound resource descriptor to the execution unit circuitry to execute the store access instruction.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein when the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sample descriptor of the texture instruction, and wherein when the store access instruction is the texture instruction, the on-chip address of the texture descriptor is read directly from the first register and the on-chip address of the sample descriptor is read directly from the second register.
In some implementations, the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
In some implementations, the computing device further includes an unbound descriptor buffer and a load store buffer, and the method further comprises: acquiring the on-chip address of the at least one unbound resource descriptor, through the pre-allocation instruction, from among a plurality of unbound resource descriptors stored in the unbound descriptor buffer; and the load store buffer obtaining the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address of the unbound resource descriptor corresponding to the store access instruction, as provided by the execution unit circuitry.
According to still another aspect of the present disclosure, there is provided an electronic device including: a memory that non-transitorily stores computer-executable instructions; and a processor configured to execute the computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, implement the method described above.
According to yet another aspect of the present disclosure, a computer readable storage medium is provided, having stored thereon computer program code which, when executed, performs the method as described above.
Drawings
The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.
Fig. 1 shows a schematic diagram of a computing device.
Fig. 2 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Fig. 3 shows a schematic flow chart of a method of operation of a computing device according to an embodiment of the disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
Fig. 1 shows a schematic diagram of a computing device 100. The computing device 100 may be, for example, a compute unit (CU). The computing device 100 may include a plurality of execution units (EUs) 110; four execution units 110 are schematically illustrated in Fig. 1. Each execution unit 110 has associated instruction scheduler (IS) circuitry 112 and execution unit (EU) circuitry 114. Furthermore, the computing device 100 includes an unbound descriptor buffer 120 for storing unbound resource descriptors and their on-chip addresses. Here, an on-chip address refers to the on-chip physical address of an unbound resource descriptor. Note that Fig. 1 only schematically shows the connection of the unbound descriptor buffer 120 to one instruction scheduler circuit 112; in fact, each instruction scheduler circuit 112 has a similar connection to the unbound descriptor buffer 120.
The computing device 100 may also be connected, via a memory interface 20, to external memory 30, such as DDR SDRAM (double data rate synchronous dynamic random access memory) or HBM (high bandwidth memory), in order to store data generated by the computing device 100 to the external memory 30 or to read desired data from the external memory 30, as needed.
In the computing device 100 shown in Fig. 1, each instruction scheduler circuit 112 receives execution instructions to be executed by the corresponding execution unit 110, for example from a previous-stage scheduler or instruction queue (e.g., a scheduler or instruction queue, not shown, of the execution unit 110 in which the instruction scheduler circuit 112 is located). An execution instruction may be, for example, a store access instruction and includes at least an instruction type (e.g., load instruction, store instruction, atomic operation instruction, texture instruction, etc.) and address information of the unbound resource descriptor corresponding to the instruction. Here, the address information of the unbound resource descriptor is the program-visible virtual address of that descriptor. The address information may be represented, for example, as the starting address of the descriptor group to which the unbound resource descriptor belongs plus an address offset of the descriptor within that group (e.g., a 32-bit offset). Alternatively, the address information may directly be the virtual address of the unbound resource descriptor (e.g., a 64-bit address). In addition, the execution instruction may also include a destination address indicating where the result of the instruction's operation is to be stored.
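As an illustrative aside (not part of the patent text), the two representations of a descriptor's address information described above can be sketched as follows; the function name and the example addresses are assumptions:

```python
# Hypothetical sketch of the two address-information forms described above:
# (1) descriptor-group start address plus a 32-bit offset, or
# (2) the full 64-bit virtual address of the unbound resource descriptor.

def descriptor_virtual_address(group_base: int, offset32: int) -> int:
    """Resolve a (group base, 32-bit offset) pair to a virtual address."""
    assert 0 <= offset32 < 2**32, "offset must fit in 32 bits"
    return group_base + offset32

# Form (1): base of the descriptor group + offset within the group.
va_from_pair = descriptor_virtual_address(0x7F0000000000, 0x40)

# Form (2): the 64-bit virtual address carried directly in the instruction.
va_direct = 0x7F0000000040

assert va_from_pair == va_direct
```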
When the instruction scheduler circuitry 112 receives an execution instruction, it first sends address information of the unbound resource descriptor contained therein to the unbound descriptor buffer 120 to check whether the unbound resource descriptor is present in the unbound descriptor buffer 120. For example, the unbound descriptor buffer 120 may check whether an on-chip address corresponding to the address information is stored therein.
If the unbound descriptor buffer 120 determines that the on-chip address corresponding to the address information (i.e., the on-chip physical address of the unbound resource descriptor corresponding to the address information) is stored therein, the on-chip address is returned to the instruction scheduler circuitry 112. The instruction scheduler circuitry 112 may send a store access instruction and the on-chip address to the corresponding execution unit circuitry 114 to instruct the execution unit circuitry 114 to execute the store access instruction according to the on-chip address.
On the other hand, if the unbound descriptor buffer 120 determines that the on-chip address corresponding to the address information is not stored therein, the unbound resource descriptor indicated by the address information is not in the unbound descriptor buffer 120 but in the external memory 30. In this case, the unbound descriptor buffer 120 may allocate an on-chip address for the unbound resource descriptor, issue a request to the external memory 30 via the memory interface 20 to read the unbound resource descriptor corresponding to the address information into the unbound descriptor buffer 120, and send the allocated on-chip address to the instruction scheduler circuitry 112. The instruction scheduler circuitry 112 may then send the store access instruction and the on-chip address to the corresponding execution unit circuitry 114 to instruct it to execute the store access instruction according to the on-chip address. Specifically, when executing the store access instruction, the execution unit circuitry 114 may send the on-chip address of the unbound resource descriptor to the load store buffer 130, instructing the load store buffer 130 to obtain the descriptor corresponding to that on-chip address from the unbound descriptor buffer 120. The unbound resource descriptor contains the resource/data-related information of the current instruction that directs the execution of the store access instruction.
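The hit/miss behavior of the unbound descriptor buffer 120 described above can be summarized in a short behavioral model; all names and the example descriptor contents here are assumptions for illustration, not the hardware implementation:

```python
# Behavioral model (assumed names) of the unbound descriptor buffer:
# a hit returns the stored on-chip address; a miss allocates an on-chip
# address, fetches the descriptor from external memory, and caches it.

class UnboundDescriptorBuffer:
    def __init__(self):
        self.va_to_onchip = {}   # virtual address -> on-chip address
        self.entries = {}        # on-chip address -> descriptor contents
        self.next_slot = 0

    def lookup(self, va, external_memory):
        if va in self.va_to_onchip:              # hit
            return self.va_to_onchip[va], True
        on_chip = self.next_slot                 # miss: allocate a slot
        self.next_slot += 1
        self.entries[on_chip] = external_memory[va]  # read from DDR/HBM
        self.va_to_onchip[va] = on_chip
        return on_chip, False

external = {0x1000: {"format": "R8G8B8A8", "base_address": 0x80000}}
buf = UnboundDescriptorBuffer()
addr1, hit1 = buf.lookup(0x1000, external)  # first access: miss
addr2, hit2 = buf.lookup(0x1000, external)  # second access: hit, same slot
assert (hit1, hit2, addr1) == (False, True, addr2)
```

The second lookup returning a hit is exactly the case the disclosure targets: the round trip still happens per instruction even when the answer is already known.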
It can be seen that in the above procedure, the instruction scheduler circuitry 112 must send the address information of every unbound resource descriptor to the unbound descriptor buffer 120 to check its location, and must wait for the unbound descriptor buffer 120 to return the descriptor's on-chip address. This makes delays in the pipelining of multiple instructions unavoidable. Moreover, the instruction scheduler circuitry 112 must repeat this check-and-return round trip for every instruction, even when the unbound resource descriptor used by an instruction has already been checked, and its on-chip address already returned, for a previous instruction, which makes instruction execution inefficient.
To address this, the scheme of the present disclosure provides a dedicated descriptor on-chip address register in the instruction scheduler circuitry and adds a pre-allocation instruction before the store access instruction. The pre-allocation instruction is executed ahead of the store access instruction: it pre-allocates the on-chip address of the unbound resource descriptor for the subsequent store access instruction, causes the unbound resource descriptor to be read to the load store buffer, and saves the descriptor's on-chip address in the descriptor on-chip address register. When a subsequent store access instruction needs that unbound resource descriptor, the instruction scheduler circuitry can read the pre-allocated on-chip address directly from the descriptor on-chip address register and send it to the execution unit circuitry to execute the store access instruction.
Fig. 2 shows a schematic structural diagram of a computing device 200 according to an embodiment of the disclosure. Similar to the computing device 100 shown in fig. 1, the computing device 200 may be, for example, a Computing Unit (CU). The computing device 200 may include a plurality of execution units 210, four execution units 210 being schematically illustrated in fig. 2. Each execution unit 210 has associated instruction scheduler circuitry 212 and execution unit circuitry 214.
Unlike the computing device 100 shown in FIG. 1, the computing device 200 includes one or more descriptor on-chip address registers 240 in the instruction scheduler circuitry 212 for storing on-chip addresses of unbound resource descriptors obtained from the unbound descriptor buffer 220 by pre-allocation instructions. Here, the on-chip address refers to an on-chip physical address of the unbound resource descriptor.
As described above, the execution instructions executed by the execution unit 210 may be, for example, store access instructions and the instruction types may include load instructions, store instructions, atomic instructions, texture instructions, and the like. For a load instruction, a store instruction, or an atomic operation instruction, each instruction includes only one unbound resource descriptor. In this case, the descriptor on-chip address register 240 may include only one first register 242 for holding the on-chip address of the unbound resource descriptor of the corresponding instruction. For texture instructions, each instruction includes two unbound resource descriptors, a texture descriptor and a sample (or sampler) descriptor. In this case, the descriptor on-chip address register 240 may include two registers, namely a first register 242 and a second register 244, wherein the first register 242 is used to hold the on-chip address of the texture descriptor of the texture instruction and the second register 244 is used to hold the on-chip address of the sample descriptor of the texture instruction.
In aspects of the present disclosure, a pre-allocation instruction is added before the store access instruction to be executed by the execution unit circuitry 214, and may be executed ahead of that store access instruction. The pre-allocation instruction may pre-allocate unbound resource descriptors for each thread bundle (warp). Descriptor on-chip address registers 240 of different sizes may be provided depending on the size of the on-chip address. For example, with 16-bit on-chip addresses, the first register 242 and the second register 244 may each be implemented as 16-bit registers.
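A minimal sketch of the register arrangement just described, assuming 16-bit on-chip addresses as in the example above; the field names d0/d1 follow the instruction examples later in this description and are otherwise assumptions:

```python
# Sketch (assumed layout) of the descriptor on-chip address register 240:
# load/store/atomic instructions use only d0; texture instructions use
# d0 (texture descriptor) and d1 (sample descriptor).

ONCHIP_ADDR_MASK = 0xFFFF  # 16-bit on-chip address, per the example above

class DescriptorOnChipAddressRegister:
    def __init__(self):
        self.d0 = 0  # first register 242
        self.d1 = 0  # second register 244 (texture instructions only)

    def write(self, name, on_chip_address):
        # Keep only the low 16 bits, matching the 16-bit register width.
        setattr(self, name, on_chip_address & ONCHIP_ADDR_MASK)

regs = DescriptorOnChipAddressRegister()
regs.write("d0", 0x12345)   # a value wider than 16 bits is truncated
regs.write("d1", 0x00AB)
assert regs.d0 == 0x2345 and regs.d1 == 0x00AB
```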
Fig. 3 shows a schematic flow chart of a method 300 of operation of the computing device 200 according to an embodiment of the disclosure. As shown in fig. 3, the method 300 of operation of the computing device 200 may include a pre-allocation instruction execution process 310 and a storage access instruction execution process 320.
In the pre-allocation instruction execution process 310, the on-chip address of at least one unbound resource descriptor may be obtained by the pre-allocation instruction (block 312) and saved in the descriptor on-chip address register 240 (block 314).
Here, the on-chip address of the unbound resource descriptor may be obtained in a manner similar to the existing one. Specifically, the instruction scheduler circuitry 212 may send the address information of the unbound resource descriptor corresponding to the received pre-allocation instruction to the unbound descriptor buffer 220 to check whether the unbound resource descriptor exists there. If the unbound descriptor buffer 220 determines that the on-chip address of the unbound resource descriptor corresponding to the address information is stored therein, that on-chip address is returned to the instruction scheduler circuitry 212 and saved in the descriptor on-chip address register 240. If the unbound descriptor buffer 220 determines that the on-chip address is not stored therein, it may allocate an on-chip address for the unbound resource descriptor corresponding to the pre-allocation instruction, issue a request to the external memory 30 via the memory interface 20 to read the descriptor corresponding to the address information into the unbound descriptor buffer 220, and send the allocated on-chip address to the instruction scheduler circuitry 212 for storage in the descriptor on-chip address register 240.
In the store access instruction execution process 320, when the instruction scheduler circuitry 212 receives a store access instruction, it may read the on-chip address of the unbound resource descriptor corresponding to the instruction directly from the descriptor on-chip address register 240 (block 322) and send that on-chip address to the corresponding execution unit circuitry 214 (block 324) to execute the store access instruction.
Further (not shown), the load store buffer 230 may retrieve the corresponding unbound resource descriptor from the unbound descriptor buffer 220 based on the on-chip address of the unbound resource descriptor corresponding to the store access instruction, as provided by the execution unit circuitry 214.
Specifically, when the store access instruction is a load instruction, a store instruction, or an atomic instruction, the instruction scheduler circuitry 212 directly reads the on-chip address of the instruction's unbound resource descriptor from the first register 242. When the store access instruction is a texture instruction, the instruction scheduler circuitry 212 directly reads the on-chip address of the texture descriptor from the first register 242 and the on-chip address of the sample descriptor from the second register 244.
The pre-allocation instruction may be executed a number of instruction cycles (e.g., 200) in advance of the store access instruction, which is sufficient to ensure that the on-chip addresses of the corresponding unbound resource descriptors have all been allocated and saved in the descriptor on-chip address register 240 by the time the instruction scheduler circuitry 212 receives the store access instruction. As described above, descriptor on-chip address registers 240 of different sizes may be provided depending on the size of the on-chip address.
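The per-type register selection described above can be condensed into a small dispatch sketch; the function and the type strings are assumptions for illustration, and the 200-cycle lead is handled outside this fast path:

```python
# Illustrative fast path of instruction scheduler circuitry 212: pick the
# descriptor on-chip address register(s) to read based on instruction type.

def read_descriptor_addresses(instr_type, d0, d1):
    """Return the on-chip address(es) sent to execution unit circuitry 214."""
    if instr_type in ("load", "store", "atomic"):
        return (d0,)          # single unbound resource descriptor
    if instr_type == "tex":
        return (d0, d1)       # texture descriptor + sample descriptor
    raise ValueError(f"not a store access instruction: {instr_type}")

assert read_descriptor_addresses("load", 0x10, 0x20) == (0x10,)
assert read_descriptor_addresses("tex", 0x10, 0x20) == (0x10, 0x20)
```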
Note that the above description of the store access instruction execution process 320 shown in Fig. 3 assumes that the on-chip address of the unbound resource descriptor corresponding to the store access instruction has already been stored in the descriptor on-chip address register 240 (e.g., via a pre-allocation instruction). In some cases, the on-chip address of the unbound resource descriptor corresponding to the store access instruction received by the instruction scheduler circuitry 212 is not stored in the descriptor on-chip address register 240. In such cases, the instruction scheduler circuitry 212 may perform operations similar to those described with reference to Fig. 1. Specifically, the instruction scheduler circuitry 212 sends the address information of the unbound resource descriptor to the unbound descriptor buffer 220 to check whether the unbound resource descriptor exists there. If the unbound descriptor buffer 220 determines that the on-chip address of the unbound resource descriptor corresponding to the address information is stored therein, that on-chip address is returned to the instruction scheduler circuitry 212. If not, the unbound descriptor buffer 220 may allocate an on-chip address for the unbound resource descriptor corresponding to the store access instruction, issue a request to the external memory 30 via the memory interface 20 to read the descriptor corresponding to the address information into the unbound descriptor buffer 220, and send the allocated on-chip address to the instruction scheduler circuitry 212.
The instruction scheduler circuitry 212 may send the store access instruction and the on-chip address to the corresponding execution unit circuitry 214 to instruct it to execute the store access instruction according to the on-chip address. Specifically, when executing the store access instruction, the execution unit circuitry 214 may send the on-chip address of the unbound resource descriptor to the load store buffer 230, instructing the load store buffer 230 to obtain the descriptor corresponding to that on-chip address from the unbound descriptor buffer 220. The unbound resource descriptor contains the resource/data-related information of the current instruction that directs the execution of the store access instruction.
In the following, an execution instruction according to Fig. 1 and an execution instruction according to the present disclosure are described, taking a load instruction and a texture instruction as examples, respectively.
In the execution flow shown in Fig. 1, the load instruction is, for example:
load dest,src,desc_address
where load indicates the instruction type (i.e., a load instruction), dest indicates the destination address, src indicates the source address of the resource/data corresponding to the load instruction, and desc_address indicates the address information (virtual address) of the unbound resource descriptor of the instruction.
In the execution flow of the present disclosure shown in Fig. 2, the pre-allocation instruction and load instruction may be represented, for example, as:
alloc d0,desc_address instruction 1
…
load dest,src,d0
where alloc indicates the pre-allocation instruction, d0 indicates the first register 242, and desc_address indicates the virtual address of the unbound resource descriptor of the instruction. The pre-allocation instruction indicates that an on-chip address is to be allocated for the unbound resource descriptor of the instruction and saved in the first register 242. load indicates the instruction type of the store access instruction (i.e., a load instruction), dest indicates the destination address, src indicates the source address of the resource/data corresponding to the load instruction, and d0 indicates the first register 242, instructing the instruction scheduler circuitry 212 to read the on-chip address of the instruction's resource descriptor from the first register 242. Here, the pre-allocation instruction alloc is executed, for example, 200 instruction cycles before the store access instruction.
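The alloc/load pair above can be traced end to end in a small assumed software model: alloc resolves the descriptor's on-chip address once and caches it in d0, and the later load consumes it without a second buffer lookup. The addresses and helper names are illustrative:

```python
# Assumed end-to-end model of the "alloc d0, desc_address ... load dest,
# src, d0" sequence shown above.

def resolve_on_chip(buffer, va):
    """Unbound descriptor buffer model: allocate a slot on first use."""
    if va not in buffer:
        buffer[va] = len(buffer)   # allocate next free on-chip slot
    return buffer[va]

buffer, regs = {}, {}
desc_address = 0x2000              # virtual address of the descriptor

# alloc d0, desc_address  -- executed ~200 cycles before the load
regs["d0"] = resolve_on_chip(buffer, desc_address)

# ... other instructions execute in between ...

# load dest, src, d0  -- the scheduler only reads d0; no buffer round trip
on_chip = regs["d0"]
assert on_chip == buffer[desc_address]
```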
In the instruction execution flow shown in fig. 1, the texture instruction is, for example:
tex dest,t_src,s_src,t_address,s_address
where tex indicates the instruction type (i.e., a texture instruction), dest indicates the destination address, t_src and s_src indicate the source addresses of the texture and sampling resources/data corresponding to the texture instruction, respectively, t_address indicates the address information (virtual address) of the texture descriptor of the instruction, and s_address indicates the address information (virtual address) of the sample descriptor of the instruction.
In the instruction execution flow of the present invention as shown in fig. 2, the pre-allocation instructions and the texture instruction may be represented, for example, as:
alloc d0,t_address
alloc d1,s_address instruction 1
…
tex dest,t_src,s_src,d0,d1
where alloc indicates the pre-allocation instruction, d0 indicates the first register 242, t_address indicates the virtual address of the texture descriptor of the instruction, d1 indicates the second register 244, and s_address indicates the virtual address of the sample descriptor of the instruction. The pre-allocation instructions indicate that an on-chip address is allocated for the texture descriptor of the texture instruction and saved in the first register 242, and that an on-chip address is allocated for the sample descriptor of the texture instruction and saved in the second register 244. tex indicates the instruction type of the store access instruction (i.e., a texture instruction), dest indicates the destination address, t_src and s_src indicate the source addresses of the texture and sampling resources/data corresponding to the texture instruction, respectively, and d0 and d1 indicate the first register 242 and the second register 244, instructing the instruction scheduler circuit 212 to read the on-chip address of the texture descriptor and the on-chip address of the sample descriptor from the first register 242 and the second register 244, respectively. The pre-allocation instructions alloc are executed a plurality of instruction cycles before the store access instruction.
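The texture case extends the same idea to two descriptor registers, one pre-loaded for the texture descriptor (d0) and one for the sample descriptor (d1). As before, the class name and the toy resolution rule are illustrative assumptions:

```python
class TextureScheduler:
    def __init__(self, resolve_fn):
        self._resolve = resolve_fn   # virtual address -> on-chip address
        self._desc_regs = {}

    def alloc(self, reg, virtual_addr):
        # Pre-allocate an on-chip address into a descriptor register.
        self._desc_regs[reg] = self._resolve(virtual_addr)

    def tex(self, dest, t_src, s_src, t_reg, s_reg):
        # Both on-chip addresses are read directly from the registers,
        # so the texture instruction issues without a buffer lookup.
        return ("tex", dest, t_src, s_src,
                self._desc_regs[t_reg], self._desc_regs[s_reg])


sched = TextureScheduler(resolve_fn=lambda va: va % 64)  # toy mapping
sched.alloc("d0", 0x100)   # alloc d0, t_address
sched.alloc("d1", 0x141)   # alloc d1, s_address
result = sched.tex("r0", "t0", "s0", "d0", "d1")
```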
Note that the load instruction and the texture instruction are illustrated separately here; in practice, the pre-allocation instructions may be executed once for all subsequent instructions. For example, an actual instruction sequence may be as follows:
alloc d0,desc_address
alloc d1,s_address instruction 1
…
load dest,src,d0
tex dest,t_src,s_src,d0,d1
This means that the load instruction and the texture instruction are executed in sequence, while the two pre-allocation instructions are executed a number of instruction cycles ahead of the load instruction and the texture instruction.
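The reuse implied by the sequence above can be sketched as follows: once an on-chip address sits in a descriptor register, multiple subsequent store access instructions can read it without re-allocating. The register file class and its counter are illustrative, not part of the disclosure:

```python
class DescriptorRegisterFile:
    def __init__(self):
        self._regs = {}
        self.alloc_count = 0   # counts how often allocation actually ran

    def alloc(self, reg, on_chip_addr):
        self.alloc_count += 1
        self._regs[reg] = on_chip_addr

    def read(self, reg):
        return self._regs[reg]


regs = DescriptorRegisterFile()
regs.alloc("d0", 0x20)   # alloc d0, desc_address
regs.alloc("d1", 0x21)   # alloc d1, s_address

# A load and a texture instruction both reuse d0; allocation ran only twice.
load_addr = regs.read("d0")                     # load ..., d0
tex_addrs = (regs.read("d0"), regs.read("d1"))  # tex ..., d0, d1
```

Two allocations serve three register reads here, which is the amortization the disclosure attributes to keeping on-chip addresses resident in the scheduler's registers.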
With the computing device 200 and its method of operation 300 of the present invention, a pre-allocation instruction is added before a store access instruction to be executed by an execution unit. The pre-allocation instruction may be executed before the store access instruction to pre-allocate an on-chip address for the unbound resource descriptor of the subsequent store access instruction and store that on-chip address in a descriptor on-chip address register. When the subsequent store access instruction is executed, the store access instruction and the on-chip address of the unbound resource descriptor it uses can be sent directly to the execution unit circuit, with no need to look up the on-chip address in the unbound descriptor buffer and wait for it to be returned. This reduces operation complexity, and because the pre-allocation instruction can be executed a plurality of instruction cycles before the store access instruction, instruction waiting time is eliminated or reduced. Furthermore, since the on-chip address of the unbound resource descriptor is pre-allocated and saved in the descriptor on-chip address register of the instruction scheduler circuit, unbound resource descriptors with the same address or address offset may be reused by multiple instructions (e.g., store access instructions), which further improves the performance of the computing device.
Those skilled in the art will appreciate that the computing device 200 shown in fig. 2 is merely illustrative. In some embodiments, computing device 200 may include more or fewer components.
The computing device 200 and method of operation 300 according to the present disclosure are described above in connection with the accompanying drawings. However, those skilled in the art will appreciate that the operation of the computing device 200 and its method of operation 300 is not limited to the sequences shown in the figures and described above, but may be performed in any other reasonable order. Moreover, computing device 200 need not include all of the components shown in fig. 2, but may include only those components necessary to perform the functions described in this disclosure, and the manner of connection of such components is not limited to the form shown in the figures.
The present invention may be embodied as methods, computing devices, systems, and/or computer program products. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, which may store instructions for execution by the at least one processor. The instructions, when executed by the at least one processor, may perform the method of operation described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The various units of the apparatus disclosed herein may be implemented using discrete hardware components or may be integrally implemented on one hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A computing device, comprising:
a plurality of execution units, each of the execution units comprising an instruction scheduler circuit and an execution unit circuit, wherein the instruction scheduler circuit comprises a descriptor on-chip address register for holding an on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction, wherein
The instruction scheduler circuitry is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send the on-chip address of the unbound resource descriptor to the execution unit circuitry to execute the store access instruction.
2. The computing device of claim 1, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein
The instruction scheduler circuitry is configured to: when the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
3. The computing device of claim 1, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sample descriptor of the texture instruction, and wherein
The instruction scheduler circuitry is configured to: when the store access instruction is the texture instruction, read the on-chip address of the texture descriptor directly from the first register and the on-chip address of the sample descriptor directly from the second register.
4. The computing device of claim 1, wherein the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
5. The computing device of claim 1, further comprising an unbound descriptor buffer to store a plurality of unbound resource descriptors, wherein the descriptor on-chip address register obtains the on-chip address of the at least one unbound resource descriptor from the unbound descriptor buffer via the pre-allocation instruction, and a load store buffer to obtain the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address, provided by the execution unit circuitry, of the unbound resource descriptor corresponding to the store access instruction.
6. A method of operation of a computing device, the computing device comprising a plurality of execution units, each of the execution units comprising an instruction scheduler circuit and an execution unit circuit, wherein the instruction scheduler circuit comprises a descriptor on-chip address register, the method comprising:
acquiring an on-chip address of at least one unbound resource descriptor through a pre-allocation instruction;
saving an on-chip address of the at least one unbound resource descriptor in the descriptor on-chip address register;
when the instruction scheduler circuit receives a store access instruction, directly reading the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register; and
the instruction scheduler circuitry sends an on-chip address of the unbound resource descriptor to the execution unit circuitry to execute the store access instruction.
7. The method of operation of a computing device of claim 6, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein
When the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
8. The method of operation of a computing device of claim 6, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sample descriptor of the texture instruction, and wherein
When the store access instruction is the texture instruction, the on-chip address of the texture descriptor is read directly from the first register and the on-chip address of the sample descriptor is read directly from the second register.
9. The method of operation of a computing device as recited in claim 6 wherein the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
10. The method of operation of a computing device of claim 6, wherein the computing device further comprises an unbound descriptor buffer and a load store buffer, the method comprising:
acquiring an on-chip address of the at least one unbound resource descriptor from a plurality of unbound resource descriptors stored in the unbound descriptor buffer through the pre-allocation instruction; and
the load store buffer obtaining the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address, provided by the execution unit circuit, of the unbound resource descriptor corresponding to the store access instruction.
11. An electronic device, comprising:
a memory non-transitorily storing computer-executable instructions;
a processor configured to execute the computer-executable instructions;
wherein the computer executable instructions, when executed by the processor, implement the method of any one of claims 6 to 10.
12. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 6 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310071798.8A CN116069392A (en) | 2023-01-13 | 2023-01-13 | Computing device, operating method of computing device, electronic apparatus, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116069392A true CN116069392A (en) | 2023-05-05 |
Family
ID=86176497
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310071798.8A Pending CN116069392A (en) | 2023-01-13 | 2023-01-13 | Computing device, operating method of computing device, electronic apparatus, and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116069392A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6629237B2 (en) | Solving parallel problems employing hardware multi-threading in a parallel processing environment | |
US6560667B1 (en) | Handling contiguous memory references in a multi-queue system | |
EP1236088B1 (en) | Register set used in multithreaded parallel processor architecture | |
US7991983B2 (en) | Register set used in multithreaded parallel processor architecture | |
USRE41849E1 (en) | Parallel multi-threaded processing | |
US7793079B2 (en) | Method and system for expanding a conditional instruction into a unconditional instruction and a select instruction | |
US5237670A (en) | Method and apparatus for data transfer between source and destination modules | |
US9595075B2 (en) | Load/store operations in texture hardware | |
US9798543B2 (en) | Fast mapping table register file allocation algorithm for SIMT processors | |
CN108628638B (en) | Data processing method and device | |
US10146468B2 (en) | Addressless merge command with data item identifier | |
US8902915B2 (en) | Dataport and methods thereof | |
CN110908716A (en) | Method for implementing vector aggregation loading instruction | |
US10552349B1 (en) | System and method for dynamic pipelining of direct memory access (DMA) transactions | |
US10121220B2 (en) | System and method for creating aliased mappings to minimize impact of cache invalidation | |
US20020124157A1 (en) | Method and apparatus for fast operand access stage in a CPU design using a cache-like structure | |
US9846662B2 (en) | Chained CPP command | |
CA2323116A1 (en) | Graphic processor having multiple geometric operation units and method of processing data thereby | |
CN116069392A (en) | Computing device, operating method of computing device, electronic apparatus, and storage medium | |
US20190258492A1 (en) | Apparatuses for enqueuing kernels on a device-side | |
US7904697B2 (en) | Load register instruction short circuiting method | |
US6691190B1 (en) | Inter-DSP data exchange in a multiple DSP environment | |
KR20240025019A (en) | Provides atomicity for complex operations using near-memory computing | |
TWI819428B (en) | Processor apparatus | |
JPH0351012B2 (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
Country or region after: China
Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant after: Shanghai Bi Ren Technology Co.,Ltd.
Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai
Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.
Country or region before: China