CN116069392A - Computing device, operating method of computing device, electronic apparatus, and storage medium - Google Patents

Computing device, operating method of computing device, electronic apparatus, and storage medium

Info

Publication number
CN116069392A
Authority
CN
China
Prior art keywords
instruction
descriptor
unbound
chip address
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310071798.8A
Other languages
Chinese (zh)
Inventor
Request not to publish the name
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Biren Intelligent Technology Co Ltd
Original Assignee
Shanghai Biren Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Biren Intelligent Technology Co Ltd filed Critical Shanghai Biren Intelligent Technology Co Ltd
Priority to CN202310071798.8A priority Critical patent/CN116069392A/en
Publication of CN116069392A publication Critical patent/CN116069392A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 - Register arrangements
    • G06F9/30101 - Special purpose registers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/06 - Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3005 - Arrangements for executing specific machine instructions to perform operations for flow control

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides a computing device, a method of operating the computing device, an electronic apparatus, and a computer-readable storage medium. The computing device includes a plurality of execution units, each comprising instruction scheduler circuitry and execution unit circuitry. The instruction scheduler circuitry comprises a descriptor on-chip address register for holding the on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction. The instruction scheduler circuitry is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send that on-chip address to the execution unit circuitry to execute the store access instruction.

Description

Computing device, operating method of computing device, electronic apparatus, and storage medium
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a computing device, a method of operating a computing device, an electronic apparatus, and a computer-readable storage medium.
Background
Currently, some processors, such as graphics processing units (GPUs), must first bind a resource to a pipeline before performing read and write operations on that resource. Such binding is typically indirect, e.g., using a resource descriptor. A resource descriptor describes information about the memory to be accessed, such as its data format and access address. A resource descriptor may be bound or unbound. A bound resource descriptor describes a pre-allocated set of memory resources that all instructions in the program must use for memory access. In this case, the number of resource descriptors that one program can use is limited, and thus so are the types of resources it can access. To address the overhead of binding resources to a pipeline and the limit on the number of bound resource descriptors, unbound resource descriptors have been proposed, which are indexed by virtual address. With unbound resource descriptors, each instruction carries the virtual address of its resource descriptor and must interact with the private cache to obtain the resource the descriptor points to before it can correctly read and write the data it references. Program execution using such unbound resource descriptors is therefore significantly delayed by this process, severely impacting processor speed.
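Purely as an illustration of the kind of information a resource descriptor carries, a minimal Python sketch is given below; the class and field names are hypothetical and are not taken from this disclosure.

    from dataclasses import dataclass

    @dataclass
    class ResourceDescriptor:
        """Hypothetical sketch of the information a resource descriptor describes."""
        data_format: str     # data format of the resource (illustrative only)
        base_address: int    # access address of the memory resource
        size_in_bytes: int   # extent of the resource

    # An unbound resource descriptor is, in addition, indexed by a virtual address
    # rather than being bound to the pipeline ahead of time.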
Disclosure of Invention
In view of the foregoing, the present disclosure provides a computing device and a method of operating the computing device in which the on-chip address of an unbound resource descriptor acquired by a pre-allocation instruction is pre-stored in the instruction scheduler circuit of each execution unit and is read directly when the instruction scheduler circuit receives a store access instruction.
According to one aspect of the present disclosure, a computing device is provided. The computing device includes a plurality of execution units, each comprising instruction scheduler circuitry and execution unit circuitry. The instruction scheduler circuitry comprises a descriptor on-chip address register for holding the on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction. The instruction scheduler circuitry is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send the on-chip address of the corresponding unbound resource descriptor to the execution unit circuitry to execute the store access instruction.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein the instruction scheduler circuitry is configured to: when the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sampling descriptor of the texture instruction, and wherein the instruction scheduler circuitry is configured to: when the store access instruction is the texture instruction, the on-chip address of the texture descriptor is read directly from the first register and the on-chip address of the sampling descriptor is read directly from the second register.
In some implementations, the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
In some implementations, the computing device further includes an unbound descriptor buffer for storing a plurality of unbound resource descriptors and a load store buffer. The descriptor on-chip address register obtains the on-chip address of the at least one unbound resource descriptor from the unbound descriptor buffer via the pre-allocation instruction, and the load store buffer obtains the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address, provided by the execution unit circuitry, of the unbound resource descriptor corresponding to the store access instruction.
According to another aspect of the present disclosure, a method of operating a computing device is provided. The computing device includes a plurality of execution units, each of the execution units including an instruction scheduler circuit and an execution unit circuit, wherein the instruction scheduler circuit includes a descriptor on-chip address register. The method comprises the following steps: acquiring the on-chip address of at least one unbound resource descriptor through a pre-allocation instruction; saving the on-chip address of the at least one unbound resource descriptor in the descriptor on-chip address register; when the instruction scheduler circuit receives a store access instruction, directly reading the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register; and the instruction scheduler circuit sending the on-chip address of the corresponding unbound resource descriptor to the execution unit circuit to execute the store access instruction.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein when the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
In some implementations, the descriptor on-chip address register includes a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sampling descriptor of the texture instruction, and wherein, when the store access instruction is the texture instruction, the on-chip address of the texture descriptor is read directly from the first register and the on-chip address of the sampling descriptor is read directly from the second register.
In some implementations, the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
In some implementations, the computing device further includes an unbound descriptor buffer and a load store buffer, and the method includes: acquiring the on-chip address of the at least one unbound resource descriptor from a plurality of unbound resource descriptors stored in the unbound descriptor buffer through the pre-allocation instruction; and the load store buffer obtaining the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address, provided by the execution unit circuit, of the unbound resource descriptor corresponding to the store access instruction.
According to still another aspect of the present disclosure, there is provided an electronic device including: a memory that non-transitorily stores computer-executable instructions; and a processor configured to execute the computer-executable instructions; wherein the computer-executable instructions, when executed by the processor, implement the method as described above.
According to yet another aspect of the present disclosure, a computer readable storage medium is provided, having stored thereon computer program code which, when executed, performs the method as described above.
Drawings
The disclosure will be better understood, and other objects, details, features and advantages of the disclosure will become more apparent, from the following description of specific embodiments thereof given with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a computing device.
Fig. 2 shows a schematic structural diagram of a computing device according to an embodiment of the present disclosure.
Fig. 3 shows a schematic flow chart of a method of operation of a computing device according to an embodiment of the disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object.
Fig. 1 shows a schematic diagram of a computing device 100. The computing device 100 may be, for example, a computing unit (CU). The computing device 100 may include a plurality of execution units (EUs) 110; four execution units 110 are schematically illustrated in fig. 1. Each execution unit 110 has an associated instruction scheduler (IS) circuit 112 and execution unit (EU) circuit 114. Furthermore, the computing device 100 comprises an unbound descriptor buffer 120 for storing on-chip addresses of unbound resource descriptors. Here, the on-chip address refers to the on-chip physical address of the unbound resource descriptor. Note that fig. 1 only schematically shows the connection of the unbound descriptor buffer 120 to one instruction scheduler circuit 112; in fact, each instruction scheduler circuit 112 has a similar connection to the unbound descriptor buffer 120.
The computing device 100 may also be connected to an external memory 30, such as DDR SDRAM (double data rate synchronous dynamic random access memory) or HBM (high bandwidth memory), via a memory interface 20, so as to store data generated by the computing device 100 to the external memory 30 or to read desired data from the external memory 30 as needed.
In the computing device 100 shown in fig. 1, each instruction scheduler circuit 112 receives execution instructions to be executed by the corresponding execution unit 110, for example from a previous-stage scheduler or instruction queue (e.g., a scheduler or instruction queue of the execution unit 110 in which the instruction scheduler circuit 112 is located, not shown). An execution instruction may be, for example, a store access instruction, and includes at least an instruction type (e.g., load instruction, store instruction, atomic operation instruction, texture instruction, etc.) and the address information of the unbound resource descriptor corresponding to the instruction. Here, the address information of the unbound resource descriptor is the virtual address of the unbound resource descriptor, which is visible to the program. The address information of the unbound resource descriptor may be represented, for example, as the starting address of the descriptor group to which the unbound resource descriptor belongs together with the address offset of the unbound resource descriptor within that group (e.g., represented using a 32-bit offset). Alternatively, the address information may be represented directly as the virtual address of the unbound resource descriptor (e.g., using a 64-bit address). In addition, the execution instruction may also include a destination address indicating where the result of the instruction's operation is to be stored.
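As a non-authoritative sketch, the two alternative representations of a descriptor's address information described above can be modeled as follows; the class and method names are hypothetical, and only the 32-bit offset and 64-bit virtual address formats come from the text.

    from dataclasses import dataclass

    @dataclass
    class GroupRelativeDescriptorAddress:
        """Descriptor group starting address plus a 32-bit offset within the group."""
        group_base: int   # starting address of the descriptor group
        offset: int       # 32-bit offset of the unbound resource descriptor in the group

        def virtual_address(self) -> int:
            return self.group_base + (self.offset & 0xFFFFFFFF)

    @dataclass
    class AbsoluteDescriptorAddress:
        """Descriptor address information given directly as a 64-bit virtual address."""
        vaddr: int

        def virtual_address(self) -> int:
            return self.vaddr & 0xFFFFFFFFFFFFFFFF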
When the instruction scheduler circuitry 112 receives an execution instruction, it first sends address information of the unbound resource descriptor contained therein to the unbound descriptor buffer 120 to check whether the unbound resource descriptor is present in the unbound descriptor buffer 120. For example, the unbound descriptor buffer 120 may check whether an on-chip address corresponding to the address information is stored therein.
If the unbound descriptor buffer 120 determines that the on-chip address corresponding to the address information (i.e., the on-chip physical address of the unbound resource descriptor corresponding to the address information) is stored therein, the on-chip address is returned to the instruction scheduler circuitry 112. The instruction scheduler circuitry 112 may send a store access instruction and the on-chip address to the corresponding execution unit circuitry 114 to instruct the execution unit circuitry 114 to execute the store access instruction according to the on-chip address.
On the other hand, if the unbound descriptor buffer 120 determines that the on-chip address corresponding to the address information is not stored therein, this indicates that the unbound resource descriptor indicated by the address information is not in the unbound descriptor buffer 120 but in the external memory 30. In this case, the unbound descriptor buffer 120 may allocate an on-chip address for the unbound resource descriptor, send a request to the external memory 30 via the memory interface 20 to read the unbound resource descriptor corresponding to the address information into the unbound descriptor buffer 120, and send the allocated on-chip address to the instruction scheduler circuitry 112. The instruction scheduler circuitry 112 may then send the store access instruction and the on-chip address to the corresponding execution unit circuitry 114 to instruct the execution unit circuitry 114 to execute the store access instruction according to the on-chip address. Specifically, when the execution unit circuitry 114 executes the store access instruction, it may send the on-chip address of the unbound resource descriptor to the load store buffer 130 to instruct the load store buffer 130 to obtain the unbound resource descriptor corresponding to that on-chip address from the unbound descriptor buffer 120. The unbound resource descriptor contains the resource/data related information of the current instruction to direct execution of the store access instruction.
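The check-and-fetch behavior of this baseline flow can be summarized in the following behavioral sketch, with Python used as pseudocode; the class name UnboundDescriptorBuffer and its methods are hypothetical stand-ins for the unbound descriptor buffer 120, and external memory 30 is modeled as a simple mapping from virtual addresses to descriptor data.

    class UnboundDescriptorBuffer:
        """Behavioral sketch of the unbound descriptor buffer 120 (hypothetical)."""

        def __init__(self, external_memory):
            self.external_memory = external_memory  # models external memory 30
            self.on_chip_addr_by_vaddr = {}         # virtual address -> on-chip address
            self.descriptors = {}                   # on-chip address -> descriptor data
            self.next_on_chip_addr = 0

        def lookup_or_fetch(self, virtual_addr):
            # Hit: the descriptor is already on chip; return its on-chip address.
            if virtual_addr in self.on_chip_addr_by_vaddr:
                return self.on_chip_addr_by_vaddr[virtual_addr]
            # Miss: allocate an on-chip address and read the descriptor from
            # external memory into the buffer before returning the address.
            on_chip_addr = self.next_on_chip_addr
            self.next_on_chip_addr += 1
            self.descriptors[on_chip_addr] = self.external_memory[virtual_addr]
            self.on_chip_addr_by_vaddr[virtual_addr] = on_chip_addr
            return on_chip_addr

In the baseline device of fig. 1, this lookup is performed for every store access instruction; the scheme described below moves it ahead of time into a pre-allocation instruction.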
It can be seen that, in the above procedure, the instruction scheduler circuitry 112 needs to send the address information of each unbound resource descriptor to the unbound descriptor buffer 120 to check its location, and needs the unbound descriptor buffer 120 to return the on-chip address of the unbound resource descriptor. This makes delays in the pipelining of multiple instructions unavoidable. Further, the instruction scheduler circuitry 112 must repeat this check and on-chip address return for every instruction, even when the unbound resource descriptor used by the instruction was already checked for a previous instruction and its on-chip address already returned, which makes instruction execution inefficient.
To address this, the scheme of the present disclosure proposes to provide a dedicated descriptor on-chip address register in the instruction scheduler circuitry and to add a pre-allocation instruction before the store access instruction. The pre-allocation instruction can be executed ahead of the store access instruction, so that the on-chip address of the unbound resource descriptor is pre-allocated for the subsequent store access instruction, the unbound resource descriptor is read into the load store buffer, and the on-chip address of the unbound resource descriptor is saved in the descriptor on-chip address register. When a subsequent store access instruction requires the unbound resource descriptor, the instruction scheduler circuitry can read the pre-allocated on-chip address directly from the descriptor on-chip address register and send it to the execution unit circuitry to execute the store access instruction.
Fig. 2 shows a schematic structural diagram of a computing device 200 according to an embodiment of the disclosure. Similar to the computing device 100 shown in fig. 1, the computing device 200 may be, for example, a Computing Unit (CU). The computing device 200 may include a plurality of execution units 210, four execution units 210 being schematically illustrated in fig. 2. Each execution unit 210 has associated instruction scheduler circuitry 212 and execution unit circuitry 214.
Unlike the computing device 100 shown in FIG. 1, the computing device 200 includes one or more descriptor on-chip address registers 240 in the instruction scheduler circuitry 212 for storing on-chip addresses of unbound resource descriptors obtained from the unbound descriptor buffer 220 by pre-allocation instructions. Here, the on-chip address refers to an on-chip physical address of the unbound resource descriptor.
As described above, the execution instructions executed by the execution unit 210 may be, for example, store access instructions and the instruction types may include load instructions, store instructions, atomic instructions, texture instructions, and the like. For a load instruction, a store instruction, or an atomic operation instruction, each instruction includes only one unbound resource descriptor. In this case, the descriptor on-chip address register 240 may include only one first register 242 for holding the on-chip address of the unbound resource descriptor of the corresponding instruction. For texture instructions, each instruction includes two unbound resource descriptors, a texture descriptor and a sample (or sampler) descriptor. In this case, the descriptor on-chip address register 240 may include two registers, namely a first register 242 and a second register 244, wherein the first register 242 is used to hold the on-chip address of the texture descriptor of the texture instruction and the second register 244 is used to hold the on-chip address of the sample descriptor of the texture instruction.
In aspects of the present disclosure, a pre-allocation instruction is added before a store access instruction to be executed by the execution unit circuitry 214, and the pre-allocation instruction may be executed ahead of the store access instruction. The pre-allocation instruction may pre-allocate unbound resource descriptors for each thread bundle (warp). Descriptor on-chip address registers 240 of different sizes may be provided depending on the size of the on-chip address. For example, when 16-bit on-chip addresses are used, the first register 242 and the second register 244 may each be implemented as a 16-bit register.
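A minimal sketch of how the descriptor on-chip address registers 240 might be modeled per thread bundle, assuming the 16-bit on-chip addresses of the example above; the register names d0 and d1 follow the instruction examples given later, while the class itself is a hypothetical illustration.

    class DescriptorAddressRegisters:
        """Per-warp descriptor on-chip address registers 242/244 (behavioral sketch)."""

        WIDTH_MASK = 0xFFFF  # 16-bit on-chip addresses in this example

        def __init__(self):
            self.d0 = None  # first register 242: load/store/atomic descriptor, or texture descriptor
            self.d1 = None  # second register 244: sampling descriptor of a texture instruction

        def write(self, reg_name, on_chip_addr):
            setattr(self, reg_name, on_chip_addr & self.WIDTH_MASK)

        def read(self, reg_name):
            value = getattr(self, reg_name)
            if value is None:
                raise LookupError(f"{reg_name} has not been pre-allocated")
            return value

    # One register set per thread bundle (warp), for example:
    registers_per_warp = {warp_id: DescriptorAddressRegisters() for warp_id in range(4)}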
Fig. 3 shows a schematic flow chart of a method 300 of operation of the computing device 200 according to an embodiment of the disclosure. As shown in fig. 3, the method 300 of operation of the computing device 200 may include a pre-allocation instruction execution process 310 and a store access instruction execution process 320.
In the pre-allocation instruction execution process 310, the on-chip address of at least one unbound resource descriptor may be obtained by the pre-allocation instruction (block 312) and saved in the descriptor on-chip address register 240 (block 314).
Here, the on-chip address of the unbound resource descriptor may be obtained in a manner similar to the existing one. Specifically, the instruction scheduler circuit 212 may send the address information of the unbound resource descriptor corresponding to the received pre-allocation instruction to the unbound descriptor buffer 220 to check whether the unbound resource descriptor is present. If the unbound descriptor buffer 220 determines that the on-chip address of the unbound resource descriptor corresponding to the address information is stored therein, the on-chip address is returned to the instruction scheduler circuit 212 and saved in the descriptor on-chip address register 240. If the unbound descriptor buffer 220 determines that the on-chip address of the unbound resource descriptor corresponding to the address information is not stored therein, the unbound descriptor buffer 220 may allocate an on-chip address for the unbound resource descriptor corresponding to the pre-allocation instruction, send a request to the external memory 30 via the memory interface 20 to read the unbound resource descriptor corresponding to the address information into the unbound descriptor buffer 220, and send the allocated on-chip address to the instruction scheduler circuit 212 to be saved in the descriptor on-chip address register 240.
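Continuing the same behavioral sketch, the handling of a pre-allocation instruction could look like the following; the function execute_alloc and the objects it receives are hypothetical illustrations rather than the implementation of this disclosure, and they reuse the UnboundDescriptorBuffer and DescriptorAddressRegisters sketches above.

    def execute_alloc(reg_name, descriptor_vaddr, regs, descriptor_buffer):
        """Model of 'alloc dX, desc_address': resolve the descriptor's on-chip address
        (fetching it from external memory on a miss) and save it in the named
        descriptor on-chip address register."""
        on_chip_addr = descriptor_buffer.lookup_or_fetch(descriptor_vaddr)
        regs.write(reg_name, on_chip_addr)
        return on_chip_addr

    # Example: pre-allocate the descriptor used by a later load instruction.
    # execute_alloc("d0", desc_address, registers_per_warp[0], unbound_descriptor_buffer)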
In the store access instruction execution process 320, when the instruction scheduler circuitry 212 receives a store access instruction, it may read the on-chip address of the unbound resource descriptor corresponding to the instruction directly from the descriptor on-chip address register 240 (block 322) and send the on-chip address of the unbound resource descriptor to the corresponding execution unit circuitry 214 (block 324) to execute the store access instruction.
Further (not shown), load store buffer 230 may retrieve the corresponding unbound resource descriptor from unbound descriptor buffer 220 based on the on-chip address of the unbound resource descriptor corresponding to the store access instruction provided by execution unit circuitry 214.
Specifically, when the store access instruction is a load instruction, a store instruction, or an atomic operation instruction, the instruction scheduler circuitry 212 directly reads the on-chip address of the unbound resource descriptor of the instruction from the first register 242. When the store access instruction is a texture instruction, the instruction scheduler circuitry 212 directly reads the on-chip address of the texture descriptor from the first register 242 and the on-chip address of the sampling descriptor from the second register 244.
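The dispatch step just described, in which the instruction scheduler circuitry 212 reads the pre-allocated on-chip address directly instead of querying the unbound descriptor buffer 220, could be sketched as follows; the names are again hypothetical, and issue_to_execution_unit stands in for the hand-off to the execution unit circuitry 214.

    def dispatch_store_access(instr, regs, issue_to_execution_unit):
        """Read the descriptor on-chip address(es) directly from the registers and
        forward them with the instruction to the execution unit circuitry."""
        if instr["type"] in ("load", "store", "atomic"):
            on_chip_addrs = [regs.read("d0")]
        elif instr["type"] == "tex":
            # A texture instruction carries two descriptors: texture and sampling.
            on_chip_addrs = [regs.read("d0"), regs.read("d1")]
        else:
            raise ValueError(f"not a store access instruction: {instr['type']}")
        issue_to_execution_unit(instr, on_chip_addrs)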
The pre-allocation instruction may be executed a number of instruction cycles (e.g., 200) in advance of the store access instruction, which is sufficient to ensure that the on-chip addresses of the corresponding unbound resource descriptors have all been allocated and saved in the descriptor on-chip address register 240 by the time the store access instruction is received by the instruction scheduler circuitry 212. As described above, descriptor on-chip address registers 240 of different sizes may be provided depending on the size of the on-chip address.
Note that the above description of the store access instruction execution process 320 shown in fig. 3 assumes that the on-chip address of the unbound resource descriptor corresponding to the store access instruction has already been stored in the descriptor on-chip address register 240 (e.g., via a pre-allocation instruction). In some cases, the on-chip address of the unbound resource descriptor corresponding to the store access instruction received by the instruction scheduler circuitry 212 is not stored in the descriptor on-chip address register 240. In this case, the instruction scheduler circuitry 212 may perform operations similar to those described above for fig. 1. Specifically, the instruction scheduler circuitry 212 sends the address information of the unbound resource descriptor to the unbound descriptor buffer 220 to check whether the unbound resource descriptor is present. If the unbound descriptor buffer 220 determines that the on-chip address of the unbound resource descriptor corresponding to the address information is stored therein, the on-chip address is returned to the instruction scheduler circuitry 212. If the unbound descriptor buffer 220 determines that the on-chip address is not stored therein, it may allocate an on-chip address for the unbound resource descriptor corresponding to the store access instruction, issue a request to the external memory 30 via the memory interface 20 to read the unbound resource descriptor corresponding to the address information into the unbound descriptor buffer 220, and send the allocated on-chip address to the instruction scheduler circuitry 212. The instruction scheduler circuitry 212 may then send the store access instruction and the on-chip address to the corresponding execution unit circuitry 214 to instruct the execution unit circuitry 214 to execute the store access instruction according to the on-chip address. Specifically, when the execution unit circuitry 214 executes the store access instruction, it may send the on-chip address of the unbound resource descriptor to the load store buffer 230 to instruct the load store buffer 230 to obtain the unbound resource descriptor corresponding to that on-chip address from the unbound descriptor buffer 220. The unbound resource descriptor contains the resource/data related information of the current instruction to direct execution of the store access instruction.
In the following, the execution of instructions in the computing device shown in fig. 1 and the execution of instructions according to the present disclosure are described, taking a load instruction and a texture instruction as examples, respectively.
In the instruction execution flow shown in fig. 1, the load instruction is, for example:
load dest,src,desc_address
where load indicates the instruction type (i.e., load instruction), dest indicates the destination address, src indicates the source address of the resource/data to which the load instruction corresponds, and desc_address indicates the address information (virtual address) of the unbound resource descriptor of the instruction.
In the instruction execution flow according to the present disclosure shown in fig. 2, the pre-allocation instruction and the load instruction may be represented, for example, as:
alloc d0,desc_address        instruction 1
...
load dest,src,d0             instruction 200
where alloc indicates the pre-allocation instruction, d0 indicates the first register 242, and desc_address indicates the virtual address of the unbound resource descriptor of the instruction. The pre-allocation instruction indicates that an on-chip address is allocated for the unbound resource descriptor of the instruction and saved in the first register 242. load indicates the instruction type of the store access instruction (i.e., a load instruction), dest indicates the destination address, src indicates the source address of the resource/data corresponding to the load instruction, and d0 indicates the first register 242, instructing the instruction scheduler circuit 212 to read the on-chip address of the unbound resource descriptor of the instruction from the first register 242. The pre-allocation instruction alloc is executed 200 instruction cycles before the store access instruction.
In the instruction execution flow shown in fig. 1, the texture instruction is, for example:
tex dest,t_src,s_src,t_address,s_address
where tex indicates the instruction type (i.e., a texture instruction), dest indicates the destination address, t_src and s_src indicate the source addresses of the texture and sampling resources/data corresponding to the texture instruction, respectively, t_address indicates the address information (virtual address) of the texture descriptor of the instruction, and s_address indicates the address information (virtual address) of the sampling descriptor of the instruction.
In the instruction execution flow according to the present disclosure shown in fig. 2, the pre-allocation instructions and the texture instruction may be represented, for example, as:
alloc d0,t_address
alloc d1,s_address           instruction 1
...
tex dest,t_src,s_src,d0,d1   instruction 200
where alloc indicates a pre-allocation instruction, d0 indicates the first register 242, t_address indicates the virtual address of the texture descriptor of the instruction, d1 indicates the second register 244, and s_address indicates the virtual address of the sampling descriptor of the instruction. The pre-allocation instructions indicate that an on-chip address is allocated for the texture descriptor of the texture instruction and saved in the first register 242, and an on-chip address is allocated for the sampling descriptor of the texture instruction and saved in the second register 244. tex indicates the instruction type of the store access instruction (i.e., a texture instruction), dest indicates the destination address, t_src and s_src indicate the source addresses of the texture and sampling resources/data corresponding to the texture instruction, respectively, and d0 and d1 indicate the first register 242 and the second register 244, instructing the instruction scheduler circuitry 212 to read the on-chip address of the texture descriptor and the on-chip address of the sampling descriptor from the first register 242 and the second register 244, respectively. The pre-allocation instructions alloc are executed 200 instruction cycles before the store access instruction.
Note that the load instruction and the texture instruction are illustrated separately above; in practice, the pre-allocation instructions for all subsequent instructions are executed up front together. For example, the actual instruction sequence may be as follows:
alloc d0,desc_address
alloc d1,s_address           instruction 1
...
load dest,src,d0             instruction 200
tex dest,t_src,s_src,d0,d1
This means that the load instruction and the texture instruction are executed in sequence, and the two pre-allocation instructions are executed a number of instruction cycles ahead of the load instruction and the texture instruction.
With the computing device 200 and its operating method 300 according to the present disclosure, a pre-allocation instruction is added before a store access instruction to be executed by an execution unit. The pre-allocation instruction can be executed ahead of the store access instruction, pre-allocating the on-chip address of the unbound resource descriptor for the subsequent store access instruction and saving that address in the descriptor on-chip address register. When the subsequent store access instruction is executed, the instruction and the on-chip address of the unbound resource descriptor it uses can be sent directly to the execution unit circuitry, without querying the unbound descriptor buffer for the on-chip address and waiting for it to be returned. This reduces operational complexity, and because the pre-allocation instruction can be executed a plurality of instruction cycles before the store access instruction, instruction waiting time is eliminated or reduced. Furthermore, since the on-chip address of an unbound resource descriptor can be pre-allocated and saved in the descriptor on-chip address register of the instruction scheduler circuitry, unbound resource descriptors with the same address or address offset can be reused by multiple instructions (e.g., store access instructions), further improving the performance of the computing device.
Those skilled in the art will appreciate that the computing device 200 shown in fig. 2 is merely illustrative. In some embodiments, computing device 200 may include more or fewer components.
The computing device 200 and the method 300 of operation according to the present disclosure are described above in connection with the accompanying drawings. However, those skilled in the art will appreciate that the execution of the computing device 200 and its method of operation 300 is not limited to the sequences shown in the figures and described above, but may be performed in any other reasonable order. Moreover, the computing device 200 need not include all of the components shown in fig. 2, but may include only those components necessary to perform the functions described in this disclosure, and the manner in which these components are connected is not limited to the form shown in the figures.
The present disclosure may be embodied as methods, computing devices, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for performing aspects of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, which may store instructions for execution by the at least one processor. The instructions, when executed by the at least one processor, may perform the operating method described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The various units of the apparatus disclosed herein may be implemented using discrete hardware components or may be integrally implemented on one hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A computing device, comprising:
a plurality of execution units, each of the execution units comprising an instruction scheduler circuit and an execution unit circuit, wherein the instruction scheduler circuit comprises a descriptor on-chip address register for holding an on-chip address of at least one unbound resource descriptor obtained by a pre-allocation instruction, wherein
the instruction scheduler circuit is configured to: upon receiving a store access instruction, directly read the on-chip address of the unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register, and send the on-chip address of the unbound resource descriptor to the execution unit circuit to execute the store access instruction.
2. The computing device of claim 1, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein
the instruction scheduler circuit is configured to: when the store access instruction is the load instruction, the store instruction, or the atomic operation instruction, read the on-chip address of the unbound resource descriptor directly from the first register.
3. The computing device of claim 1, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sampling descriptor of the texture instruction, and wherein
the instruction scheduler circuit is configured to: when the store access instruction is the texture instruction, read the on-chip address of the texture descriptor directly from the first register and the on-chip address of the sampling descriptor directly from the second register.
4. The computing device of claim 1, wherein the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
5. The computing device of claim 1, further comprising an unbound descriptor buffer to store a plurality of unbound resource descriptors and a load store buffer, wherein the descriptor on-chip address register obtains the on-chip address of the at least one unbound resource descriptor from the unbound descriptor buffer via the pre-allocation instruction, and the load store buffer obtains the corresponding unbound resource descriptor from the unbound descriptor buffer based on the on-chip address, provided by the execution unit circuit, of the unbound resource descriptor corresponding to the store access instruction.
6. A method of operation of a computing device, the computing device comprising a plurality of execution units, each of the execution units comprising an instruction scheduler circuit and an execution unit circuit, wherein the instruction scheduler circuit comprises a descriptor on-chip address register, the method comprising:
acquiring an on-chip address of at least one unbound resource descriptor through a pre-allocation instruction;
saving an on-chip address of the at least one unbound resource descriptor in the descriptor on-chip address register;
when the instruction scheduler circuit receives a store access instruction, directly reading an on-chip address of an unbound resource descriptor corresponding to the store access instruction from the descriptor on-chip address register; and
the instruction scheduler circuit sending the on-chip address of the unbound resource descriptor to the execution unit circuit to execute the store access instruction.
7. The method of operation of a computing device of claim 6, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of an unbound resource descriptor of a load instruction, a store instruction, or an atomic operation instruction, and wherein
When the store access instruction is the load instruction, store instruction, or atomic operation instruction, the on-chip address of the unbound resource descriptor is read directly from the first register.
8. The method of operation of a computing device of claim 6, wherein the descriptor on-chip address register comprises a first register to hold an on-chip address of a texture descriptor of a texture instruction and a second register to hold an on-chip address of a sampling descriptor of the texture instruction, and wherein
when the store access instruction is the texture instruction, the on-chip address of the texture descriptor is read directly from the first register and the on-chip address of the sampling descriptor is read directly from the second register.
9. The method of operation of a computing device as recited in claim 6 wherein the pre-allocation instruction is executed a plurality of instruction cycles ahead of the store access instruction.
10. The method of operation of a computing device of claim 6, wherein the computing device further comprises an unbound descriptor buffer and a load store buffer, the method comprising:
acquiring an on-chip address of the at least one unbound resource descriptor from a plurality of unbound resource descriptors stored in the unbound descriptor buffer through the pre-allocation instruction; and
the load store buffer obtaining the corresponding unbound resource descriptor from the unbound descriptor buffer according to the on-chip address, provided by the execution unit circuit, of the unbound resource descriptor corresponding to the store access instruction.
11. An electronic device, comprising:
a memory that non-transitorily stores computer-executable instructions;
a processor configured to execute the computer-executable instructions;
wherein the computer executable instructions, when executed by the processor, implement the method of any one of claims 6 to 10.
12. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 6 to 10.
CN202310071798.8A 2023-01-13 2023-01-13 Computing device, operating method of computing device, electronic apparatus, and storage medium Pending CN116069392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310071798.8A CN116069392A (en) 2023-01-13 2023-01-13 Computing device, operating method of computing device, electronic apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310071798.8A CN116069392A (en) 2023-01-13 2023-01-13 Computing device, operating method of computing device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN116069392A true CN116069392A (en) 2023-05-05

Family

ID=86176497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310071798.8A Pending CN116069392A (en) 2023-01-13 2023-01-13 Computing device, operating method of computing device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN116069392A (en)

Similar Documents

Publication Publication Date Title
US6629237B2 (en) Solving parallel problems employing hardware multi-threading in a parallel processing environment
US6560667B1 (en) Handling contiguous memory references in a multi-queue system
EP1236088B1 (en) Register set used in multithreaded parallel processor architecture
US7991983B2 (en) Register set used in multithreaded parallel processor architecture
USRE41849E1 (en) Parallel multi-threaded processing
US7793079B2 (en) Method and system for expanding a conditional instruction into a unconditional instruction and a select instruction
US5237670A (en) Method and apparatus for data transfer between source and destination modules
US9595075B2 (en) Load/store operations in texture hardware
US9798543B2 (en) Fast mapping table register file allocation algorithm for SIMT processors
CN108628638B (en) Data processing method and device
US10146468B2 (en) Addressless merge command with data item identifier
US8902915B2 (en) Dataport and methods thereof
CN110908716A (en) Method for implementing vector aggregation loading instruction
US10552349B1 (en) System and method for dynamic pipelining of direct memory access (DMA) transactions
US10121220B2 (en) System and method for creating aliased mappings to minimize impact of cache invalidation
US20020124157A1 (en) Method and apparatus for fast operand access stage in a CPU design using a cache-like structure
US9846662B2 (en) Chained CPP command
CA2323116A1 (en) Graphic processor having multiple geometric operation units and method of processing data thereby
CN116069392A (en) Computing device, operating method of computing device, electronic apparatus, and storage medium
US20190258492A1 (en) Apparatuses for enqueuing kernels on a device-side
US7904697B2 (en) Load register instruction short circuiting method
US6691190B1 (en) Inter-DSP data exchange in a multiple DSP environment
KR20240025019A (en) Provides atomicity for complex operations using near-memory computing
TWI819428B (en) Processor apparatus
JPH0351012B2 (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China