WO2024109689A1 - Register parameter passing method for block instruction and related device - Google Patents

Register parameter passing method for block instruction and related device

Info

Publication number
WO2024109689A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
registers
register
intra
global
Prior art date
Application number
PCT/CN2023/132608
Other languages
English (en)
French (fr)
Inventor
李国柱
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2024109689A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • G06F9/384Register renaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • the present application relates to the field of computer technology, and in particular to a register parameter transfer method for a block instruction and related equipment.
  • the prior art implements the scheduling of block instructions by the processor, that is, the processor can schedule a group of instructions instead of a single instruction.
  • the scheduling of block instructions by the processor is generally divided into two layers, the first layer is the block header, and the second layer is the block body.
  • When the processor executes a block instruction, it needs to pass register parameters between the block header layer and the block body layer, for example, declare which global registers are to be used in the block header, and then directly index these global registers in the specific instructions within the block body. In this way, although the normal execution of block instructions is guaranteed, it is easy to cause problems such as low efficiency of block instruction execution and poor reusability of block code.
  • the embodiments of the present application provide a register parameter transfer method and related devices for a block instruction, which can improve execution efficiency and block code reusability.
  • an embodiment of the present application provides a register parameter passing method for a block instruction, wherein the block instruction includes a block header and a block body, the block header is used to indicate M global registers corresponding to the block instruction, the block body includes K intra-block instructions, the K intra-block instructions correspond to N intra-block registers, and K is an integer greater than 0; the method may include: establishing a mapping relationship for P global registers among the M global registers and P intra-block registers among the N intra-block registers to obtain a register mapping table; the P global registers correspond one-to-one to the P intra-block registers; M and N are both integers greater than 0; determining that the input data of the i-th intra-block instruction among the K intra-block instructions comes from the j-th intra-block register among the P intra-block registers; i takes 1, 2, ..., K, and j takes 1, 2, ..., P; based on the register mapping table, determining the target register corresponding to the j-th intra-block register among the P global registers; obtaining the input data from the target register, and executing the i-th intra-block instruction.
  • register parameters are generally transferred between the block header layer and the block body layer by direct indexing or default output registers.
  • the direct indexing method requires the address of the corresponding global register to be directly indexed in the block body, and the result of each instruction in the block also needs to be temporarily stored in the global register, which is not conducive to predictive execution and has low execution efficiency; and the direct indexing of the global register in the code of the block body is not conducive to the reuse of the block body code.
  • the default output register method requires the additional use of get/set and other instructions, which makes the instruction space larger, and get/set and other instructions need to index the global register through the global ID of the global register, which is also not conducive to the reuse of the block body code.
  • a register mapping table can be established for the global register indicated by the block header and the preset block register in the block body, so that the block engine can index and output to the global register based on the block register, thereby improving the execution efficiency.
  • the block instructions in the block body can no longer rely on get and set instructions to index and assign values to the global register, thereby saving the extra space occupied by get and set instructions in the block body.
  • the global register ID may no longer be used in the instructions within a block of block instructions. In this way, when two or more block instructions have the same pattern but use different global registers, the two or more block instructions can share the same block, so that the block code does not need to be rewritten, thereby improving the reusability of the block code.
  • the step of determining that the input data of the i-th in-block instruction among the K in-block instructions comes from the j-th in-block register among the P in-block registers includes: determining the j-th in-block register based on the input information in the i-th in-block instruction; the in-block register corresponding to the execution result of the i-th in-block instruction is the t-th in-block register; when t is greater than or equal to j, the input information includes a relative distance t-j between the t-th in-block register and the j-th in-block register; when t is less than j, the input information includes a relative distance N-j+t between the t-th in-block register and the j-th in-block register.
  • the input information of the intra-block instruction includes the relative distance between the intra-block register corresponding to the execution result of the current intra-block instruction and the intra-block register that the current intra-block instruction needs to index, so that the processor based on the block instruction can determine the intra-block register to be indexed by the current intra-block instruction through the relative distance, thereby simplifying the indexing logic of the intra-block register and further improving the execution efficiency.
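  • As a minimal sketch (not taken from the patent text, and assuming 1-based register numbering), the relative-distance rule above can be expressed in C as follows, where t is the intra-block register holding the instruction's result, j is the intra-block register holding its input, and N is the total number of intra-block registers:

      #include <stdio.h>

      /* Relative distance encoded in the input information of an intra-block
       * instruction: t - j when t >= j, otherwise wrap around the N-entry
       * intra-block register file with N - j + t. */
      static unsigned relative_distance(unsigned t, unsigned j, unsigned n_regs)
      {
          if (t >= j)
              return t - j;
          return n_regs - j + t;
      }

      int main(void)
      {
          printf("%u\n", relative_distance(3, 1, 8)); /* t=3, j=1, N=8 -> 2 */
          printf("%u\n", relative_distance(1, 7, 8)); /* wrapped: 8-7+1 -> 2 */
          return 0;
      }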
  • the establishing of a mapping relationship between P global registers among the M global registers and P in-block registers among the N in-block registers includes: determining the P in-block registers corresponding to the input data of each of the K in-block instructions based on the in-block registers and input information corresponding to the execution results of each of the K in-block instructions; and establishing a mapping relationship between the P in-block registers and the P global registers.
  • the input information of the instructions in the block and the output register in the block corresponding to the execution result of the instructions in the block are first determined, and based on this, the input register in the block corresponding to the input data of each instruction in the block is determined, and then a register mapping table is established for the input register in the block and the global register to ensure that the instructions in the block can index the global register to obtain the input data when they are executed.
  • the global register can also be indexed and the execution result can be output when the block instruction is executed.
  • the in-block registers corresponding to the execution results of each of the K in-block instructions include S in-block registers; the method further includes: establishing a mapping relationship between the S global registers among the M global registers and the S in-block registers; the S global registers correspond one-to-one to the S in-block registers.
  • an output index relationship can be established for the global output register (i.e., S global registers) and the intra-block output register (i.e., S intra-block registers) to ensure that the global register can be indexed and the execution result can be output when the block instruction is executed.
  • the establishing of a mapping relationship between P global registers among the M global registers and P in-block registers among the N in-block registers includes: establishing a mapping relationship between the P in-block registers and the P global registers based on the order of the P in-block registers corresponding to the input data of each of the K in-block instructions in the N in-block registers and the indication order of the P global registers in the block header;
  • the establishing of a mapping relationship between S global registers among the M global registers and the S in-block registers includes: establishing a mapping relationship between the S in-block registers and the S global registers based on the order of the S in-block registers in the N in-block registers and the indication order of the S global registers in the block header.
  • a mapping relationship can be established between the in-block input registers and the global input registers according to the order of the in-block input registers (i.e., P in-block registers) corresponding to each instruction in the block in the N in-block registers and the indication order of the global input registers (i.e., P global registers) in the block header, and a mapping relationship can be established between the in-block output registers and the global output registers according to the order of the in-block output registers (i.e., S in-block registers) corresponding to each instruction in the block and the indication order of the global output registers (i.e., S global registers) in the block header, thereby simplifying the mapping logic and reducing the difficulty of maintaining the register mapping table.
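  • The order-based mapping described above can be sketched as follows; this is an illustration only, and the array layout and the example register IDs are assumptions rather than the patent's data structures. The k-th intra-block input register, in its order among the N intra-block registers, is simply paired with the k-th global input register in the order indicated by the block header (the output mapping can be built the same way):

      #include <stdio.h>

      #define MAX_MAP 32

      struct reg_map {
          int intra[MAX_MAP];     /* intra-block register index */
          int global_id[MAX_MAP]; /* global register ID it maps to */
          int count;
      };

      /* Pair registers purely by position: the k-th intra-block register is
       * mapped to the k-th global register indicated in the block header. */
      static struct reg_map map_in_order(const int *intra_regs,
                                         const int *global_regs, int count)
      {
          struct reg_map m = { .count = count };
          for (int k = 0; k < count; k++) {
              m.intra[k] = intra_regs[k];
              m.global_id[k] = global_regs[k];
          }
          return m;
      }

      int main(void)
      {
          /* Hypothetical example: the block header indicates global input
           * registers r1 and r2; the block body's input data comes from
           * intra-block registers 6 and 7. */
          int intra_in[]  = { 6, 7 };
          int global_in[] = { 1, 2 };
          struct reg_map in_map = map_in_order(intra_in, global_in, 2);
          for (int k = 0; k < in_map.count; k++)
              printf("intra-block register %d -> global register r%d\n",
                     in_map.intra[k], in_map.global_id[k]);
          return 0;
      }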
  • the register mapping table also includes a mapping relationship between the S global registers and the S in-block registers; the method also includes: outputting the execution results of one or more of the K in-block instructions to corresponding global registers in the S global registers based on the register mapping table, and submitting the block instruction.
  • the execution results of the block in the block instruction can be output to the global output registers (i.e., S global registers) indicated by the block header through the register mapping table, and the block instruction can be submitted so that the subsequent corresponding program can access the register to obtain the execution result, thereby ensuring that the processor based on the block instruction can process more block instructions efficiently and accurately.
  • an embodiment of the present application provides a register parameter passing device for a block instruction, wherein the block instruction includes a block header and a block body, the block header is used to indicate M global registers corresponding to the block instruction, the block body includes K intra-block instructions, the K intra-block instructions correspond to N intra-block registers, and K is an integer greater than 0;
  • the device may include: a first processing module, used to establish a mapping relationship for P global registers among the M global registers and P intra-block registers among the N intra-block registers to obtain a register mapping table; the P global registers correspond one-to-one to the P intra-block registers; M and N are both integers greater than 0; a determination module, used to determine that the input data of the i-th intra-block instruction among the K intra-block instructions comes from the j-th intra-block register among the P intra-block registers; i takes 1, 2, ..., K, and j takes 1, 2, ..., P; a second processing module, used to determine, based on the register mapping table, the target register corresponding to the j-th intra-block register among the P global registers, obtain the input data from the target register, and execute the i-th intra-block instruction.
  • the register parameter transfer device of the block instruction can establish a register mapping table for the global register indicated by the block header and the intra-block register preset in the block body, so that the block engine can index and output to the global register based on the intra-block register, thereby improving the execution efficiency.
  • the intra-block instructions of the block body can no longer rely on instructions such as get and set to index and assign values to the global register, thereby saving the extra space occupied by intra-block instructions such as get and set in the block body.
  • the ID of the global register can no longer be used in the intra-block instructions of the block instruction.
  • the determination module is specifically used to: determine the j-th in-block register based on the input information in the i-th in-block instruction; the in-block register corresponding to the execution result of the i-th in-block instruction is the t-th in-block register; when t is greater than or equal to j, the input information includes the relative distance t-j between the t-th in-block register and the j-th in-block register; when t is less than j, the input information includes the relative distance N-j+t between the t-th in-block register and the j-th in-block register.
  • the first processing module is specifically used to: determine the P in-block registers corresponding to the input data of each of the K in-block instructions based on the in-block registers corresponding to the execution results of each of the K in-block instructions and the input information; and establish a mapping relationship between the P in-block registers and the P global registers.
  • the in-block registers corresponding to the execution results of each of the K in-block instructions include S in-block registers; the device also includes: a third processing module, used to establish a mapping relationship between S global registers among the M global registers and the S in-block registers; the S global registers correspond one-to-one to the S in-block registers.
  • the first processing module is specifically used to establish a mapping relationship between the P in-block registers and the P global registers based on the order of the P in-block registers corresponding to the input data of each of the K in-block instructions in the N in-block registers and the indication order of the P global registers in the block header;
  • the third processing module is specifically used to establish a mapping relationship between the S in-block registers and the S global registers based on the order of the S in-block registers in the N in-block registers and the indication order of the S global registers in the block header.
  • the register mapping table also includes a mapping relationship between the S global registers and the S in-block registers; the device also includes: a fourth processing module, which is used to output the execution results of one or more of the K in-block instructions to the corresponding global registers in the S global registers based on the register mapping table, and submit the block instruction.
  • an embodiment of the present application provides a computer-readable storage medium for storing computer software instructions used by an apparatus for implementing a register parameter passing method for a block instruction provided by one or more of the second aspects above, which includes a program designed for executing the above aspects.
  • an embodiment of the present application provides a computer program, which includes instructions.
  • When the computer program is executed by a computer, the computer can execute the process performed by an apparatus for implementing a register parameter passing method for a block instruction provided by one or more of the second aspects above.
  • an embodiment of the present application provides a terminal device, the terminal device includes a processor, and the processor is configured to support the terminal device to implement the corresponding functions in the register parameter transfer method of the block instruction provided in the first aspect.
  • the terminal device may also include a memory, the memory is used to couple with the processor, and the memory stores the necessary program instructions and data of the terminal device.
  • the terminal device may also include a communication interface for the terminal device to communicate with other devices or a communication network.
  • an embodiment of the present application provides a chip system, which includes a processor for supporting a device to implement the functions involved in the first aspect, for example, generating or processing information involved in the register parameter transfer method of the block instruction.
  • the chip system also includes a memory, which is used to store program instructions and data necessary for the device.
  • the chip system can be composed of a chip, or it can include a chip and other discrete devices.
  • FIG. 1 is a schematic diagram of an execution queue for speculative execution.
  • FIG. 2 is a schematic diagram of a system architecture for applying a register parameter transfer method for a block instruction provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a block header structure provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the structure of a block provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of a method for transferring register parameters of a block instruction provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an intra-block register management queue provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the architecture of the register parameter transfer process of the block instruction provided by the present application.
  • FIG. 8 is a schematic diagram of the structure of a register parameter transfer device for a block instruction provided in an embodiment of the present application.
  • a component can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, an execution thread, a program and/or a computer.
  • both an application running on a computing device and the computing device itself can be components.
  • One or more components may reside in a process and/or an execution thread, and a component may be located on a computer and/or distributed between two or more computers.
  • these components may be executed from various computer-readable media having various data structures stored thereon.
  • Components may, for example, communicate through local and/or remote processes, such as according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet with other systems by way of the signal).
  • a superscalar processor is a processor that can run multiple instructions in one clock cycle.
  • One implementation method of a superscalar processor is to perform predictive execution to improve the execution efficiency of threads. See Figure 1, which is a schematic diagram of an execution queue for predictive execution, wherein the execution queue may include instruction records that have been executed (abandoned), instruction records that are being executed, and instruction records to be executed.
  • a superscalar processor can perform predictive execution from instruction records to be executed, but if it is determined that an instruction record is invalid, it can only delete the instruction record and the instruction records after it, and refill the invalid place with valid records. Therefore, in order to ensure the accuracy of predictive execution, a superscalar processor needs to pay a higher cost to manage a longer queue, so as to see the execution branch characteristics of a longer distance.
  • A block instruction includes a block header and a block body.
  • the block header is generally an unconditional block description, which is used to describe the dependencies of the entire block, such as the attributes and types of block instructions, register inputs, register outputs, pointers to the next block header, pointers to the current block body, etc.
  • the block body is the actual instruction execution, which expresses the specific calculation of the block instruction.
  • the block body includes a series of intra-block instructions, and the block body is executed by the block engine in the processor.
  • plural means two or more.
  • “and/or” describes the association relationship of related objects, indicating that three relationships can exist.
  • A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • the character "/” generally indicates that the related objects are in an "or” relationship.
  • the schemes for transferring register parameters between the block header layer and the block body layer include solution 1 and solution 2:
  • Solution 1: direct indexing, that is, the block body level can directly use the registers at the block header level.
  • the specific registers to be used are first declared in the block header description (for example, the 32 general registers r0-r31 commonly used in traditional reduced instruction set computer central processing units (RISC CPUs), also called global registers), and then these registers can be directly indexed in the specific instructions within the block.
  • RISC CPUs reduced instruction set computer central processing units
  • In solution 1, global register consumption is high, execution efficiency is low, and block code reusability is poor.
  • the direct indexing method requires each block instruction to occupy the global register, which will cause a lot of resource waste.
  • In actual execution, the block body generally does not use all of the global registers (for example, 32); for example, only 8 of them may be used, but these 8 global registers still use global addresses (that is, each register still needs 5 bits to store the address), which causes resource waste.
  • Using direct indexing of global registers may also reduce execution efficiency. For example, when a calculation result inside the block body needs to be temporarily saved without affecting the output, a global register still has to be used to hold this temporary result (which is also a waste of resources). Moreover, because the processor cannot determine whether this result is the final result of the block instruction, it cannot treat the block instruction as completed, so execution of the next block instruction cannot be speculated, which reduces execution efficiency.
  • In addition, this method generally requires direct reference to global registers in each instruction within the block, which is not conducive to the reuse of block code. For example, when the modes of two or more block instructions are the same (such as register addition) but the block instructions use different global registers, the code needs to be written twice or more, so code consumption naturally increases.
  • Solution 2: use output registers by default. This method mainly uses the private registers inside the block instruction as the output registers of each instruction in the block, reducing the use of global registers.
  • each instruction in the block can only use the global register as the input register, and there is no need to specify the global register as the output register.
  • the internal private registers (intra-block registers) are used to temporarily save the calculation results of each instruction in the block. When other instructions in the block need to reference these calculation results, they only need to use the offset between the current intra-block instruction and the corresponding intra-block instruction to reference the intra-block registers holding these results.
  • Solution 2 has a higher execution efficiency.
  • However, in solution 2, the instructions take up a large space, and the block code has poor reusability.
  • additional instructions such as get and set are required so that the processor's block engine can index and assign values to global registers.
  • the additional get and set instructions require additional space, making the instructions take up a large space.
  • a mapping relationship between global registers and registers within a block is established through a register mapping table.
  • a register mapping table can be established for the global register indicated by the block header and the registers within the block preset inside the block, so that the global register can be indexed and output through the registers within the block, and the instructions within the block of the block can no longer rely on instructions such as get and set to index and assign values to the global register, thereby improving execution efficiency and reducing the space occupied by the block using instructions such as get and set.
  • the global registers can no longer be used directly in the instructions within the block instructions, thereby improving the reusability of the block code, that is, when the modes of two or more block instructions are the same, but the global registers they use are different, the block code does not need to be rewritten and can be directly reused.
  • FIG2 is a schematic diagram of the system architecture to which the register parameter transfer method of the block instruction provided in the embodiment of the present application is applied.
  • the system architecture can be a schematic diagram of the architecture of a central processing unit (CPU), or can also be the architecture of a processor such as a graphics processing unit (GPU) or a system on chip (SOC).
  • the system architecture can include one or more block header instruction fetch and decode logic units [101] and one or more block engines [102].
  • a conventional processor generally includes an instruction fetch and decode logic unit and a computing unit, and schedules a single instruction.
  • the processor in the embodiment of the present application generally schedules and executes the block instruction through the block engine [102], that is, in the processor architecture of the embodiment of the present application, each member in the execution queue no longer describes an instruction, but a group of instructions.
  • each block engine [102] is equivalent to a small traditional processor, which can use its own internal execution system and has no relationship with other blocks, so the implementation is simpler and it is also easier to expand the block processor by increasing the number and type of block engines without significantly increasing the complexity.
  • the block header instruction fetching and decoding logic unit [101] can be used to fetch and decode one or more block headers in the block header sequence, and then the block header can be put into the block header execution queue for management.
  • the processor can determine the global register specifically used by the block instruction corresponding to the block header when the block header instruction fetching and decoding logic unit [101] fetches and decodes the block header, so that after the processor determines the block register within the block, a register mapping table of the global register and the block register can be established, so that the block register can be directly indexed and output to the global register, and the block instruction of the block can no longer rely on get and set instructions to index and assign values to the global register, thereby improving the execution efficiency and reducing the space occupied by the block using get and set instructions.
  • the global register can no longer be directly used in the block instruction of the block instruction, thereby improving the reusability of the block code.
  • the block engine [102] is equivalent to a small traditional processor, and may include an instruction fetch unit, a decoding unit, and a computing unit.
  • the block engine [102] may be used to fetch, decode, and perform calculations on the block instructions in the block instruction sequence.
  • the block engine may determine the block of block instructions to be processed according to the block header execution queue, and fetch, decode, and perform calculations on one or more block instructions in the block according to the block instruction sequence.
  • the block engine may determine the mapping relationship between the global register and the block register according to the register mapping table, and may directly index and output to the global register using the block register, so that when actually writing the block instructions of the block, it is no longer necessary to rely on get and set instructions to index and assign values to the global register, thereby reducing the space occupied by the block using get and set instructions.
  • the global register may no longer be directly used in the block instructions of the block instructions, thereby improving the reusability of the block code.
  • the embodiments of the present application can be applied to the processor system architecture of various computers.
  • the processor architecture of the computer in the above figure is only an exemplary implementation in the embodiments of the present application.
  • the architecture applicable to the embodiments of the present application includes but is not limited to the above architecture.
  • the processor architecture of the computer may have more or fewer units/modules than those shown in the figure, may combine two or more units/modules, or may have different unit/module configurations.
  • the various units/modules shown in the figure can be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application-specific integrated circuits.
  • a terminal device refers to a device that can be abstracted as a computer system, wherein a terminal device that supports the block instruction processing function may also be referred to as a block instruction processing device.
  • the block instruction processing device may be a complete machine of the terminal device, such as: a smart wearable device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a vehicle-mounted computer or a server, etc.; it may also be a system/device composed of multiple complete machines; it may also be a part of the device in the terminal device, such as: a chip related to the block instruction processing function, such as a processor based on block instructions, a system chip (system on a chip, SoC), etc., which is not specifically limited in the embodiment of the present application. Among them, the system chip is also called a system on chip.
  • Figure 3 is a schematic diagram of a block header structure provided by an embodiment of the present application, wherein a block header can correspond to a block body, and when the modes of block instructions are the same, different block headers can share the same block body (such as block header 0 and block header 1 share block body 0, block header 2 and block header 3 share block body 2), that is, block body code reuse is realized; further, block body code reuse can also be performed according to the code of instructions in some blocks in the entire block body.
  • the block header of a block instruction can be a fixed-length RISC instruction, such as a 128-bit RISC instruction, which can include the type and attributes of the block instruction, the global input register bit mask (Bitmask), the global output register Bitmask, the jump pointer offset and the block pointer offset and other information, as follows.
  • Type of the block instruction: this can include calculation types such as fixed-point, floating-point, custom block, and accelerator call.
  • Attributes of the block instruction: these may include the submission strategy of the block instruction, such as the number of times the block instruction is repeated, and whether the execution of the block instruction must be atomic/visible/ordered.
  • Global input register Bitmask: can be used to indicate the global input registers of a block instruction.
  • A block instruction can generally have up to 32 registers as global input registers; the identifiers (IDs) of these 32 input registers can be expressed in the Bitmask format.
  • Global output register Bitmask: can be used to indicate the global output registers of a block instruction.
  • A block instruction can generally have up to 32 registers as global output registers; the IDs of these 32 output registers can be expressed in the Bitmask format.
  • R0-R31 are all physical registers (i.e. general registers), that is, registers shared between block instructions. If a block instruction has exactly 32 inputs, then every bit in its 32-bit global input register bit mask is 1 (it can also be represented by "0").
  • If a block instruction uses, for example, 19 general registers, the corresponding 19 bits in its 32-bit global register bit mask are 1 (for example, for the bit mask 11110000110000011101111001011111, the registers indicated are the 19 general registers R0, R1, R2, R3, R4, R6, R9, R10, R11, R12, R14, R15, R16, R22, R23, R28, R29, R30 and R31); that is, for the general registers used as the input/output of the block instruction, the corresponding bits are 1 (they can also be represented by "0").
  • each block instruction generally has at most 16 global input/output registers, and the global input/output registers of each block instruction can be indicated by a 16-bit bit mask. It can be understood that when there are more general registers in the CPU (for example, 64), the global input/output registers of each block instruction can also be 64. The embodiments of the present application do not specifically limit this.
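  • As an illustration of the Bitmask encoding described above (a sketch under the assumption that bit i of the mask corresponds to general register Ri, which is consistent with the example mask given above), a 32-bit mask can be decoded into a register list as follows; 0xF0C1DE5F is the hexadecimal form of the example binary mask:

      #include <stdio.h>
      #include <stdint.h>

      /* Collect the IDs of the general registers whose bits are set. */
      static int decode_bitmask(uint32_t mask, int regs[32])
      {
          int count = 0;
          for (int i = 0; i < 32; i++)
              if (mask & (UINT32_C(1) << i))
                  regs[count++] = i;   /* bit i <-> register Ri (assumed) */
          return count;
      }

      int main(void)
      {
          uint32_t mask = 0xF0C1DE5F; /* 0b11110000110000011101111001011111 */
          int regs[32];
          int n = decode_bitmask(mask, regs);
          printf("%d registers:", n);  /* prints "19 registers: R0 R1 ... R31" */
          for (int i = 0; i < n; i++)
              printf(" R%d", regs[i]);
          printf("\n");
          return 0;
      }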
  • Jump pointer offset can be used to indicate the storage location of the block header of the next block instruction to be executed after the current block instruction is executed.
  • Block pointer offset can be used to indicate the storage location of the block of the current block instruction.
  • the current block pointer can be obtained by adding the block pointer offset to the previous block pointer, that is, the storage location of the current block can be obtained.
  • FIG. 4 is a schematic diagram of a block body structure provided in an embodiment of the present application, wherein the block body may include a series of intra-block instructions, such as intra-block instruction 1, intra-block instruction 2, intra-block instruction 3, ..., and intra-block instruction K, which are executed in sequence.
  • An intra-block instruction may also be referred to as a block micro-instruction.
  • the intra-block instruction can be a 16-bit fixed-length RISC instruction.
  • each intra-block instruction in the block can be an instruction including at most two inputs and one output.
  • the input of a block instruction can be read from the output register of other block instructions through the get instruction in the block body.
  • the first block instruction (get R0) in FIG4 reads the data written to the general register R0 by other block instructions
  • the second block instruction (get R1) in FIG4 reads the data written to the general register R1 by other block instructions
  • the fourth block instruction (get R2) in FIG4 reads the data written to the general register R2 by other block instructions.
  • By expressing the mapping relationship between the global registers (general registers) indicated by the block header and the intra-block registers, the use of get instructions can be reduced, thereby reducing the space occupied by get instructions.
  • the output of a block instruction can be written to the corresponding output register through the set instruction in the block body, that is, written back to one or more of the above 32 physical registers.
  • Similarly, by expressing the mapping relationship between the global registers (general registers) indicated by the block header and the intra-block registers, the use of set instructions can be reduced, thereby reducing the space occupied by set instructions.
  • A block instruction can access the global registers (general registers) through the register mapping table, and can also access the general registers through explicit get/set instructions among its intra-block instructions, to obtain the output of other block instructions or to output its own execution result.
  • the register parameter transfer method of the block instruction does not affect the execution unit of the processor by renaming the register, and can directly access the corresponding global register according to the register mapping table without copying the value of the global register.
  • the inputs of other intra-block instructions in the block body can mostly come from the outputs of the previous intra-block instructions in the current block body.
  • the execution result of each intra-block instruction in the block body can be implicitly written to its corresponding intra-block register (i.e., intra-block output register).
  • the current intra-block instruction can use the relative distance between instructions to index the execution result of the previous intra-block instruction temporarily stored in the intra-block register, and use this as input.
  • each intra-block instruction does not need to express its own output register, that is, each intra-block instruction can only express its own opcode and at most two input registers (i.e., at most two relative distances).
  • Taking a 16-bit intra-block instruction as an example, 10 bits (bits 0-9) can be used to express the opcode, and two 3-bit fields can be used to express the two input registers (i.e., two relative distances).
  • In this case the relative distance is limited to 1-8; that is, the current intra-block instruction can only index, as input, the execution results of the 1st to 8th intra-block instructions before it.
  • the two inputs of the third block instruction (add 1 2) in Figure 4 come from the execution results of two instructions (i.e., the 1st instruction and the 2nd block instruction) whose relative distances to the third block instruction are 1 and 2 respectively.
  • If the current intra-block instruction has only one input, the remaining 6 bits after the opcode can all be used to express the single input register.
  • In that case the relative distance is limited to 1-64; that is, the current intra-block instruction can only index, as input, the execution results of the 1st to 64th intra-block instructions before it.
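  • Under the 16-bit layout described above, decoding an intra-block instruction could look like the following sketch; the exact positions of the two 3-bit fields and the d-1 coding of a relative distance d in 1-8 are assumptions, since the text only gives the field widths and the distance range:

      #include <stdio.h>
      #include <stdint.h>

      struct intra_block_insn {
          unsigned opcode;  /* 10-bit opcode (bits 0-9) */
          unsigned dist_a;  /* relative distance of the first input, 1..8 */
          unsigned dist_b;  /* relative distance of the second input, 1..8 */
      };

      static struct intra_block_insn decode16(uint16_t raw)
      {
          struct intra_block_insn insn;
          insn.opcode = raw & 0x3FFu;              /* bits 0-9 */
          insn.dist_a = ((raw >> 10) & 0x7u) + 1;  /* bits 10-12, assumed d-1 coding */
          insn.dist_b = ((raw >> 13) & 0x7u) + 1;  /* bits 13-15, assumed d-1 coding */
          return insn;
      }

      int main(void)
      {
          /* Hypothetical encoding of an "add" with inputs at distances 1 and 2
           * and a made-up opcode 0x001. */
          uint16_t raw = (uint16_t)((1u << 13) | (0u << 10) | 0x001u);
          struct intra_block_insn insn = decode16(raw);
          printf("opcode=0x%03X, input distances %u and %u\n",
                 insn.opcode, insn.dist_a, insn.dist_b);
          return 0;
      }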
  • the assembly of the block instructions may be as follows:
  • T#1 represents the intra-block register corresponding to the intra-block instruction whose relative distance to the current intra-block instruction (add T#1 T#2) is 1, and the execution result of the intra-block instruction get R0 is stored in the intra-block register
  • T#2 represents the intra-block register corresponding to the intra-block instruction whose relative distance to the current intra-block instruction (add T#1 T#2) is 2, and the execution result of the intra-block instruction get R1 is stored in the intra-block register.
  • the actual calculation operation performed by the third intra-block instruction (add T#1 T#2) is to add the data in the global registers R0 and R1.
  • Figure 3 illustrates the register parameter passing method of the block instruction provided by this application.
  • FIG. 5 is a flow chart of a register parameter passing method for a block instruction provided in an embodiment of the present application.
  • the method can be applied to the processor architecture based on block instructions described in FIG. 2 , that is, the processor based on block instructions can be used to support and execute the method flow steps S500 to S502 shown in FIG. 5 , wherein steps S500 to S502 include the following:
  • Step S500: Establish a mapping relationship for P global registers among the M global registers and P intra-block registers among the N intra-block registers to obtain a register mapping table.
  • the block instructions processed by the block instruction-based processor may include a block header and a block body, wherein the block header may indicate one or more global registers (i.e., M global registers, which may include a global output register and a global input register) specifically used by the block instruction, the block body may include one or more intra-block instructions (i.e., K intra-block instructions), and one or more intra-block registers (i.e., N intra-block registers, which may include an intra-block output register and an intra-block input register) are set by default, the execution result of each intra-block instruction in the block body may be output to the corresponding intra-block register (in this case, the intra-block output register) for temporary storage, that is, the intra-block register in the block body is used to store the execution results output by the corresponding intra-block instructions in the K intra-block instructions of the block body, and M, N, and K are all integers greater than 0.
  • the processor based on block instructions can establish a register mapping table based on P global registers among the M global registers indicated by the block header and P intra-block registers among the N intra-block registers corresponding to the block body.
  • the above P global registers correspond one-to-one to the P intra-block registers, that is, the register mapping table can include the input index relationship between the global input registers among the M global registers (that is, the P global registers) and the intra-block input registers among the N intra-block registers (that is, the P intra-block registers), so that the block engine [102] can access the global registers from the intra-block registers through the register mapping table when executing the intra-block instructions.
  • a processor based on block instructions may also establish a mapping relationship for some of the M global registers and the N in-block registers, wherein the number of some of the N in-block registers may be determined based on the formula (1, 2, ..., MIN(M, N)) (i.e., P may take any integer value from 1 to MIN(M, N)), wherein 1 indicates that the serial number of the in-block register starts from 1. It is understandable that the serial number of the in-block register may also start from other values, and the formula may be modified accordingly, and no specific limitation is made here.
  • That is, a mapping relationship may be established not between all global input registers and in-block registers, but only between a part of them. It is understandable that the mapping relationship between the global output registers and the in-block registers may also be established with reference to this method, and this is not expanded on here.
  • a processor based on block instructions may first determine the block input registers (i.e., P block registers) corresponding to the input data of each of the K block instructions based on the block output registers and input information corresponding to the execution results of each of the K block instructions included in the block; wherein the block input registers and block output registers are both registers among the default N block registers in the block.
  • the block input registers and block output registers are both registers among the default N block registers in the block.
  • the processor based on block instructions may establish a mapping relationship for the block input registers and block output registers corresponding to each of the K block instructions, and one or more global registers (i.e., M global registers, which may include global input registers and global output registers) indicated in the block header, thereby obtaining a register mapping table, so that the block engine [102] can access the global registers from the block registers through the register mapping table when executing the block instructions, that is, it can obtain input by indexing the global registers, and output to the global registers.
  • the one or more global registers (i.e., M global registers) indicated in the block header may include one or more global output registers (i.e., S global registers), and one or more global input registers (i.e., P global registers);
  • the register mapping table established by the processor based on the block instruction may include the output index relationship between the one or more global output registers and one or more in-block output registers (i.e., S in-block registers) of the N in-block registers corresponding to the block body, and the input index relationship between the one or more global input registers and one or more in-block input registers (i.e., P in-block registers) of the N in-block registers.
  • When executing an in-block instruction, the block engine [102] can access the global input register from the in-block input register through the input index relationship in the register mapping table, thereby obtaining the input of the in-block instruction, and can access the global output register from the in-block output register through the output index relationship in the register mapping table, thereby outputting the execution result of the in-block instruction.
  • the block engine [102] can establish a mapping relationship between global input registers and block input registers according to the indication order of one or more global input registers (i.e., P global registers) indicated by the block header, and the order of the block input registers corresponding to the input data of each of the K block instructions in the N block registers; and can establish a mapping relationship between global output registers and block output registers according to the indication order of one or more global output registers (i.e., S global registers) indicated by the block header, and the order of the block output registers corresponding to the execution results of each of the K block instructions in the N block registers.
  • the block engine [102] can access the global input registers from the block input registers through the register mapping table, thereby obtaining the data of the global input registers and using them as the input of the block instruction, and can access the global output registers from the block output registers through the register mapping table, thereby outputting the execution results of the block instruction to the corresponding global output registers.
  • Step S501: Determine that the input data of the i-th intra-block instruction among the K intra-block instructions comes from the j-th intra-block register among the P intra-block registers.
  • The block engine [102] included in the block-instruction-based processor can determine that the input data of a certain intra-block instruction (i.e., the i-th intra-block instruction) among the one or more intra-block instructions (i.e., the K intra-block instructions) in the block body comes from a certain intra-block register (i.e., the j-th intra-block register) among the one or more intra-block registers (i.e., the P intra-block registers).
  • the block engine [102] may determine the j-th intra-block register from which its input data comes based on the input information of the i-th intra-block instruction, wherein the input information includes the relative distance between the intra-block register corresponding to the execution result of the i-th intra-block instruction and the j-th intra-block register, and the intra-block register corresponding to the execution result of the i-th intra-block instruction is the t-th intra-block register; when t is greater than or equal to j, the relative distance may be t-j; when t is less than j, the relative distance may be N-j+t.
  • When the block engine [102] performs instruction fetching and decoding on the i-th intra-block instruction, on the basis of determining the intra-block register corresponding to the execution result of the i-th intra-block instruction, it may determine, through the input information of the i-th intra-block instruction, that the input data of the i-th intra-block instruction comes from the j-th intra-block register among the N intra-block registers. For example, if the relative distance indicated by the input information is 3, the block engine [102] can find the intra-block register at a relative distance of 3, that is, the j-th intra-block register, by shifting forward from the intra-block register corresponding to the execution result of the i-th intra-block instruction. It can be understood that the relative distance indicated in the input information can be any integer greater than 0, that is, it can be a value other than 3, and is not specifically limited here.
  • a circular queue can be used to manage the one or more block registers.
  • FIG6 is a schematic diagram of a block register management queue provided in an embodiment of the present application, wherein the block engine [102] can manage the default N block registers of the block instruction through the circular queue.
  • the execution result of the first block instruction can be output to the first block register (set to 0)
  • the execution result of the second block instruction can be output to the second block register (set to 1), and so on.
  • the execution result of the N+1th block instruction can be re-output to the block register 0.
  • each subsequent block instruction will always update the last block register in order and abandon the value of the oldest block register. It can be understood that the execution result of the N+1th block instruction will overwrite the execution result of the original first block instruction output to the first block register (set to 0).
  • When the n-th intra-block instruction in the block body wants to index an intra-block register through input information (for example, T#m), this can be achieved by indexing the intra-block register with the index subscript (n-m)%N (that is, the j-th intra-block register), where m is the relative distance between the intra-block register to which the execution result of the n-th intra-block instruction is output by default and the intra-block register it wants to index, and N is the default number of intra-block registers of the block instruction.
  • the execution result of the instruction in the block can also overwrite the input mapping relationship between the register in the block and the global register.
  • For example, when an intra-block instruction is executed, the j-th intra-block register can be indexed, and then the corresponding global input register (such as r1) can be accessed based on the mapping relationship. If the execution result of a subsequent intra-block instruction, the (i+x)-th intra-block instruction, is output to the j-th intra-block register by default, that execution result can overwrite the mapping relationship between the j-th intra-block register and the global input register (such as r1). That is to say, when an intra-block instruction after the (i+x)-th intra-block instruction needs to index the j-th intra-block register, what it obtains is the execution result of the (i+x)-th intra-block instruction, instead of accessing the global input register (such as r1).
  • the syntax of T#m is generally limited by the total number of registers in the block, and the block instructions using T#m generally cannot access the output registers in the block that exceed the total range. For example, there are a total of 8 registers in the block, and a certain block instruction cannot generally index the execution result of the ninth block instruction before the instruction using T#m. If you want to cross this range, you can continue through the middle block instruction, or write to the global register resource. Understandably, the execution result of the first block instruction can be output to the first block register (set to 0) or to other block registers through the circular queue. For the convenience of the embodiment of the present application, the output to the first block register (set to 0) is taken as an example, and no specific limitation is made here.
  • The embodiment of the present application can establish a mapping relationship between the global input register and the intra-block register at the tail of the circular queue, so that an intra-block instruction at the front of the block body can access the global input register through T#m by indexing the intra-block register with the subscript (n-m)%N.
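  • A small sketch of the circular indexing described above, assuming 0-based instruction numbering so that the n-th intra-block instruction writes its result to slot n % N and an operand written as T#m is read from slot (n-m) % N; the extra "+ N" only keeps the C remainder non-negative in the wrap-around case:

      #include <stdio.h>

      #define N_INTRA_REGS 8   /* N: number of intra-block registers (assumed) */

      static int result_slot(int n)
      {
          return n % N_INTRA_REGS;
      }

      static int operand_slot(int n, int m)  /* m: relative distance T#m */
      {
          return ((n - m) % N_INTRA_REGS + N_INTRA_REGS) % N_INTRA_REGS;
      }

      int main(void)
      {
          /* Instruction 10 writes slot 2; its operand T#3 is read from slot 7,
           * i.e. the reference wraps around the circular queue. */
          printf("result slot: %d\n", result_slot(10));       /* 2 */
          printf("operand slot: %d\n", operand_slot(10, 3));  /* 7 */
          return 0;
      }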
  • Step S502: Based on the register mapping table, determine the target register corresponding to the j-th intra-block register in the P global registers, obtain the input data from the target register, and execute the i-th intra-block instruction.
  • the block engine [102] included in the block instruction processor can determine the target global register corresponding to the j-th intra-block register based on the mapping relationship in the register mapping table, and then the block engine [102] can access the target global register and obtain the target data from the target global register, and execute the i-th intra-block instruction with the target data as input.
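  • A hedged sketch of this step; the table layout and the read_global() helper are illustrative assumptions, not the patent's implementation. The j-th intra-block register is looked up in the register mapping table, the mapped global register is read, and the value is used as the instruction's input:

      #include <stdio.h>

      #define MAP_SIZE 32

      struct reg_map {
          int intra[MAP_SIZE];     /* intra-block register index */
          int global_id[MAP_SIZE]; /* mapped global register ID */
          int count;
      };

      static long read_global(int global_id)  /* placeholder register-file read */
      {
          return 100 + global_id;
      }

      /* Returns the mapped global register ID, or -1 if the intra-block
       * register is not mapped (its value then stays inside the block). */
      static int lookup_global(const struct reg_map *t, int intra_reg)
      {
          for (int k = 0; k < t->count; k++)
              if (t->intra[k] == intra_reg)
                  return t->global_id[k];
          return -1;
      }

      int main(void)
      {
          struct reg_map t = { .intra = { 6, 7 }, .global_id = { 1, 2 }, .count = 2 };
          int target = lookup_global(&t, 7);   /* the j-th intra-block register */
          if (target >= 0)
              printf("input data %ld taken from r%d\n", read_global(target), target);
          return 0;
      }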
  • In a possible implementation, in addition to the above-mentioned steps S500-S502, the process of the register parameter passing method of the block instruction provided in the embodiment of the present application also includes step S503.
  • Step S503 Based on the register mapping table, the execution results of one or more of the K block instructions are output to the corresponding global registers in the S global registers, and the block instruction is submitted.
  • Specifically, the register mapping table also includes the mapping relationship between the S global registers and the S intra-block registers. After executing the K intra-block instructions of the block instruction, the block engine [102] can, through the register mapping table, output the execution results of the block body to the corresponding global output registers among the S global registers (i.e., the global output registers) indicated by the block header, and submit the block instruction.
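  • As a minimal sketch, assuming the type and function names below (the text does not prescribe them), step S503 amounts to walking the output half of the register mapping table, copying each intra-block result into the global output register it is mapped to, and then submitting the block instruction:

```c
#include <stdint.h>

typedef struct {
    uint8_t block_reg;    /* subscript of the intra-block output register */
    uint8_t global_reg;   /* ID of the global output register indicated by the block header */
} out_map_t;

/* Step S503 (sketch): output the block body's results through the S output mappings. */
static void output_and_submit(const out_map_t *map, unsigned s,
                              const uint64_t *block_regs, uint64_t *global_regs)
{
    for (unsigned k = 0; k < s; k++)
        global_regs[map[k].global_reg] = block_regs[map[k].block_reg];
    /* submit_block_instruction();  -- hypothetical hook representing the commit itself */
}
```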
  • Figure 7 is a schematic diagram of the architecture of the register parameter transfer flow for a block instruction provided in the present application. The number of global registers in the global register file and the number of intra-block registers in the block register file can be preset in advance by the block-instruction-based processor.
  • Optionally, when fetching and decoding the block header of a block instruction, the block-instruction-based processor can determine, from the block header descriptor, the global input registers indicated by the block header (for example, x, y, z) and the global output registers (for example, a, b, c), and establish mapping relationships between the global registers (including the global input registers and global output registers) and the intra-block registers (including the intra-block input registers and intra-block output registers), thereby obtaining the register mapping table.
  • The input register mapping includes the mapping between the global input registers and the registers at the tail of the intra-block register management queue. When an intra-block instruction (instruction 1) in the block body executes, it can index the global input register (for example, register x) through the register at the tail of the intra-block register management queue (intra-block register N-2), and after execution its result can be output by default to the register at the head of the intra-block register management queue (intra-block register 0). When the subsequent intra-block instruction (instruction 2) executes, it is indexed with reference to the above process: it indexes the global input register (y) through intra-block register N-1, and its execution result is output to intra-block register 1 by default. The execution results of the block instruction are then mapped to the global output registers according to the register mapping table, where the output register mapping includes the mapping relationship between the global output registers and the intra-block registers to which the intra-block instructions output by default.
  • It should be noted that, in addition to the mapping relationships between the global registers and the intra-block registers, the register mapping table can also include other information, such as the registers' flag (flags) information and attribute (ref) information.
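  • Purely to make the shape of such a table concrete, one possible C layout is sketched below; the flags and ref fields stand for the flag and attribute information just mentioned, but their encoding and the entry layout are assumptions rather than anything this text defines:

```c
#include <stdint.h>

typedef enum { MAP_INPUT, MAP_OUTPUT } map_kind_t;

typedef struct {
    map_kind_t kind;        /* input mapping (queue tail -> global input) or output mapping */
    uint8_t    block_reg;   /* intra-block register subscript, 0 .. N-1 */
    uint8_t    global_reg;  /* global register ID indicated by the block header (e.g. x, y, z, a, b, c) */
    uint8_t    flags;       /* flag information, e.g. "already written back" -- assumed encoding */
    uint8_t    ref;         /* attribute (ref) information -- meaning assumed */
} reg_map_entry_t;

typedef struct {
    reg_map_entry_t entries[32];  /* at most one entry per header-indicated global register */
    uint8_t         count;
} reg_map_table_t;
```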
  • To compare the approaches, consider two block instructions that both perform register addition: the values of the four input registers are added and the sum is output to the first input register.
  • The implementation of prior art 1 is direct indexing, and its assembly can be as follows: two block instructions (Block1 and Block2) are defined. The modes of these two blocks are the same, each indicating four global input registers and one global output register. The global input registers of Block1 are r1, r2, r3, r4 and its global output register is r1; the global input registers of Block2 are r5, r6, r7, r8 and its global output register is r5.
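  • The original assembly listing is not reproduced in this text; as a loose, hypothetical C-style rendering of what direct indexing implies, each block body names its global registers explicitly, so Block1 and Block2 need two separately written bodies even though they perform the same computation:

```c
#include <stdint.h>

/* Hypothetical rendering of prior art 1 (direct indexing); r[1]..r[8] stand in for
 * the global registers r1..r8. */
static void block1_body(uint64_t r[32]) { r[1] = r[1] + r[2] + r[3] + r[4]; }  /* inputs r1-r4, output r1 */
static void block2_body(uint64_t r[32]) { r[5] = r[5] + r[6] + r[7] + r[8]; }  /* same pattern, rewritten for r5-r8 */
```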
  • The implementation of prior art 2 is to use output registers by default, and its assembly can be as follows: two block instructions (Block1 and Block2) are defined. The modes of these two blocks are the same, each indicating four global input registers and one global output register. The global input registers of Block1 are r1, r2, r3, r4 and its global output register is r1; the global input registers of Block2 are r5, r6, r7, r8 and its global output register is r5.
  • In this example, each intra-block instruction has only input registers and does not need to specify an output register. When other intra-block instructions reference these outputs, they only need the offset between the referencing intra-block instruction and the intra-block instruction corresponding to the output. The extra get and set instructions are only an index for the block engine: they do not need to be executed themselves and do not affect execution efficiency, but they do occupy a certain amount of extra space.
  • Based on the register parameter passing method for a block instruction provided in the embodiments of the present application, the assembly can be as follows: two block instructions (Block1 and Block1) are defined. These are two different Block1s; their modes are the same, each indicating four global input registers and one global output register, and their block headers are different while their block bodies are the same (reused). The global input registers of the first Block1 are r1, r2, r3, r4 and its global output register is r1; the global input registers of the second Block1 are r5, r6, r7, r8 and its global output register is r5.
  • In this example, the first intra-block instruction (i.e., the i-th intra-block instruction) references t#3 and t#4, which is meaningless from the perspective of instruction distance alone. From the perspective of the block header inputs, however, the block engine [102] regards the first intra-block instruction as being preceded by four virtual get instructions for r1 to r4, and the register mapping table includes the mappings between the intra-block registers corresponding to these four virtual get instructions and the corresponding global registers. Therefore, the first intra-block instruction indexes the global registers r2 and r1 (i.e., the global registers corresponding to the j-th intra-block registers) through t#3 and t#4. Similarly, the second intra-block instruction (i.e., the i-th intra-block instruction), add t#2, t#3, indexes the two global registers r4 and r3 (i.e., the global registers corresponding to the j-th intra-block registers); the index arithmetic is sketched below.
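  • The index arithmetic of this example can be checked with a small sketch (0-based numbering and a queue of 8 intra-block registers are assumptions): the four virtual get instructions occupy slots 0-3 for r1-r4, the first real intra-block instruction writes slot 4, and the second writes slot 5:

```c
#include <assert.h>

#define N 8u   /* assumed number of intra-block registers */

/* slot read through t#m by the (0-based) n-th instruction, virtual gets included */
static unsigned in_slot(unsigned n, unsigned m) { return (n + N - m) % N; }

int main(void)
{
    /* virtual gets for r1..r4 occupy slots 0..3 */
    assert(in_slot(4, 3) == 1);   /* first instruction, t#3 -> slot 1 -> r2 */
    assert(in_slot(4, 4) == 0);   /* first instruction, t#4 -> slot 0 -> r1 */
    assert(in_slot(5, 2) == 3);   /* second instruction, t#2 -> slot 3 -> r4 */
    assert(in_slot(5, 3) == 2);   /* second instruction, t#3 -> slot 2 -> r3 */
    return 0;
}
```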
  • It should be noted that this approach can be extended further: get/set instructions can still be used in the assembly code, and they can reference these global registers by the subscripts of this engine instead of using the actual global register addresses for indexing.
  • Understandably, when the entire block instruction is finally submitted, the status of the global registers can be checked; if a certain global output register has not yet been written, it can be filled from the corresponding intra-block register according to the register mapping table.
  • In summary, compared with the prior-art schemes for passing register parameters between the block header layer and the block body layer, the embodiments of the present application establish a register mapping table for the global registers indicated by the block header and the intra-block registers preset inside the block body, so that global registers can be indexed and written through the intra-block registers. The intra-block instructions of the block body no longer need to rely on get and set instructions to index and assign values to the global registers, which improves execution efficiency and reduces the extra space that get and set instructions would occupy in the block body.
  • Moreover, global registers no longer need to be used directly in the intra-block instructions, which improves the reusability of the block body code: when the modes of two or more block instructions are the same but the global registers they use are different, the block body code does not need to be rewritten and can be reused directly.
  • FIG. 8 is a schematic diagram of the structure of a register parameter transfer device for a block instruction provided in an embodiment of the present application.
  • The register parameter transfer device 8 for a block instruction may include a first processing module 81, a determining module 82 and a second processing module 83, and may further include a third processing module 84 and a fourth processing module 85.
  • the detailed description of each unit is as follows:
  • A first processing module 81 is used to establish a mapping relationship between P global registers among the M global registers and P intra-block registers among the N intra-block registers to obtain a register mapping table; the P global registers correspond one-to-one to the P intra-block registers; M and N are both integers greater than 0;
  • The determining module 82 is used to determine that the input data of the i-th intra-block instruction among the K intra-block instructions comes from the j-th intra-block register among the P intra-block registers; i takes 1, 2, ..., K, and j takes 1, 2, ..., P.
  • the second processing module 83 is used to determine the target register corresponding to the j-th intra-block register among the P global registers based on the register mapping table, obtain the input data from the target register, and execute the i-th intra-block instruction.
  • the determining module 82 is specifically configured to:
  • the j-th in-block register is determined based on the input information in the i-th in-block instruction; the in-block register corresponding to the execution result of the i-th in-block instruction is the t-th in-block register; when t is greater than or equal to j, the input information includes the relative distance t-j between the t-th in-block register and the j-th in-block register; when t is less than j, the input information includes the relative distance N-j+t between the t-th in-block register and the j-th in-block register.
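  • As a one-line sketch of the rule just stated (1-based register subscripts, as in the text; the helper name is an assumption), the relative distance carried in the input information is:

```c
/* relative distance for output register t and input register j among n_regs intra-block registers */
static unsigned relative_distance(unsigned t, unsigned j, unsigned n_regs)
{
    return (t >= j) ? (t - j) : (n_regs - j + t);
}
```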
  • the first processing module 81 is specifically configured to:
  • determine, based on the intra-block registers corresponding to the respective execution results of the K intra-block instructions and the input information, the P intra-block registers corresponding to the respective input data of the K intra-block instructions, and establish a mapping relationship between the P intra-block registers and the P global registers.
  • the in-block registers corresponding to the execution results of the K in-block instructions include S in-block registers; and the device further includes:
  • the third processing module 84 is used to establish a mapping relationship between S global registers among the M global registers and the S intra-block registers; the S global registers correspond to the S intra-block registers in a one-to-one manner.
  • the first processing module 81 is specifically configured to: establish a mapping relationship between the P intra-block registers and the P global registers based on the order, among the N intra-block registers, of the P intra-block registers corresponding to the respective input data of the K intra-block instructions and the indication order of the P global registers in the block header;
  • the third processing module 84 is specifically used for:
  • a mapping relationship is established between the S in-block registers and the S global registers based on the order of the S in-block registers in the N in-block registers and the indication order of the S global registers in the block header.
  • the register mapping table further includes a mapping relationship between the S global registers and the S intra-block registers; and the device further includes:
  • The fourth processing module 85 is used to output, based on the register mapping table, the execution results of one or more of the K intra-block instructions to the corresponding global registers among the S global registers, and submit the block instruction.
  • the register parameter transfer device of the block instruction described in the present application is not limited thereto, and the register parameter transfer device of the block instruction can be located in any terminal device, such as a computer, a mobile phone, a tablet, a server and other types of devices.
  • the register parameter transfer device of the block instruction can specifically be a chip or a chipset or a circuit board equipped with a chip or a chipset.
  • the chip or chipset or the circuit board equipped with a chip or a chipset can work under the necessary software drive.
  • For example, the register parameter transfer device of the block instruction can be: (1) a stand-alone integrated circuit (IC), chip, chip system or subsystem; (2) a set having one or more ICs, where, optionally, the IC set may also include a storage component for storing data and computer programs; (3) a module that can be embedded in other devices; (4) others.
  • An embodiment of the present application further provides a computer-readable storage medium in which computer program code is stored; when the above-mentioned processor executes the computer program code, the computer is caused to perform the method in any of the foregoing embodiments.
  • the embodiment of the present application also provides a terminal device, which can exist in the form of a chip product, and the terminal device includes a processor, which is configured to support the terminal device to implement the corresponding functions of the method in any of the above embodiments.
  • the terminal device may also include a memory, which is coupled to the processor and stores the necessary program instructions and data of the terminal device.
  • the terminal device may also include a communication interface for the terminal device to communicate with other devices or communication networks.
  • An embodiment of the present application further provides a computer program product; when the computer program product runs on a computer, the computer is caused to perform the method in any of the foregoing embodiments.
  • the embodiment of the present application provides a chip system, which includes a processor for supporting a device to implement the functions involved in the first aspect, for example, generating or processing information involved in the register parameter transfer method of the block instruction.
  • the chip system also includes a memory, which is used to store program instructions and data necessary for the device.
  • The chip system may be composed of a chip, or may include a chip and other discrete devices.
  • the disclosed device can be implemented in other ways.
  • the device embodiments described above are only schematic, such as the division of the above-mentioned units, which is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual couplings or direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between devices or units may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that enable a computer device (which may be a personal computer, a server or a network device, etc., and specifically a processor in the computer device) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present application discloses a register parameter passing method for a block instruction and a related device. The method may include: establishing a register mapping table for P global registers among M global registers and P intra-block registers among N intra-block registers; determining that the input data of the i-th intra-block instruction among the K intra-block instructions included in the block body comes from the j-th intra-block register among the P intra-block registers, where i takes 1, 2, ..., K and j takes 1, 2, ..., P; and, based on the register mapping table, determining a target register corresponding to the j-th intra-block register among the P global registers, obtaining the input data from the target register, and executing the i-th intra-block instruction. The present application can improve execution efficiency and block body code reusability.

Description

一种块指令的寄存器参数传递方法和相关设备
本申请要求于2022年11月22日提交中国国家知识产权局、申请号为202211466485.4、申请名称为“一种块指令的寄存器参数传递方法和相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种块指令的寄存器参数传递方法和相关设备。
背景技术
随着计算机技术的不断发展,大型处理器一般通过增加越来越多的计算单元来提高执行效率,想要充分利用这些计算单元的能力,一种直接的方式是多线程、多核以及多处理器的设计,这种方式就要求系统软件把任务主动分解到不同的执行线程上,分不同的线程独立使用这些计算单元。
为了提高线程执行的粒度,现有技术实现了处理器对块指令的调度,即处理器能够对一组指令进行调度,而不是一条指令。处理器对于块指令的调度一般分为两层,第一层是块头,第二层是块体。处理器在执行块指令时,需要在块头一层和块体一层之间传递寄存器参数,例如在块头中声明具体使用哪些全局寄存器,然后在块体的具体块内指令中对这些全局寄存器进行直接索引。如此,虽然保证了块指令的正常执行,但容易导致块指令执行效率较低以及块体代码复用性较差等问题。
因此,如何提供一种可以提高执行效率以及块体代码复用性的寄存器参数传递方法,是亟待解决的问题。
发明内容
本申请实施例提供一种块指令的寄存器参数传递方法和相关设备,能够提高执行效率以及块体代码复用性。
第一方面,本申请实施例提供了一种块指令的寄存器参数传递方法,所述块指令包括块头和块体,所述块头用于指示所述块指令对应的M个全局寄存器,所述块体包括K条块内指令,所述K条块内指令对应N个块内寄存器,K为大于0的整数;所述方法,可包括:针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表;所述P个全局寄存器与所述P个块内寄存器一一对应;M、N均为大于0的整数;确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器;i取1、2、……、K,j取1、2、……、P;基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
现有技术中,一般是通过直接索引或者默认输出寄存器的方式,实现在块头一层与块体一层之间传递寄存器参数。但是,直接索引的方式由于需要在块体中直接索引对应全局寄存器的地址,每条块内指令的结果也需临时存放在全局寄存器中,不利于进行预测执行,执行效率较低;且块体的代码中直接索引全局寄存器,不利于块体代码复用。而默认输出寄存器的方式,由于需要额外使用get/set等指令,使得指令占用空间较大,且get/set等指令需要通过全局寄存器的全局ID对全局寄存器进行索引,同样不利于块体代码的复用。通过第一方面提供的方法,本申请实施例中可以为块头指示的全局寄存器和块体内部预设的块内寄存器建立寄存器映射表,使得块引擎可以基于块内寄存器索引和输出到全局寄存器,提高了执行效率,同时,块体的块内指令可以不再依赖get和set等指令对全局寄存器进行索引和赋值,从而节省了块体中get和set等块内指令额外占用的空间。进一步地,块指令的块内指令中也可以不再使用全局寄存器的ID,如此,当两个或多个块指令的模式相同,但它们使用的全局寄存器不相同时,该两个或多个块指令可以共用同一块体,使得块体代码可以不用再重复编写,从而提高了块体代码的复用性。
在一种可能的实现方式中,所述确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器,包括:基于所述第i条块内指令中的输入信息确定出所述第j个块内寄存器;所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离t-j;当t小于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离N-j+t。
本申请实施例中,块内指令的输入信息包括了当前块内指令的执行结果对应的块内寄存器和当前块内指令需要索引的块内寄存器之间的相对距离,使得基于块指令的处理器可以通过该相对距离确定出当前块内指令要索引的块内寄存器,简化了块内寄存器的索引逻辑,进一步提升执行效率。
在一种可能的实现方式中,所述针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,包括:基于所述K条块内指令各自的执行结果对应的块内寄存器以及输入信息,确定所述K条块内指令各自的输入数据对应的所述P个块内寄存器;针对所述P个块内寄存器和所述P个全局寄存器建立映射关系。
本申请实施例中,在建立寄存器映射表的过程中,首先确定出块体中块内指令的输入信息和该块内指令的执行结果所对应的块内输出寄存器,并据此确定出块内指令各自的输入数据对应的块内输入寄存器,再针对块内输入寄存器和全局寄存器建立寄存器映射表,保证块内指令在执行时可以索引到全局寄存器得到输入数据,在此基础上,块指令执行结束时也可以索引到全局寄存器并输出执行结果。
在一种可能的实现方式中,所述K条块内指令各自的执行结果对应的块内寄存器包括S个块内寄存器;所述方法,还包括:针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系;所述S个全局寄存器与所述S个块内寄存器一一对应。
本申请实施例中,在建立寄存器映射表的过程中,可以针对全局输出寄存器(即S个全局寄存器)和块内输出寄存器(即S个块内寄存器)建立输出索引关系,保证块指令执行结束时也可以索引到全局寄存器并输出执行结果。
在一种可能的实现方式中,所述针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,包括:基于所述K条块内指令各自的输入数据对应的所述P个块内寄存器在所述N个块内寄存器中的顺序和所述P个全局寄存器在所述块头中的指示顺序,为所述P个块内寄存器和所述P个全局寄存器建立映射关系;所述针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系,包括:基于所述S个块内寄存器在所述N个块内寄存器中的顺序和所述S个全局寄存器在所述块头中的指示顺序,为所述S个块内寄存器和所述S个全局寄存器建立映射关系。
本申请实施例中,在建立寄存器映射表的过程中,可以针对块内指令各自对应的块内输入寄存器(即P个块内寄存器)的在N个块内寄存器中顺序和全局输入寄存器(即P个全局寄存器)在块头中的指示顺序,为块内输入寄存器和全局输入寄存器建立映射关系,针对块内指令各自对应的块内输出寄存器(即S个块内寄存器)的顺序和全局输出寄存器(即S个全局寄存器)在块头中的指示顺序,为块内输出寄存器和全局输出寄存器建立映射关系,简化映射逻辑,降低寄存器映射表的维护难度。
在一种可能的实现方式中,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系;所述方法,还包括:基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。
本申请实施例中,在执行完块指令中的块内指令后,可以通过寄存器映射表将块指令中块体的执行结果分别输出至块头指示的全局输出寄存器(即S个全局寄存器),并将块指令提交,以便后续相应程序访问该寄存器获取该执行结果,从而保证基于块指令的处理器可以高效和准确地处理更多块指令。
第二方面,本申请实施例提供了一种块指令的寄存器参数传递装置,所述块指令包括块头和块体,所述块头用于指示所述块指令对应的M个全局寄存器,所述块体包括K条块内指令,所述K条块内指令对应N个块内寄存器,K为大于0的整数;所述装置,可包括:第一处理模块,用于针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表;所述P个全局寄存器与所述P个块内寄存器一一对应;M、N均为大于0的整数;确定模块,用于确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器;i取1、2、……、K,j取1、2、……、P;第二处理模块,用于基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
本申请实施例中,块指令的寄存器参数传递装置可以为块头指示的全局寄存器和块体内部预设的块内寄存器建立寄存器映射表,使得块引擎可以基于块内寄存器可以索引和输出到全局寄存器,提高了执行效率,同时,块体的块内指令可以不再依赖get和set等指令对全局寄存器进行索引和赋值,从而节省了块体中使用get和set等块内指令额外占用的空间。进一步地,块指令的块内指令中可以不再使用全局寄存器的ID,如此,当两个或多个块指令的模式相同,但它们使用的全局寄存器不相同时,该两个或多个块指令可 以共用同一块体,使得块体代码可以不用再重复编写,提高了块体代码的复用性。
在一种可能的实现方式中,所述确定模块,具体用于:基于所述第i条块内指令中的输入信息确定出所述第j个块内寄存器;所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离t-j;当t小于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离N-j+t。
在一种可能的实现方式中,所述第一处理模块,具体用于:基于所述K条块内指令各自的执行结果对应的块内寄存器以及输入信息,确定所述K条块内指令各自的输入数据对应的所述P个块内寄存器;针对所述P个块内寄存器和所述P个全局寄存器建立映射关系。
在一种可能的实现方式中,所述K条块内指令各自的执行结果对应的块内寄存器包括S个块内寄存器;所述装置,还包括:第三处理模块,用于针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系;所述S个全局寄存器与所述S个块内寄存器一一对应。
在一种可能的实现方式中,所述第一处理模块,具体用于:基于所述K条块内指令各自的输入数据对应的所述P个块内寄存器在所述N个块内寄存器中的顺序和所述P个全局寄存器在所述块头中的指示顺序,为所述P个块内寄存器和所述P个全局寄存器建立映射关系;所述第三处理模块,具体用于:基于所述S个块内寄存器在所述N个块内寄存器中的顺序和所述S个全局寄存器在所述块头中的指示顺序,为所述S个块内寄存器和所述S个全局寄存器建立映射关系。
在一种可能的实现方式中,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系;所述装置,还包括:第四处理模块,用于基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。
第三方面,本申请实施例提供了一种计算机可读存储介质,用于存储上述第二方面中的一种或多种所提供的一种用于实现块指令的寄存器参数传递方法的设备装置所用的计算机软件指令,其包含用于执行上述方面所设计的程序。
第四方面,本申请实施例提供了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述第二方面中的一种或多种所提供的一种用于实现块指令的寄存器参数传递方法的装置所执行的流程。
第五方面,本申请实施例提供了一种终端设备,该终端设备中包括处理器,处理器被配置为支持该终端设备实现第一方面所提供的块指令的寄存器参数传递方法中相应的功能。该终端设备还可以包括存储器,存储器用于与处理器耦合,其保存该终端设备必要的程序指令和数据。该终端设备还可以包括通信接口,用于该终端设备与其他设备或通信网络通信。
第六方面,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于支持设备实现上述第一方面所涉及的功能,例如,生成或处理上述块指令的寄存器参数传递方法中所涉及的信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1是一种预测执行的执行队列示意图。
图2是本申请实施例提供的一种块指令的寄存器参数传递方法应用的系统架构示意图。
图3是本申请实施例提供的一种块头结构的示意图。
图4是本申请实施例提供的一种块体的结构示意图。
图5是本申请实施例提供的一种块指令的寄存器参数传递方法的流程示意图。
图6是本申请实施例提供的一种块内寄存器管理队列示意图。
图7是本申请提供的块指令的寄存器参数传递流程的架构示意图。
图8是本申请实施例提供的一种块指令的寄存器参数传递装置的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的一个或多个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
首先,对本申请中的部分用语进行解释说明,以便于本领域技术人员理解。
(1)超标量处理器,是一种可以在一个时钟周期内运行多条指令的处理器。超标量处理器的一种实现手段是进行预测执行,提高线程的执行效率。可参见图1,图1是一种预测执行的执行队列示意图,其中,执行队列中可以包括已经完成执行(废弃)的指令记录、执行中的指令记录以及待执行的指令记录,超标量处理器可以从待执行的指令记录中进行预测执行,但如果确定指令记录无效后,只能把该指令记录以及它之后的指令记录也全部删除,从无效的地方重新填充有效的记录。因此,超标量处理器为保证预测执行的准确性,需要付出较高的成本去管理较长的队列,从而看到更长距离的执行分支特征。
(2)块指令,包括块头和块体,其中,块头一般是无条件的块描述,用于说明整个块的依赖关系,例如,块指令的属性和种类、寄存器输入、寄存器输出、下一个块头的指针、本块体的指针等;块体,是实际的指令执行,表达了块指令的具体计算,块体包括一系列块内指令,块体由处理器中的块引擎进行执行。
(3)在本申请中,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
首先,分析并提出本申请所具体要解决的技术问题。在现有技术中,在块头一层与块体一层之间传递寄存器参数的方案包括方案一和方案二:
方案一:直接索引的方式,即块体一层可以直接使用块头一层的寄存器。这个方案中,首先会在块头描述中,声明具体使用哪些寄存器(例如,传统精简指令系统计算机中央处理器(Reduced Instruction Set Computer CPU,RISC CPU)常用的32个通用寄存器r0-r31,也叫全局寄存器),然后可以在块体的具体块内指令中直接索引这些寄存器。
该方案一存在以下缺点:
对于全局寄存器消耗大,执行效率低,块体代码复用性差。直接索引的方式使得每个块指令都要占用全局寄存器,这会带来不小的资源浪费,在实际执行时,块体内部一般不会将所有的全局寄存器(例如32个)用完,例如仅使用了其中8个,但是这8个全局寄存器仍然使用的是全局地址(即每个寄存器仍需5bit存储地址),会造成资源浪费。
并且,使用直接索引全局寄存器也可能会降低执行效率,例如,当需要把块内的计算结果临时保存起来,不影响输出,但因为用了全局寄存器,这个临时结果也需要一个输出寄存器进行临时保存(这也是一种资源浪费),这样也决定了这个块指令没有执行结束,不能预测执行下一个块指令,降低了执行效率, 因为处理器不能确定这个结果是不是这个块指令最终的结果。
同时,在这种方式下,块体的块内指令也必须和传统RISC指令一样复杂,执行效率也受限。此外,这种方式一般需要在每条块内指令中直接引用全局寄存器,不利于块体代码的复用,例如,当两个或多个块指令的模式相同时(如都是做寄存器加法),但因为该两个或多个块指令用的是不同的全局寄存器,就需要将代码重复写两次或多次,代码消耗自然也增加了,这不值得。
方案二:默认使用输出寄存器的方式,这个方式主要使用块指令内部私有的寄存器作为每条块内指令的输出寄存器,减少对全局寄存器的使用。这个方案中,块指令中的每个块内指令可以只有全局寄存器作为输入寄存器,不需要指定全局寄存器作为输出寄存器,而是使用内部私有的寄存器(块内寄存器)临时保存每条块内指令的计算结果。当其它块内指令需要引用这些计算结果时,只需要用本块内指令和对应的块内指令的偏移,就可以引用到这些计算结果各自对应的块内寄存器,相比于方案一,方案二的执行效率更高了一些。
该方案二存在以下缺点:
指令占用空间较大,块体代码复用性差。基于这个方式,块指令的块内指令在实际编写时,需要额外使用get和set等指令,使得处理器的块引擎可以对全局寄存器进行索引以及赋值。额外使用的get和set等指令则需要额外占用空间,使得指令占用空间较大。
此外,额外使用get和set等指令对全局寄存器进行索引以及赋值,仍然需要在块内指令中使用到全局寄存器,这同样不利于块体代码的复用,当两个或多个块指令的模式相同,但它们使用的全局寄存器不相同时,还是需要将代码重复写两次或多次。
为了解决现有技术中在块头一层与块体一层之间传递寄存器参数存在的执行效率较低、指令占用空间较大以及块体复用性较差的问题,综合考虑现有技术存在的缺点,本申请实际要解决的技术问题包括如下:
通过寄存器映射表建立全局寄存器和块内寄存器的映射关系。本申请实施例中,可以为块头指示的全局寄存器和块体内部预设的块内寄存器建立寄存器映射表,使得通过块内寄存器可以索引和输出到全局寄存器,块体的块内指令可以不再依赖get和set等指令对全局寄存器进行索引和赋值,提高了执行效率,减少了块体使用get和set等指令额外占用的空间。并且,块指令的块内指令中可以不再直接使用到全局寄存器,提高了块体代码的复用性,即当两个或多个块指令的模式相同,但它们使用的全局寄存器不相同时,块体代码可以不用再重复编写,可以直接复用。
为更好地理解本申请实施例提供的块指令的寄存器参数传递方法,下面将对本申请实施例提供的块指令的寄存器参数传递方法的系统架构和/或应用场景进行说明。可理解的,本申请实施例描述的系统架构以及应用场景是为了可以更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定。
本申请实施例提供的块指令的寄存器参数传递方法可应用的系统架构,可参见图2,图2是本申请实施例提供的一种块指令的寄存器参数传递方法应用的系统架构示意图,该系统架构可以是中央处理器(Central Processing Unit,CPU)的架构示意,或者也可以是图形处理器(Graphics Processing Unit,GPU)、系统级芯片(System on Chip,SOC)等处理器的架构,该系统架构中可以包括一个或多个块头取指译码逻辑单元[101]以及一个或多个块引擎[102]。需要说明的是,传统处理器一般是包括取指译码逻辑单元和计算单元,调度的是单个指令,本申请实施例中的处理器一般是通过块引擎[102]对块指令进行调度执行,即在本申请实施例的处理器架构中,执行队列中的每个成员,描述的不再是一条指令,而是一组指令。同时,每个块引擎[102]相当于一个小型的传统处理器,可以使用自己内部的执行系统,不和其他块发生关系,实现更简单,也更容易通过提高块引擎的数量以及种类,对块处理器进行扩展,而不会大幅增加复杂度。
其中,块头取指译码逻辑单元[101],可以用于对块头序列中一个或多个块头进行取指译码,然后可以将块头放入块头执行队列中进行管理。本申请实施例中,处理器可以在块头取指译码逻辑单元[101]对块头进行取指译码时,确定出该块头对应的块指令具体使用到的全局寄存器,从而可以在处理器确定出块体的块内寄存器后,建立全局寄存器和块内寄存器的寄存器映射表,使得块内寄存器可以直接索引和输出到全局寄存器,块体的块内指令可以不再依赖get和set等指令对全局寄存器进行索引和赋值,提高了执行效率,减少了块体使用get和set等指令额外占用的空间,并且,块指令的块内指令中可以不再直接使用到全局寄存器,提高了块体代码的复用性。
块引擎[102],相当于一个小型的传统处理器,可以包括取指单元、译码单元以及计算单元,块引擎[102]可以用于对块内指令序列中的块内指令进行取指译码以及执行计算。本申请实施例中,块引擎可以根据块头执行队列确定出需要处理的块指令的块体,并根据块内指令序列对块体内的一个或多个块内指令进行取指译码以及执行计算,其中,块引擎在执行计算的过程中,可以根据寄存器映射表确定全局寄存器和块内寄存器的映射关系,利用块内寄存器可以直接索引和输出到全局寄存器,使得在实际编写块体的块内指令时,可以不再依赖get和set等指令对全局寄存器进行索引和赋值,从而能够缩减块体使用get和set等指令额外占用的空间,同时,块指令的块内指令中可以不再直接使用到全局寄存器,提高了块体代码的复用性。
需要说明的是,本申请实施例可以应用于各种计算机的处理器系统架构中,上述图中的计算机的处理器架构只是本申请实施例中的一种示例性的实施方式,本申请实施例可应用的架构包括但不仅限于以上架构。应该理解的是,计算机的处理器架构可以具有比图中所示的更多的或者更少的单元/模块,可以组合两个或多个的单元/模块,或者可以具有不同的单元/模块配置。图中所示出的各种单元/模块可以在包括一个或多个信号处理和/或专用集成电路在内的硬件、软件、或硬件和软件的组合中实现。
可理解地,本申请实施例提供的块指令的寄存器参数传递方法可以由终端设备执行。终端设备是指能够被抽象为计算机系统的设备,其中,支持块指令处理功能的终端设备,也可称为块指令处理装置。块指令处理装置可以是该终端设备的整机,例如:智能可穿戴设备、智能手机、平板电脑、笔记本电脑、台式电脑、车载计算机或服务器,等等;也可以是由多个整机构成的系统/装置;还可以是该终端设备中的部分器件,例如:块指令处理功能相关的芯片,如基于块指令的处理器、系统芯片(system on a chip,SoC),等等,本申请实施例对此不作具体限定。其中,系统芯片也称为片上系统。
为方便理解,下面先对本申请实施例中所涉及的块指令的块头和块体的结构以及功能进行简单说明。可理解地,以下说明仅是块头和块体的结构和功能的示例性说明,并不构成对块头和块体的具体限定。
请参见图3,图3是本申请实施例提供的一种块头结构的示意图,其中,一个块头可以和一个块体对应,当块指令的模式相同时,不同的块头可以共用同一个块体(如块头0和块头1共用块体0,块头2和块头3共用块体2),即实现块体代码复用;进一步地,块体代码的复用也可以按整个块体中部分块内指令的代码进行复用。具体地,块指令的块头可以是一条定长的RISC指令,例如为128bit的RISC指令,其中可以包含块指令的种类和属性,全局输入寄存器位掩码(Bitmask),全局输出寄存器Bitmask,跳转指针偏移和块体指针偏移等信息,具体如下。
块指令的种类:可以包括计算类型,例如定点,浮点,定制块体和加速器调用等。
块指令的属性:可以包括块指令的提交策略,例如块指令的重复执行次数,块指令的执行是否要具备原子性/可见性/有序性等。
全局输入寄存器Bitmask:可以用于指示块指令的全局输入寄存器。一个块指令一般最多可以有32个寄存器作为全局输入寄存器。这32个输入寄存器的名称(Identity,ID)可以用Bitmask的格式表达。
全局输出寄存器Bitmask:可以用于指示块指令的全局输出寄存器。一个块指令一般最多可以有32个寄存器作为全局输出寄存器。这32个输出寄存器的ID可以用Bitmask的格式表达。
需要说明的是,这32个全局输入/输出寄存器(R0-R31)都是物理寄存器(即通用寄存器),即是块指令与块指令之间共享的寄存器。如果一个块指令刚好有32个输入,那它32bit的全局输入寄存器位掩码中的每一个位都是1(也可以用“0”表示)。如果一个块指令刚好有16个输出,那它32bit的全局输入寄存器位掩码中对应的19个位是1(例如11110000110000011101111001011111,则该块指令的输出寄存器为R0、R1、R2、R3、R4、R6、R9、R10、R11、R12、R14、R15、R16、R22、R23、R28、R29、R30和R31这19个通用寄存器),即块指令的输入/输出使用哪些通用寄存器,这些通用寄存器对应的位就是1(也可以用“0”表示)。可选地,若CPU中只有16个通用寄存器,则每个块指令也一般最多只有16个全局输入/输出寄存器,可以通过16bit的位掩码来指示每个块指令的全局输入/输出寄存器,可理解地,当CPU中有更多通用寄存器(例如64个)时,则每个块指令的全局输入/输出寄存器也可以为64个,本申请实施例对此不作具体限定。
跳转指针偏移:可以用于指示在当前块指令执行完毕后,下一个待执行的块指令的块头的存储位置。
块体指针偏移:可以用于指示当前块指令的块体的存储位置。一般地,将上一个块体指针加上该块体指针偏移,便可以得到当前的块体指针,即可以得到当前块体的存储位置。
再参见图4,图4是本申请实施例提供的一种块体的结构示意图,其中,块体可以包括块内指令1、块 内指令2、块内指令3……块内指令K等一系列块内指令,每条块内指令按序执行。需要说明的是,不同种类的块指令可以定义不同的块体格式。以较为基本的标准块指令为例,块内指令(block micro-instruction)可以为16bit的定长RISC指令。对于标准块指令来说,块体中的每条块内指令可以为包括至多两个输入和一个输出的指令。
如图4所示,一个块指令的输入可以通过块体中的get指令从其他块指令的输出寄存器中读取。例如图4中第1条块内指令(get R0)读取了其他块指令写到通用寄存器R0中的数据,又例如图4中的第2条块内指令(get R1)读取了其他块指令写到通用寄存器R1中的数据,又例如图4中的第4条块内指令(get R2)读取了其他块指令写到通用寄存器R2中的数据。需要说明的是,本申请实施例中,通过建立寄存器映射表,表达了块头指示的全局寄存器(通用寄存器)与块内寄存器的映射关系,可以减少使用get指令,从而缩减使用get指令额外占用的空间。
如图4所示,一个块指令的输出可以通过块体中的set指令写到对应的输出寄存器上,即写回上述32个物理寄存器中的一个或多个。本申请实施例中,通过建立寄存器映射表,表达了块头指示的全局寄存器(通用寄存器)与块内寄存器的映射关系,可以减少使用set指令,从而缩减使用set指令额外占用的空间。综上,在本申请实施例中,块指令可以通过寄存器映射表访问全局寄存器(通用寄存器),也可以通过块内指令中显式的get/set指令访问通用寄存器,以获取其他块指令的输出,或者输出自己的执行结果。即,本申请实施例中可以在通过寄存器映射表实现全局寄存器的访问的同时,也可以不影响块内指令继续使用get/set指令访问全局寄存器。需要说明的是,本申请实施例提供的块指令的寄存器参数传递方法也不影响处理器的执行部件通过对寄存器进行重命名(rename),可以根据寄存器映射表直接访问到对应的全局寄存器上,不用把全局寄存器的值拷贝进来。
可选地,除了上述特殊的get指令和set指令外,块体中的其他块内指令的输入大多可以来自当前块体中前序块内指令的输出。需要说明的是,块体中每条块内指令的执行结果可以隐含写到自己对应的块内寄存器(即块内输出寄存器)中。然后,当前的块内指令可以通过指令间的相对距离索引前序块内指令暂存在块内寄存器中的执行结果,以此作为输入。如此,在块内指令编码上,每条块内指令可以不用表达自己的输出寄存器,即每条块内指令可以只表达自己的操作码和至多两个输入寄存器(即至多两个相对距离)。以块内指令为16bit的定长RISC指令为例,其结构可以如下表一所示。
表一
如上表一所示,以长度为16bit的块内指令为例,其中,10bit(0-9)可以用于表达操作码,2个3bit可以分别用于表达两个输入寄存器(即两个相对距离),此时相对距离的限制为1-8,即当前块内指令只能索引在当前块内指令之前的第1条-第8条块内指令中的指令的执行结果,以此作为输入。例如,图4中的第3条块内指令(add 1 2)的两个输入分别来自与第3条块内指令的相对距离是1和2的两条指令(即将第1条指令和第2条块内指令)的执行结果。可理解地,若当前的块内指令只有一个输入,则除去操作码后剩余的6bit可以全部用于表达一个输入寄存器,此时相对距离的限制为1-64,即当前块内指令只能索引在当前块内指令之前的第1条-第64条块内指令中的指令的执行结果,以此作为输入。
示例性的,块指令的汇编可以如下:
以上述块内第3条块内指令(add T#1 T#2)为例,T#1表示与当前块内指令(add T#1 T#2)的相对距离为1的块内指令所对应的块内寄存器,该块内寄存器中存储有块内指令get R0的执行结果;T#2表示与当前块内指令(add T#1 T#2)的相对距离为2的块内指令所对应的块内寄存器,该块内寄存器中存储有块内指令get R1的执行结果。则第3条块内指令(add T#1 T#2)实际做的计算操作为将全局寄存器R0和R1中的数据进行相加。
下面基于上述图2提供的架构示意图,以及上述图3提供的块头结构示意图和图4提供的块体结构示意 图,对本申请提供的块指令的寄存器参数传递方法进行说明。
请参见图5,图5是本申请实施例提供的一种块指令的寄存器参数传递方法的流程示意图,该方法可应用于上述图2中所述的基于块指令的处理器架构中,也即是说上述基于块指令的处理器可以用于支持并执行图5中所示的方法流程步骤S500-步骤S502,其中,步骤S500-步骤S502包括以下:
步骤S500:针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表。
具体地,基于块指令的处理器所处理的块指令可以包括块头和块体,其中,块头中可以指示块指令具体使用的一个或多个全局寄存器(即M个全局寄存器,其中可以包括全局输出寄存器和全局输入寄存器),块体可以包括一条或多条块内指令(即K条块内指令),并默认设置有一个或多个块内寄存器(即N个块内寄存器,其中可以包括块内输出寄存器和块内输入寄存器),块体中每条块内指令的执行结果可以输出到对应的块内寄存器(此时为块内输出寄存器)中暂存,即块体中的块内寄存器用于存储块体的K条块内指令中对应的块内指令输出的执行结果,M、N、K均为大于0的整数。在此基础上,基于块指令的处理器可以基于块头指示的M个全局寄存器中的P个全局寄存器以及块体对应的N个块内寄存器中的P个块内寄存器建立寄存器映射表,上述P个全局寄存器与P个块内寄存器一一对应,即所述寄存器映射表可以包括所述M个全局寄存器中的全局输入寄存器(即P个全局寄存器)和所述N个块内寄存器中的块内输入寄存器(即P个块内寄存器)之间的输入索引关系,使得块引擎[102]在执行块内指令时可以通过该寄存器映射表由块内寄存器访问到全局寄存器。可选地,基于块指令的处理器也可以针对M个全局寄存器和N个块内寄存器中的部分块内寄存器建立映射关系,其中,N个块内寄存器中的部分块内寄存器的数量可以基于(1,2,…,MIN(M,N))公式确定(即P可以取1至MIN(M,N)中的任一整数值),其中,1表示块内寄存器的序号从1开始取值,可理解地,块内寄存器的序号也可以从其它取值开始,公式进行相应修改即可,在此不作具体限定。当块头指示的M个全局寄存器中全局输入寄存器的数量大于块内寄存器的数量(N个)时,可以不建立所有全局输入寄存器与块内寄存器的映射关系,仅建立一部分,可理解地,全局输出寄存器与块内寄存器的映射关系也可参考此法建立,在此不另外展开。
在一种可能的实现方式中,基于块指令的处理器可以先基于块体包括的K条块内指令各自的执行结果对应的块内输出寄存器以及输入信息,确定出K条块内指令各自的输入数据对应的块内输入寄存器(即P个块内寄存器);其中,块内输入寄存器和块内输出寄存器均为块体内默认的N个块内寄存器中的寄存器。然后,基于块指令的处理器可以针对该K条块内指令各自对应的块内输入寄存器和块内输出寄存器,以及块头中指示一个或多个全局寄存器(即M个全局寄存器,可以包括全局输入寄存器和全局输出寄存器)建立映射关系,从而得到寄存器映射表,使得块引擎[102]在执行块内指令时可以通过该寄存器映射表由块内寄存器访问到全局寄存器,即可以实现向全局寄存器索引得到输入,以及向全局寄存器输出。
可选地,块头中指示的一个或多个全局寄存器(即M个全局寄存器)可以包括一个或多个全局输出寄存器(即S个全局寄存器),以及一个或多个全局输入寄存器(即P个全局寄存器);基于块指令的处理器建立的寄存器映射表可以包括上述一个或多个全局输出寄存器与块体对应的N个块内寄存器中的一个或多个块内输出寄存器(即S个块内寄存器)的输出索引关系,以及上述一个或多个全局输入寄存器与所述N个块内寄存器中的一个或多个块内输入寄存器(即P个块内寄存器)的输入索引关系。使得块引擎[102]在执行块内指令时可以通过该寄存器映射表中的输入索引关系由块内输入寄存器访问到全局输入寄存器,从而获得块内指令的输入,以及可以通过该寄存器映射表中的输出索引关系,由块内输出寄存器访问到全局输出寄存器,从而输出块内指令的执行结果。
在一种可能的实现方式中,块引擎[102]可以根据块头指示的一个或多个全局输入寄存器(即P个全局寄存器)的指示顺序,以及K条块内指令各自的输入数据对应的块内输入寄存器在N个块内寄存器中的顺序,建立全局输入寄存器和块内输入寄存器的映射关系;并,可以根据块头指示的一个或多个全局输出寄存器(即S个全局寄存器)的指示顺序,以及K条块内指令各自的执行结果对应的块内输出寄存器在N个块内寄存器中的顺序,建立全局输出寄存器和块内输出寄存器的映射关系。使得块引擎[102]在调度执行块指令内部具体的块内指令时,可以通过寄存器映射表由块内输入寄存器访问到全局输入寄存器,以此获取到全局输入寄存器的数据并作为块内指令的输入,以及可以通过寄存器映射表由块内输出寄存器访问到全局输出寄存器,以此将块指令的执行结果输出到对应全局输出寄存器。
步骤S501:确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器。
具体地,基于块指令的处理器包括的块引擎[102]可以确定出一条或多条块内指令(即K条块内指令) 中某条块内指令(即第i条块内指令)的输入数据来自于一个或多个块内寄存器(即P个块内寄存器)中的某个块内寄存器(即第j个块内寄存器)。可选地,块引擎[102]可以基于该第i条块内指令的输入信息确定出其输入数据来自的第j个块内寄存器,其中,所述输入信息包括所述第i条块内指令的执行结果对应的块内寄存器和所述第j个块内寄存器之间的相对距离,所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,相对距离可以为t-j;当t小于j时,相对距离可以为N-j+t。即,块引擎[102]在对该第i条块内指令进行取指译码时,在明确该第i条块内指令的执行结果对应的块内寄存器的基础上,可以通过该第i条块内指令的输入信息,确定出该第i条块内指令的输入数据来自N个块内寄存器中的第j个块内寄存器。例如,输入信息指示的相对距离为3,则块引擎[102]可以从该第i条块内指令的执行结果对应的块内寄存器向前偏移找到相对距离为3的块内寄存器,即为第j个块内寄存器,可理解地,输入信息中指示的相对距离可以为大于0的整数,即可以为3以外的其它值,在此不作具体限定。
在一种可能的实现方式中,块引擎[102]在使用块内指令默认的一个或多个块内寄存器(即N个块内寄存器)时,可以通过一个循环队列对这一个或多个块内寄存器进行管理。可参见图6,图6是本申请实施例提供的一种块内寄存器管理队列示意图,其中,块引擎[102]通过该循环队列可以管理块指令默认的N个块内寄存器,例如,第一条块内指令的执行结果可以先输出到第一个块内寄存器(令为0),第二条块内指令的执行结果可以输出到第二个块内寄存器(令为1),……,如此类推,第N+1条块内指令的执行结果可以重新输出到块内寄存器0,随后,后续的每条块内指令总是会按序更新最后一个块内寄存器,并放弃最旧的一个块内寄存器的值,可理解为,第N+1条块内指令的执行结果会覆盖掉原本第1条块内指令输出到第一个块内寄存器(令为0)的执行结果。根据这个循环队列的输出算法,当块体中的第n条块内指令中想要通过输入信息(例如,T#m)对块内寄存器进行索引的时候,可以通过索引下标为(n-m)%N的块内寄存器(即为第j个块内寄存器)实现,其中,m为第n条块内指令的执行结果默认输出的块内寄存器与其想要索引的块内寄存器的相对距离,N为块指令默认的块内指令数量。需要说明的是,块内指令的执行结果也可以将块内寄存器与全局寄存器的输入映射关系覆盖,例如,第i条块内指令执行时,可以索引到第j个块内寄存器,进而基于映射关系访问到对应的全局输入寄存器(如r1),若后续第i+x条块内指令的执行结果默认输出到该第j个块内寄存器时,该执行结果可以将第j个块内寄存器与全局输入寄存器(如r1)的映射关系覆盖,也即是说,当第i+x条块内指令后的某条块内指令需要索引第j个块内寄存器时,得到的是第i+x条块内指令的执行结果,而不是访问全局输入寄存器(如r1)。
需要说明的是,T#m这个语法一般会受到块内寄存器总量的限制,块内指令利用T#m一般不能访问超出总量范围的块内输出寄存器。例如,块内寄存器的数量总共有8个,某条块内指令利用T#m一般无法索引在该条指令之前的第九条块内指令的执行结果,如果要越过这个范围,可以通过在中间的块内指令进行接续,或者写到全局寄存器资源上。可理解地,通过循环队列可以将第一条块内指令的执行结果可以输出到第一个块内寄存器(令为0),也可以输出到其它块内寄存器,为方便本申请实施例,所以以输出到第一个块内寄存器(令为0)为例,在此不作具体限定。基于此,本申请实施例可以通过建立全局输入寄存器和循环队列中队列尾部的块内寄存器的映射关系,让块指令中靠前的块内指令通过T#m索引下标为(n-m)%N的块内寄存器,从而可以访问到全局输入寄存器。
步骤S502:基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
具体地,基于块指令的处理器包括的块引擎[102]在确定出第i条块内指令的输入数据来自的第j个块内寄存器后,可以基于寄存器映射表中的映射关系确定出与第j个块内寄存器对应的目标全局寄存器,进而块引擎[102]可以访问到该目标全局寄存器,并从该目标全局寄存器中获取到目标数据,以该目标数据作为输入执行所述第i条块内指令。
在一种可能的实现方式中,本申请实施例提供的块指令的寄存器参数传递方法的流程,除了上述步骤S500-步骤S502之外,还包括步骤S503。步骤S503:基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。具体地,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系,块引擎[102]可以在执行完块指令中的K条块内指令后,通过寄存器映射表将块指令中块体的执行结果分别输出至块头指示的S个全局寄存器(即全局输出寄存器)中对应的全局输出寄存器,并将块指令提交。
为更方便理解本申请实施例提供的块指令的寄存器参数传递方法的流程,可参见图7,图7是本申请提供的块指令的寄存器参数传递流程的架构示意图,其中,全局寄存器文件中的全局寄存器数量,以及块内寄存器文件中的全局寄存器数量可以由基于块指令的处理器提前预设,可选地,基于块指令的处理器在对 块指令的块头进行取指译码时,可以根据块头描述符确定出块头指示的全局输入寄存器(例如,x、y、z)以及全局输出寄存器(例如,a、b、c),并为全局寄存器(包括全局输入寄存器、全局输出寄存器)和块内寄存器(包括块内输入寄存器、块内输出寄存器)建立映射关系,从而得到寄存器映射表;输入寄存器映射包括全局输入寄存器和块内寄存器管理队列尾部寄存器的映射,当块体中的块内指令(指令1)执行时,可以通过块内寄存器管理队列尾部的寄存器(块内寄存器N-2)索引到全局输入寄存器(例如寄存器x),并且块内指令在执行结束后的执行结果可以默认输出到块内寄存器管理队列首部的寄存器(块内寄存器0),块体中后续的块内指令(指令2)执行时,参考上述过程进行索引,通过块内寄存器N-1索引到全局输入寄存器(y),执行结果默认输出到块内寄存器1;然后再根据寄存器映射表将块指令的执行结果映射到全局输出寄存器,其中,输出寄存器映射包括了全局输出寄存器和块内指令默认输出的块内寄存器之间的映射关系。需要说明的是,寄存器映射表中除了全局寄存器和块内寄存器之间的映射关系之外,还可以包括其它信息,如寄存器的标记(flags)信息、属性(ref)信息等。
以上,对本申请实施例提供的块指令的寄存器参数传递方法流程进行了简单的描述说明。至于,使用本申请实施例提供的块指令的寄存器参数传递方法进行块指令调度的效果如何,以下再继续进行描述说明。
为更方便、更直观理解本申请实施例的效果,下面将以基于本申请实施例提供的块指令的寄存器参数传递方法的具体块指令的代码汇编,与基于现有技术一及现有技术二的代码汇编进行对比。以2个块指令均做寄存器加法为例进行示例,把4个输入寄存器的值加起来输出到第一个输入寄存器上。
现有技术一的实现方式是直接索引,其代码汇编可以如下:
以上,定义了2个块指令(Block,Block1和Block2),这2个Block的模式是相同的,均指示了4个全局输入寄存器,1个全局输出寄存器。其中,Block1的全局输入寄存器为r1,r2,r3,r4,全局输出寄存器为r1;Block2的全局输入寄存器为r5,r6,r7,r8,全局输出寄存器为r5。
现有技术二的实现方式是默认使用输出寄存器,其代码汇编可以如下:

以上,定义了2个块指令(Block,Block1和Block2),这2个Block的模式是相同的,均指示了4个全局输入寄存器,1个全局输出寄存器。其中,Block1的全局输入寄存器为r1,r2,r3,r4,全局输出寄存器为r1;Block2的全局输入寄存器为r5,r6,r7,r8,全局输出寄存器为r5。在这个例子中,每个块内指令都只有输入寄存器,不需要指定输出寄存器,其他块内指令引用这些输出寄存器,只需要用本块内指令和这些输出寄存器对应的块内指令的偏移就可以了。其中,多出来的get和set指令,对块引擎来说,只是一个索引,本身并不需要执行,不会影响执行效率,但是需要额外占用一定的空间。
基于本申请实施例的提供的块指令的寄存器参数传递方法,代码汇编可以如下:
以上,定义了2个块指令(Block,Block1和Block1),这是2个不同的Block1,它们的模式是相同的,均指示了4个全局输入寄存器,1个全局输出寄存器,它们的块头不同,块体相同(复用)。其中,第一个Block1的全局输入寄存器为r1,r2,r3,r4,全局输出寄存器为r1;第二个Block1的全局输入寄存器为r5,r6,r7,r8,全局输出寄存器为r5。在这个例子中,第一条块内指令(即第i条块内指令)引用t#3和t#4,这从指令距离的角度上看是没有意义的,但是从块头输入的角度,块引擎[102]认为在第一条块内指令前面有了针对r1到r4的四个虚拟的get指令,而寄存器映射表中包括了四个虚拟的get指令对应的块内寄存器和对应的全局寄存器的映射,所以第一条块内指令通过t#3和t#4就索引到了全局寄存器r2和r1(即第j个块内寄存器对应的全局寄存器),同理,第二条块内指令(即第i条块内指令)Add t#2,t#3则可以索引到了r4和r3这两个全局寄存器(即第j个块内寄存器对应的全局寄存器)。
同时,因为通过寄存器映射表建立了块内寄存器和全局寄存器的映射关系,在块体内部的具体块内指令中将看不到全局寄存器r1,r2等索引。这样,原来的Block1和Block2即便模式相同,但由于所使用的具体全局寄存器不同仍需要重复编写代码两次,此时就可以不用再重复编写,而是复用同一个块体。
需要说明的是,这种方式也可以进一步推广,在汇编代码中也继续使用get/set等指令,仍可以根据本引擎的下标引用这些全局寄存器,而不是使用实际的全局寄存器地址进行索引。
可理解地,最终在整个块指令被提交的时候,可以对全局寄存器的状态进行检查,如果某个全局输出寄存器没有被输出过,可以根据寄存器映射表从对应的块内寄存器进行填充。
综上,对比于现有技术解决块头一层与块体一层之间传递寄存器参数的方案,本申请实施例中,可以为块头指示的全局寄存器和块体内部预设的块内寄存器建立寄存器映射表,使得通过块内寄存器可以索引和输出到全局寄存器,块体的块内指令可以不再依赖get和set等指令对全局寄存器进行索引和赋值,提高了执行效率,减少了块体使用get和set等指令额外占用的空间。并且,块指令的块内指令中可以不再直接使用到全局寄存器,提高了块体代码的复用性,即当两个或多个块指令的模式相同,但它们使用的全局寄存器不相同时,块体代码可以不用再重复编写,可以直接复用。
上述详细阐述了本申请实施例的方法,下面提供本申请实施例的几种相关装置。
请参见图8,图8是本申请实施例提供的一种块指令的寄存器参数传递装置的结构示意图,该块指令的寄存器参数传递装置8,可以包括第一处理模块81,确定模块82,第二处理模块83,还可以包括第三处理模块84和第四处理模块85。其中,各个单元的详细描述如下:
第一处理模块81,用于针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表;所述P个全局寄存器与所述P个块内寄存器一一对应;M、N均为大于0的整数;
确定模块82,用于确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的 第j个块内寄存器;i取1、2、……、K,j取1、2、……、P;
第二处理模块83,用于基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
在一种可能的实现方式中,所述确定模块82,具体用于:
基于所述第i条块内指令中的输入信息确定出所述第j个块内寄存器;所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离t-j;当t小于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离N-j+t。
在一种可能的实现方式中,所述第一处理模块81,具体用于:
基于所述K条块内指令各自的执行结果对应的块内寄存器以及输入信息,确定所述K条块内指令各自的输入数据对应的所述P个块内寄存器;
针对所述P个块内寄存器和所述P个全局寄存器建立映射关系。
在一种可能的实现方式中,所述K条块内指令各自的执行结果对应的块内寄存器包括S个块内寄存器;所述装置,还包括:
第三处理模块84,用于针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系;所述S个全局寄存器与所述S个块内寄存器一一对应。
在一种可能的实现方式中,所述第一处理模块81,具体用于:
基于所述K条块内指令各自的输入数据对应的所述P个块内寄存器在所述N个块内寄存器中的顺序和所述P个全局寄存器在所述块头中的指示顺序,为所述P个块内寄存器和所述P个全局寄存器建立映射关系;
所述第三处理模块84,具体用于:
基于所述S个块内寄存器在所述N个块内寄存器中的顺序和所述S个全局寄存器在所述块头中的指示顺序,为所述S个块内寄存器和所述S个全局寄存器建立映射关系。
在一种可能的实现方式中,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系;所述装置,还包括:
第四处理模块85,用于基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。
需要说明的是,本申请实施例中所描述的块指令的寄存器参数传递装置8中各功能单元/模块的功能可参见上述方法实施例中的相关描述,此处不再赘述。
需要说明的是,本申请中描述的块指令的寄存器参数传递装置并不限于此,该块指令的寄存器参数传递装置可以位于任意一个终端设备中,如电脑、计算机、手机、平板、服务器等各类设备中。块指令的寄存器参数传递装置具体可以是芯片或芯片组或搭载有芯片或者芯片组的电路板。该芯片或芯片组或搭载有芯片或芯片组的电路板可在必要的软件驱动下工作。例如,所述块指令的寄存器参数传递装置可以是:
(1)独立的集成电路IC、芯片、芯片系统或子系统;
(2)具有一个或多个IC的集合,可选的,该IC集合也可以包括用于存储数据,计算机程序的存储部件;
(3)可嵌入在其他设备内的模块;
(4)其他等等。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序代码,当上述处理器执行该计算机程序代码时,使得计算机执行前述任一实施例中的方法。
本申请实施例还提供一种终端设备,该终端设备可以以芯片的产品形态存在,该终端设备中包括处理器,处理器被配置为支持该终端设备实现前述任一实施例中的方法中相应的功能。该终端设备还可以包括存储器,存储器用于与处理器耦合,其保存该终端设备必要的程序指令和数据。该终端设备还可以包括通信接口,用于该终端设备与其他设备或通信网络通信。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行前述任一实施例中的方法。
本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于支持设备实现上述第一方面所涉及的功能,例如,生成或处理上述块指令的寄存器参数传递方法中所涉及的信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存设备必要的程序指令和数据。该芯片系统,可以由芯片 构成,也可以包含芯片和其他分立器件。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以为个人计算机、服务端或者网络设备等,具体可以是计算机设备中的处理器)执行本申请各个实施例上述方法的全部或部分步骤。其中,而前述的存储介质可包括:U盘、移动硬盘、磁碟、光盘、只读存储器(Read-OnlyMemory,缩写:ROM)或者随机存取存储器(RandomAccessMemory,缩写:RAM)等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (16)

  1. 一种块指令的寄存器参数传递方法,其特征在于,所述块指令包括块头和块体,所述块头用于指示所述块指令对应的M个全局寄存器,所述块体包括K条块内指令,所述K条块内指令对应N个块内寄存器,K为大于0的整数;所述方法,包括:
    针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表;所述P个全局寄存器与所述P个块内寄存器一一对应;M、N均为大于0的整数;
    确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器;i取1、2、……、K,j取1、2、……、P;
    基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
  2. 如权利要求1所述的方法,其特征在于,所述确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器,包括:
    基于所述第i条块内指令中的输入信息确定出所述第j个块内寄存器;所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离t-j;当t小于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离N-j+t。
  3. 如权利要求1-2中任一项所述的方法,其特征在于,所述针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,包括:
    基于所述K条块内指令各自的执行结果对应的块内寄存器以及输入信息,确定所述K条块内指令各自的输入数据对应的所述P个块内寄存器;
    针对所述P个块内寄存器和所述P个全局寄存器建立映射关系。
  4. 如权利要求1-3中任一项所述的方法,其特征在于,所述K条块内指令各自的执行结果对应的块内寄存器包括S个块内寄存器;所述方法,还包括:
    针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系;所述S个全局寄存器与所述S个块内寄存器一一对应。
  5. 如权利要求4所述的方法,其特征在于,所述针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,包括:
    基于所述K条块内指令各自的输入数据对应的所述P个块内寄存器在所述N个块内寄存器中的顺序和所述P个全局寄存器在所述块头中的指示顺序,为所述P个块内寄存器和所述P个全局寄存器建立映射关系;
    所述针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系,包括:
    基于所述S个块内寄存器在所述N个块内寄存器中的顺序和所述S个全局寄存器在所述块头中的指示顺序,为所述S个块内寄存器和所述S个全局寄存器建立映射关系。
  6. 如权利要求4-5中任一项所述的方法,其特征在于,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系;所述方法,还包括:
    基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。
  7. 一种块指令的寄存器参数传递装置,其特征在于,所述块指令包括块头和块体,所述块头用于指示所述块指令对应的M个全局寄存器,所述块体包括K条块内指令,所述K条块内指令对应N个块内寄存器,K为大于0的整数;所述装置,包括:
    第一处理模块,用于针对所述M个全局寄存器中的P个全局寄存器以及所述N个块内寄存器中的P个块内寄存器建立映射关系,得到寄存器映射表;所述P个全局寄存器与所述P个块内寄存器一一对应;M、N均为大于0的整数;
    确定模块,用于确定所述K条块内指令中的第i条块内指令的输入数据来自所述P个块内寄存器中的第j个块内寄存器;i取1、2、……、K,j取1、2、……、P;
    第二处理模块,用于基于所述寄存器映射表,确定所述P个全局寄存器中与所述第j个块内寄存器对应的目标寄存器,从所述目标寄存器中获取所述输入数据,并执行所述第i条块内指令。
  8. 如权利要求7所述的装置,其特征在于,所述确定模块,具体用于:
    基于所述第i条块内指令中的输入信息确定出所述第j个块内寄存器;所述第i条块内指令的执行结果对应的块内寄存器为第t个块内寄存器;当t大于或等于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离t-j;当t小于j时,所述输入信息包括所述第t个块内寄存器和所述第j个块内寄存器之间的相对距离N-j+t。
  9. 如权利要求7-8中任一项所述的装置,其特征在于,所述第一处理模块,具体用于:
    基于所述K条块内指令各自的执行结果对应的块内寄存器以及输入信息,确定所述K条块内指令各自的输入数据对应的所述P个块内寄存器;
    针对所述P个块内寄存器和所述P个全局寄存器建立映射关系。
  10. 如权利要求7-9中任一项所述的装置,其特征在于,所述K条块内指令各自的执行结果对应的块内寄存器包括S个块内寄存器;所述装置,还包括:
    第三处理模块,用于针对所述M个全局寄存器中的S个全局寄存器以及所述S个块内寄存器建立映射关系;所述S个全局寄存器与所述S个块内寄存器一一对应。
  11. 如权利要求10所述的装置,其特征在于,所述第一处理模块,具体用于:
    基于所述K条块内指令各自的输入数据对应的所述P个块内寄存器在所述N个块内寄存器中的顺序和所述P个全局寄存器在所述块头中的指示顺序,为所述P个块内寄存器和所述P个全局寄存器建立映射关系;
    所述第三处理模块,具体用于:
    基于所述S个块内寄存器在所述N个块内寄存器中的顺序和所述S个全局寄存器在所述块头中的指示顺序,为所述S个块内寄存器和所述S个全局寄存器建立映射关系。
  12. 如权利要求10-11中任一项所述的装置,其特征在于,所述寄存器映射表还包括所述S个全局寄存器与所述S个块内寄存器的映射关系;所述装置,还包括:
    第四处理模块,用于基于所述寄存器映射表将所述K条块内指令中的一条或多条块内指令的执行结果分别输出至所述S个全局寄存器中对应的全局寄存器,并将所述块指令提交。
  13. 一种计算机可读存储介质,其特征在于,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1-6中任一项所述的方法。
  14. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被计算机执行时,使得所述计算机执行如权利要求1-6中任一项所述的方法。
  15. 一种终端设备,其特征在于,包括处理器和存储器,其中,所述存储器用于存储程序代码,所述程序代码被所述处理器执行时,所述终端设备实现如权利要求1-6中任一项所述的方法。
  16. 一种芯片系统,其特征在于,所述芯片系统包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行时,权利要求1-6中任意一项所述的方法得以实现。
PCT/CN2023/132608 2022-11-22 2023-11-20 一种块指令的寄存器参数传递方法和相关设备 WO2024109689A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211466485.4A CN118069225A (zh) 2022-11-22 2022-11-22 一种块指令的寄存器参数传递方法和相关设备
CN202211466485.4 2022-11-22

Publications (1)

Publication Number Publication Date
WO2024109689A1 true WO2024109689A1 (zh) 2024-05-30

Family

ID=91102711

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/132608 WO2024109689A1 (zh) 2022-11-22 2023-11-20 一种块指令的寄存器参数传递方法和相关设备

Country Status (2)

Country Link
CN (1) CN118069225A (zh)
WO (1) WO2024109689A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030192A (zh) * 2006-03-02 2007-09-05 国际商业机器公司 管理处理器的寄存器的方法和系统
CN101419541A (zh) * 2007-10-25 2009-04-29 晶心科技股份有限公司 存取多个寄存器其中之一目标寄存器的方法及其相关装置
US20130167149A1 (en) * 2011-12-26 2013-06-27 International Business Machines Corporation Register Mapping Techniques
CN108027767A (zh) * 2015-09-19 2018-05-11 微软技术许可有限责任公司 寄存器读取/写入排序
CN113254073A (zh) * 2021-05-31 2021-08-13 厦门紫光展锐科技有限公司 数据处理方法及装置
CN114691537A (zh) * 2022-03-31 2022-07-01 合肥忆芯电子科技有限公司 一种访问存储器的方法及信息处理设备

Also Published As

Publication number Publication date
CN118069225A (zh) 2024-05-24

Similar Documents

Publication Publication Date Title
US12086603B2 (en) Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions
TWI279715B (en) Method, system and machine-readable medium of translating and executing binary of program code, and apparatus to process binaries
JP4934356B2 (ja) 映像処理エンジンおよびそれを含む映像処理システム
US5155817A (en) Microprocessor
JPH01131949A (ja) 処理依頼機能を持つ並列計算機
KR20190129702A (ko) 부동 소수점 데이터를 압축하기 위한 시스템
WO2021249054A1 (zh) 一种数据处理方法及装置、存储介质
CN112214241B (zh) 一种分布式指令执行单元的方法及系统
JP2017016639A (ja) トランザクショナル電力管理を実行するためのハードウェア装置及び方法
US20210089305A1 (en) Instruction executing method and apparatus
JP2024523339A (ja) ニアメモリコンピューティングを用いた複合操作のアトミック性の提供
WO2006102379A2 (en) Processor and method of grouping and executing dependent instructions in a packet
CN116909943A (zh) 一种缓存访问方法、装置、存储介质及电子设备
WO2024109689A1 (zh) 一种块指令的寄存器参数传递方法和相关设备
WO2009032186A1 (en) Low-overhead/power-saving processor synchronization mechanism, and applications thereof
CN114510271B (zh) 用于在单指令多线程计算系统中加载数据的方法和装置
CN111814093A (zh) 一种乘累加指令的处理方法和处理装置
US20030182538A1 (en) Method and system for managing registers
CN109683959B (zh) 处理器的指令执行方法及其处理器
CN114661227A (zh) 通过使用忘记存储来增加每核存储器带宽
CN113885943A (zh) 处理单元、片上系统、计算装置及方法
CN112000592A (zh) 一种模块间数据交互的方法和装置
WO2024087039A1 (zh) 一种块指令的处理方法和块指令处理器
CN117501256A (zh) 用于大数据集的复杂过滤器硬件加速器
WO2019134376A1 (zh) 虚拟地址确定方法及装置、处理器、存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23893779

Country of ref document: EP

Kind code of ref document: A1