US11036512B2 - Systems and methods for processing instructions having wide immediate operands - Google Patents
- Publication number
- US11036512B2 (application US16/579,161)
- Authority
- US
- United States
- Prior art keywords
- immediate operand
- wide
- immediate
- hcilt
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/3016—Decoding the operand specifier, e.g. specifier format
- G06F9/30167—Decoding the operand specifier, e.g. specifier format of immediate specifier, e.g. constants
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30181—Instruction operation extension or modification
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3802—Instruction prefetching
- G06F9/3824—Operand accessing
Definitions
- the present disclosure is related to processor-based systems and methods for operating processor-based systems to accommodate the use of immediate operands that are larger than an instruction size defined by an instruction set architecture (ISA) with minimal overhead.
- ISA instruction set architecture
- ISAs Instruction set architectures
- Most ISAs have a relatively small instruction size (e.g., four bytes).
- an immediate value i.e., a value that is stored as part of an instruction itself rather than as a pointer to a memory location or register
- a move immediate instruction e.g., “movi register, immediate,” where “movi” is the opcode of the instruction, “immediate” is an immediate operand specifying an immediate value, and “register” is a register operand specifying the register that will be updated with the immediate value
- one byte is reserved for the opcode and one byte is reserved for the register operand, leaving only two bytes for the immediate operand.
- immediate values over two bytes in length cannot be stored in the instruction itself.
- a branch to immediate offset instruction e.g., “bri immediate,” where “bri” is the opcode of the instruction and “immediate” is an immediate operand specifying the offset value to jump to
- one byte is reserved for the opcode, leaving only three bytes for the immediate operand.
- immediate values with a length over three bytes cannot be stored in the instruction itself.
- when an immediate value is too large to fit in the allotted space provided by the instruction as dictated by the ISA, it is defined herein as a wide immediate.
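The field budgets described above can be checked with a short sketch. The 4-byte instruction size and the one-byte opcode and register fields are the figures from the examples above, not a requirement of any particular ISA:

```python
def immediate_field(instr_bytes, opcode_bytes, reg_bytes=0):
    """Return (bytes left for the immediate, largest unsigned value it can hold)."""
    imm_bytes = instr_bytes - opcode_bytes - reg_bytes
    return imm_bytes, (1 << (8 * imm_bytes)) - 1

# movi register, immediate: opcode + register operand leave two bytes
print(immediate_field(4, 1, 1))  # (2, 65535)
# bri immediate: only the opcode byte is reserved, leaving three bytes
print(immediate_field(4, 1))     # (3, 16777215)
```

Any value above these limits is a wide immediate in the sense defined above.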
- branch to immediate offset instructions having a wide immediate operand may be chained together to finally arrive at the offset indicated by the wide immediate operand.
- an indirect branch may be used to arrive at the offset indicated by the wide immediate operand. Indirect branches occupy space in branch prediction circuitry of the processor, and in the present case in which there is one target that is 100% predictable, occupying this space in the branch prediction circuitry is wasteful.
- a processor element in a processor-based system is configured to fetch one or more instructions associated with a program binary, where the one or more instructions include an instruction having an immediate operand.
- the processor element is configured to determine if the immediate operand is a reference to a wide immediate operand.
- the processor element is configured to retrieve the wide immediate operand from a common immediate lookup table (CILT) in the program binary, where the immediate operand indexes the wide immediate operand in the CILT.
- CILT common immediate lookup table
- the processor element is then configured to process the instruction having the immediate operand such that the immediate operand is replaced with the wide immediate operand from the CILT.
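A minimal sketch of this replacement step, with the CILT modeled as a Python list shipped alongside the program binary; the table contents and the separate "is this a reference?" flag are illustrative assumptions:

```python
CILT = [0x1122334455667788, 0x0000F00DDEADBEEF]  # hypothetical wide immediates

def resolve(immediate, is_wide_reference):
    """Replace a referencing immediate with the wide value it indexes in the CILT."""
    if is_wide_reference:
        return CILT[immediate]   # the immediate operand indexes the CILT
    return immediate             # ordinary immediate: use the value as-is

print(hex(resolve(0, True)))   # 0x1122334455667788
print(resolve(42, False))      # 42
```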
- a processor element in a processor-based system includes a hardware CILT (HCILT) and instruction processing circuitry.
- HCILT includes hardware storage (e.g., a memory or register) configured to store a table indexing immediate values to wide immediate values.
- the instruction processing circuitry is configured to fetch one or more instructions associated with a program binary from an instruction memory, the instructions including an instruction having an immediate operand.
- the instruction processing circuitry is configured to determine if the immediate operand is a reference to a wide immediate operand.
- In response to determining that the immediate operand is a reference to a wide immediate operand, the instruction processing circuitry is configured to search the HCILT for the wide immediate operand indexed by the immediate operand, and, in response to finding the wide immediate operand in the HCILT, process the instruction such that the immediate operand is replaced by the wide immediate operand from the HCILT. If the wide immediate operand is not found in the HCILT, it is retrieved from the CILT as discussed above. If the immediate operand is not a reference to a wide immediate operand, the instruction is processed as usual. Using the HCILT to store and retrieve wide immediate operands avoids having to load the wide immediate operands from memory and thus may significantly improve the performance of the processor-based system.
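The HCILT-first policy with CILT fallback can be sketched as follows; the class and method names, and the choice to cache a fetched entry when space remains, are assumptions for illustration:

```python
class WideImmediateLookup:
    """Hit the small hardware table (HCILT) when possible; fall back to the
    in-binary CILT on a miss, optionally caching the fetched entry."""

    def __init__(self, cilt, hcilt_capacity):
        self.cilt = cilt               # full table shipped in the program binary
        self.hcilt = {}                # small dedicated hardware storage
        self.capacity = hcilt_capacity

    def lookup(self, index):
        if index in self.hcilt:        # HCILT hit: no load from memory
            return self.hcilt[index]
        value = self.cilt[index]       # HCILT miss: dynamic load from the CILT
        if len(self.hcilt) < self.capacity:
            self.hcilt[index] = value  # copy into the HCILT for future cycles
        return value
```

A filled HCILT avoids the memory load entirely on subsequent lookups of the same index.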
- FIG. 1 is a block diagram illustrating an exemplary processor-based system that includes a processor configured to process instructions including wide immediate operands such that the wide immediate operands are fetched from a common immediate lookup table (CILT) or hardware CILT (HCILT);
- CILT common immediate lookup table
- HCILT hardware CILT
- FIG. 2 is a block diagram illustrating exemplary details of a processor in a processor-based system in FIG. 1 processing instructions including wide immediate operands such that the wide immediate operands are fetched from a CILT or HCILT;
- FIG. 3 is a flowchart illustrating an exemplary process for processing instructions that may include immediate operands that reference wide immediate operands stored in a CILT or HCILT;
- FIG. 4 is a flowchart illustrating an exemplary process for processing a move immediate instruction that may include an immediate operand that references a wide immediate operand stored in a CILT or HCILT;
- FIG. 5 is a flowchart illustrating an exemplary process for populating an HCILT from a CILT
- FIG. 6 is a block diagram illustrating an exemplary compiler system for compiling source code into a program binary including a CILT;
- FIG. 7 is a flowchart illustrating an exemplary process for generating a program binary including a CILT from source code
- FIG. 8 is a block diagram illustrating an exemplary processor-based system that includes a processor configured to process instructions including wide immediate operands such that the wide immediate operands are fetched from a CILT or HCILT; and
- FIG. 9 is a flowchart illustrating an exemplary process for handling an HCILT miss wherein a wide immediate operand is not found in an HCILT.
- FIG. 1 is a schematic diagram of an exemplary processor-based system 100 that may include improvements thereto in order to more efficiently process instructions having wide immediate operands.
- the processor-based system 100 includes a number of processor blocks 102 ( 1 )- 102 (M), wherein in the present exemplary embodiment “M” is equal to any number of processor blocks 102 desired.
- Each processor block 102 contains a number of processor elements 104 ( 1 )- 104 (N), wherein in the present exemplary embodiment “N” is equal to any number of processors desired.
- the processor elements 104 in each one of the processor blocks 102 may be microprocessors (µP), vector processors (vP), or any other type of processor.
- each processor block 102 contains a shared level 2 (L2) cache 106 for storing cached data that is used by any of, or shared among, each of the processor elements 104 .
- a shared level 3 (L3) cache 108 is also provided for storing cached data that is used by any of, or shared among, each of the processor blocks 102 .
- An internal bus system 110 is provided that allows each of the processor blocks 102 to access the shared L3 cache 108 as well as other shared resources such as a memory controller 112 for accessing a main, external memory (MEM), one or more peripherals 114 (including input/output devices, networking devices, and the like), and storage 116 .
- MEM main, external memory
- peripherals 114 including input/output devices, networking devices, and the like
- one or more of the processor elements 104 in one or more of the processor blocks 102 work with the memory controller 112 to fetch instructions from memory, execute the instructions to perform one or more operations and generate a result, and optionally store the result back to memory or provide the result to another consumer instruction for consumption.
- FIG. 2 shows details of a processor element 104 in a processor block 102 of the processor-based system 100 according to an exemplary embodiment of the present disclosure.
- the processor element 104 includes an instruction processing circuit 200 .
- the instruction processing circuit 200 includes an instruction fetch circuit 202 that is configured to fetch instructions 204 from an instruction memory 206 .
- the instruction memory 206 may be provided in or as part of a system memory in the processor-based system 100 as an example.
- An instruction cache 208 may also be provided in the processor element 104 to cache the instructions 204 fetched from the instruction memory 206 to reduce latency in the instruction fetch circuit 202 .
- the instruction fetch circuit 202 in this example is configured to provide the instructions 204 as fetched instructions 204 F into one or more instruction pipelines I 0 -I N as an instruction stream 210 in the instruction processing circuit 200 to be pre-processed, before the fetched instructions 204 F reach an execution circuit 212 to be executed.
- the instruction pipelines I 0 -I N are provided across different processing circuits or stages of the instruction processing circuit 200 to pre-process and process the fetched instructions 204 F in a series of steps that can be performed concurrently to increase throughput prior to execution of the fetched instructions 204 F in the execution circuit 212 .
- a control flow prediction circuit 214 (e.g., a branch prediction circuit) is also provided in the instruction processing circuit 200 in the processor element 104 to speculate or predict a target address for a control flow fetched instruction 204 F, such as a conditional branch instruction.
- the prediction of the target address by the control flow prediction circuit 214 is used by the instruction fetch circuit 202 to determine the next fetched instructions 204 F to fetch based on the predicted target address.
- the instruction processing circuit 200 also includes an instruction decode circuit 216 configured to decode the fetched instructions 204 F fetched by the instruction fetch circuit 202 into decoded instructions 204 D to determine the instruction type and actions required, which may also be used to determine in which instruction pipeline I 0 -I N the decoded instructions 204 D should be placed.
- the decoded instructions 204 D are then placed in one or more of the instruction pipelines I 0 -I N and are next provided to a register access circuit 218 .
- the register access circuit 218 is configured to access a physical register 220 ( 1 )- 220 (X) in a physical register file (PRF) 222 to retrieve a produced value from an executed instruction 204 E from the execution circuit 212 .
- the register access circuit 218 is also configured to provide the retrieved produced value from an executed instruction 204 E as the source register operand of a decoded instruction 204 D to be executed.
- the instruction processing circuit 200 also includes a dispatch circuit 224 , which is configured to dispatch a decoded instruction 204 D to the execution circuit 212 to be executed when all source register operands for the decoded instruction 204 D are available.
- the dispatch circuit 224 is responsible for making sure that the necessary values for operands of a decoded consumer instruction 204 D, which is an instruction that consumes a produced value from a previously executed producer instruction, are available before dispatching the decoded consumer instruction 204 D to the execution circuit 212 for execution.
- the operands of the decoded instruction 204 D can include immediate values, values stored in memory, and produced values from other decoded instructions 204 D that would be considered producer instructions to the consumer instruction.
- an HCILT 226 is provided within, or as shown, in addition to the PRF 222 .
- the HCILT 226 includes a set of HCILT registers 228 ( 1 )- 228 (Y), where “Y” is any desired number, dedicated to storing wide immediate values such that the wide immediate values are indexed by immediate values that fit within the instruction size of the ISA of the processor element 104 .
- the HCILT registers 228 may include support registers for accomplishing the functionality of the HCILT 226 as discussed in detail below.
- the HCILT 226 may be searched for the wide immediate operand such that the immediate operand is replaced with the wide immediate operand from the HCILT 226 by the register access circuit 218 . This may significantly improve the performance of program binary execution by bypassing loading wide immediate operands from memory, which would otherwise need to occur to process an instruction having a wide immediate value. Further details regarding the functionality of the HCILT 226 are discussed below. Notably, while the HCILT 226 is illustrated above as a set of registers, the HCILT may be implemented as any type of dedicated hardware storage such as a hardware memory in various embodiments.
- the execution circuit 212 is configured to execute decoded instructions 204 D received from the dispatch circuit 224 . As discussed above, the executed instructions 204 E may generate produced values to be consumed by other instructions. In such a case, a write circuit 230 writes the produced values to the PRF 222 so that they can be later consumed by consumer instructions.
- FIG. 3 is a flow diagram illustrating a method for operating the processor element 104 to process instructions having wide immediate operands according to an exemplary embodiment of the present disclosure.
- instructions associated with a program binary are fetched from the instruction memory 206 , or, if cached, the instruction cache 208 (block 300 ).
- the program binary includes a CILT, which is a table storing wide immediate operands that are indexed by immediate operands that fit within an instruction size of the ISA of the processor element 104 .
- the instructions include an instruction having an immediate operand.
- an immediate operand is a value that is stored as part of an instruction itself, rather than as a pointer to a memory location or register.
- the ISA of the processor element 104 may specify that immediate operands include a reserved bit, which specifies whether the immediate operand is a reference to a wide immediate operand or not. For example, if the most significant bit of an immediate operand is set, the ISA may specify that the immediate operand is a reference to a wide immediate operand, which may be stored in a CILT or HCILT as discussed below.
- the ISA may specify that the immediate operand is not a reference to a wide immediate operand.
- the ISA of the processor element 104 may specify custom opcodes that specify that an immediate operand following the custom opcode is a reference to a wide immediate operand.
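The reserved-bit convention can be sketched as follows; the 16-bit field width and the flag-in-most-significant-bit encoding are assumptions drawn from the example above:

```python
WIDE_FLAG = 1 << 15   # assumption: MSB of a 16-bit immediate field is reserved

def classify_immediate(imm16):
    """Return (is_wide_reference, payload); the flag bit is stripped so the
    payload can index the CILT or HCILT directly."""
    if imm16 & WIDE_FLAG:
        return True, imm16 & ~WIDE_FLAG
    return False, imm16

print(classify_immediate(0x8003))  # (True, 3): index 3 into the CILT
print(classify_immediate(0x0042))  # (False, 66): ordinary immediate value
```

A custom-opcode ISA would instead make the wide/narrow decision from the opcode, leaving all immediate bits available as the index.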
- if the immediate operand is not a reference to a wide immediate operand, the instruction is processed by the execution circuit 212 conventionally (block 304). If the immediate operand is a reference to a wide immediate operand, a determination is made whether the processor element 104 includes the HCILT 226 (block 306).
- the HCILT 226 is a hardware structure including one or more registers for storing a table which stores wide immediate operands referenced by immediate operands that fit within an instruction size of the ISA of the processor element 104 .
- the HCILT 226 is the hardware corollary to the CILT, and is meant to further expedite processing of instructions having wide immediate operands compared to the CILT alone.
- Determining if the processor element 104 includes the HCILT 226 may comprise reading a register of the processor element 104 . Instructions for determining whether the processor element 104 includes the HCILT 226 may be included in the ISA of the processor element 104 . If the processor element 104 does not include the HCILT 226 , the wide immediate operand may be retrieved from the CILT in the program binary (block 308 ). Retrieving the wide immediate operand from the CILT in the program binary may include fetching the wide immediate operand from a memory location that is indexed by the immediate value.
- the immediate operand may directly point to a memory location including the wide immediate value (e.g., via an offset value from a starting memory address of the CILT) or the CILT may be a map, where the immediate value is hashed to get the actual index of the wide immediate value.
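The two indexing schemes can be sketched side by side; the base address, entry size, and hash function are illustrative assumptions:

```python
CILT_BASE = 0x00400000   # assumed virtual address where the CILT is loaded
ENTRY_BYTES = 8          # assumed size of one wide immediate entry

def direct_entry_address(index):
    """Direct scheme: the immediate is an offset from the CILT base address."""
    return CILT_BASE + index * ENTRY_BYTES

def hashed_slot(immediate, slots):
    """Map scheme: the immediate is hashed to find the actual table index."""
    return (immediate * 2654435761) % slots   # multiplicative hash, illustrative

print(hex(direct_entry_address(3)))  # 0x400018
```

The direct scheme needs no extra lookup state; the map scheme lets sparse immediates share a small table.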
- the loading of the wide immediate value from memory is performed by the processor element 104 in response to encountering an instruction with an immediate operand that references a wide immediate operand (either due to dual semantics of the immediate operand or due to a custom opcode) such that the load from memory is not explicit in instructions associated with the program binary.
- A = load X // X is a wide immediate operand
- B = Y + A // dependent on the preceding load instruction; this pair can be reconfigured as:
- B = Y + A′ // A′ is an immediate operand with dual semantics
- two instructions used to process an instruction having a wide immediate operand can be condensed into a single instruction, where the loading of the wide immediate value is handled by the processor according to a dedicated ISA specification. This not only reduces the static code size of the program binary but also the instruction fetch bandwidth, which is likely to improve the performance of the processor element 104 .
- the instruction is then processed such that the immediate operand is replaced with the wide immediate operand from the CILT (block 310 ). If the processor element 104 does include the HCILT 226 , a determination is made whether the wide immediate operand referenced by the immediate operand is in the HCILT 226 (block 312 ).
- the HCILT 226 may not be large enough to hold every wide immediate operand in the program binary. That is, the HCILT 226 may be smaller than the CILT and thus only some of the wide immediate operands may be present in the HCILT 226 .
- if the wide immediate operand referenced by the immediate operand is not in the HCILT 226 , the wide immediate operand is retrieved from the CILT in the program binary (block 314 ), which is done as discussed above by a dynamic load initiated by the processor element 104 .
- the instruction is then processed such that the immediate operand is replaced with the wide immediate operand from the CILT (block 316 ).
- the wide immediate operand can also be copied from the CILT to the HCILT 226 (block 318 ) such that the wide immediate operand can be more easily accessed in a future processing cycle.
- One or more caching rules may dictate whether a wide immediate operand not found in the HCILT 226 should be added to the HCILT 226 after it is fetched from the CILT as discussed below.
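The patent leaves these caching rules open; one plausible rule is least-recently-used replacement, sketched here as an assumption:

```python
from collections import OrderedDict

class HCILTFill:
    """One possible caching rule for block 318: LRU replacement of HCILT
    entries.  The policy choice is an assumption, not the patent's."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()   # index -> wide immediate operand

    def insert(self, index, wide_value):
        if index in self.table:
            self.table.move_to_end(index)    # refresh recency on reuse
        elif len(self.table) >= self.capacity:
            self.table.popitem(last=False)   # evict the coldest entry
        self.table[index] = wide_value
```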
- if the wide immediate operand is in the HCILT 226 , it is retrieved from the HCILT 226 (block 320 ).
- the wide immediate operand may be retrieved from the HCILT 226 using the immediate operand as a direct index or a hashed index as discussed above with respect to the CILT.
- the instruction is then processed such that the immediate operand is replaced with the wide immediate operand from the HCILT 226 (block 322 ).
- a number of system registers may be added to the processor element 104 , providing support for using the CILT alone or the CILT along with an HCILT.
- the table below indicates the additional registers and their functions:
- HCILT_present Indicates if the hardware implements an HCILT. If this bit is not set, the OS must not attempt to load the CILT into the HCILT. Read-only register.
- CILT_base_address Contains the virtual address at which the CILT is loaded in the program's address space.
- HCILT_active_entry Contains the current active HCILT entry, which is the entry in the HCILT table that is implicitly written/read when accessing the HCILT through a system register read/write.
- HCILT_table A system register such that the instruction "wsr HCILT_table, constant" (write system register) will write the CILT entry at CILT_base_address[immediate operand * size of CILT entry in bytes] into the HCILT array entry number pointed to by HCILT_active_entry.
- the register can also be read to retrieve stored wide immediate operands therein.
- these registers are only one exemplary implementation of ISA support for a CILT and HCILT for improving processing of instructions having wide immediate operands.
- dedicated instructions in the ISA are provided to load wide immediates from the CILT such that one or more of the registers discussed above may be unnecessary and thus not included.
- FIG. 4 is a flow diagram illustrating the application of the process discussed above to a specific instruction, a move immediate (movi) instruction to be processed by the processor element 104 .
- a move immediate instruction includes a register operand and an immediate operand (block 400 ). The instruction, when processed, moves the immediate operand into the register.
- the processor element 104 determines if the immediate operand is a reference to a wide immediate operand (block 402 ). As discussed above, determining whether the immediate operand is a reference to a wide immediate operand may include determining if a reserved bit in the immediate operand is set.
- if the immediate operand is not a reference to a wide immediate operand, the register is set to the immediate operand (block 404 ) and the move immediate instruction is completed (block 406 ). If the immediate operand is a reference to a wide immediate operand, the processor element 104 determines if it includes the HCILT 226 (block 408 ). If the processor element 104 includes the HCILT 226 , the register is set to the value in HCILT_table[immediate] (block 410 ). As shown, the immediate operand indexes the wide immediate operand in the HCILT 226 . The move immediate instruction is then completed (block 406 ).
- if the processor element 104 does not include the HCILT 226 , the processor element 104 injects a load register instruction (“ldr register, [CILT_base_address+immediate]”) to load the wide immediate operand from the CILT (block 412 ), which is stored in memory starting at CILT_base_address.
- the immediate operand is used to index the wide immediate operand in the CILT. Any reserved bits used for determining if the immediate operand is a reference to a wide immediate operand may be stripped from the immediate value before using the immediate value as an index (e.g., offset) to retrieve the wide immediate operand.
- the move immediate instruction is then completed (block 406 ).
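The full FIG. 4 decision flow can be condensed into one sketch; the table shapes and the reserved-bit convention are assumptions carried over from the examples above:

```python
WIDE_FLAG = 1 << 15   # assumed reserved bit in the immediate field

def process_movi(registers, reg, imm, hcilt, cilt):
    """Sketch of 'movi reg, imm' processing (hcilt=None models hardware
    without an HCILT)."""
    if not imm & WIDE_FLAG:
        registers[reg] = imm              # block 404: plain immediate
        return
    index = imm & ~WIDE_FLAG              # strip the reserved bit
    if hcilt is not None and index in hcilt:
        registers[reg] = hcilt[index]     # block 410: HCILT_table[immediate]
    else:
        registers[reg] = cilt[index]      # block 412: injected ldr from the CILT
```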
- FIG. 5 is a flow diagram illustrating how the HCILT 226 in the processor element 104 is populated from the CILT during a context switch according to an exemplary embodiment of the present disclosure.
- the population of the HCILT 226 occurs in response to a context switch in the program binary (block 500 ).
- the processor element 104 determines whether a number of entries in the HCILT 226 is greater than or equal to a number of entries in the CILT (block 502 ).
- the HCILT 226 includes a number of registers; the number and size of these registers determine how many entries (each storing a wide immediate operand) the HCILT 226 can hold.
- if the number of entries in the HCILT 226 is less than the number of entries in the CILT, only a subset of the CILT entries is copied into the HCILT 226 (block 508 ). For example, for a CILT having 32 entries and an HCILT having 4 entries, exemplary instructions may be executed to populate the HCILT 226 from the CILT.
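The extraction omits the exemplary instruction listing, so the copy loop is sketched here instead; the choice to copy the first N entries is an assumption (an implementation might instead select the hottest entries):

```python
def populate_hcilt(cilt, hcilt_entries):
    """Sketch of the block 508 copy loop: for each HCILT slot, select it via
    HCILT_active_entry and write the matching CILT entry via wsr HCILT_table."""
    hcilt = []
    for active_entry in range(min(hcilt_entries, len(cilt))):
        # wsr HCILT_active_entry, active_entry
        # wsr HCILT_table, active_entry
        hcilt.append(cilt[active_entry])
    return hcilt

print(populate_hcilt(list(range(32)), 4))  # [0, 1, 2, 3]
```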
- FIG. 6 illustrates an exemplary compiler system 600 .
- the compiler system 600 includes a memory 602 and processing circuitry 604 .
- the memory 602 and the processing circuitry 604 are connected via a bus 606 .
- the memory 602 stores instructions which, when executed by the processing circuitry 604 , cause the compiler system 600 to retrieve or otherwise receive source code, generate an intermediate representation of the source code, apply one or more compiler optimizations to the intermediate representation of the source code, and provide the optimized intermediate representation of the source code as machine code suitable for execution by a processor in a processor-based system.
- the compiler system 600 may further include input/output circuitry 608 , which may connect to storage 610 for storage and retrieval of source code and/or machine code.
- the operation of the compiler system 600 will be described as it relates to compiling source code into machine code for the processor element 104 in the processor-based system 100 .
- the compiler system 600 may more generally compile source code into machine code suitable for any processor in any processor-based system, including several different processors for several different processor-based systems.
- the memory 602 may include instructions which, when executed by the processing circuitry 604 , cause the compiler system 600 to generate machine code including a CILT and one or more instructions having an immediate value that references a wide immediate value stored in the CILT as discussed in detail below.
- FIG. 7 is a flow diagram illustrating a method for operating the compiler system 600 to generate a program binary including a CILT according to an exemplary embodiment of the present disclosure.
- the compiler system 600 receives source code (block 700 ).
- the source code may be code written in a high-level programming language such as C, Rust, Go, Swift, and the like. Alternatively, the source code may be in a low-level language (i.e., written directly in machine code) that is only assembled by the compiler system 600 as discussed below.
- the compiler system 600 identifies wide immediate operands in the source code (block 702 ). The wide immediate operands may be identified by static code analysis according to one or more rules.
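The compiler side can be sketched as a single pass that moves oversized immediates into a shared table; the 15 usable field bits, the flag-in-MSB rewrite, and the deduplication rule are assumptions used to illustrate why the table is called "common":

```python
def build_cilt(immediates, field_bits=15):
    """Rewrite a stream of immediates: values that fit stay inline, wide
    values move to the CILT and are replaced by a flagged index."""
    limit = (1 << field_bits) - 1
    cilt, rewritten = [], []
    for imm in immediates:
        if imm > limit:                    # wide immediate operand
            if imm not in cilt:            # share one entry per distinct value
                cilt.append(imm)
            rewritten.append((1 << field_bits) | cilt.index(imm))
        else:
            rewritten.append(imm)          # fits in the instruction itself
    return cilt, rewritten

print(build_cilt([5, 1 << 32]))  # ([4294967296], [5, 32768])
```

Deduplicating repeated wide values keeps the CILT small and means every use site shares the same index.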
- the CILT data structure along with the updated ISA for the processor element 104 which allows for immediate operands to reference wide immediate operands stored in the CILT, and, optionally, the HCILT 226 , may improve the performance of binary execution.
- the processor-based system 800 may be a circuit or circuits included in an electronic board card, such as a printed circuit board (PCB), a server, a personal computer, a desktop computer, a laptop computer, a personal digital assistant (PDA), a computing pad, a mobile device, or any other device, and may represent, for example, a server or a user's computer.
- the processor-based system 800 includes the processor 802 .
- the processor 802 represents one or more general-purpose processing circuits, such as a microprocessor, central processing unit, or the like. More particularly, the processor 802 may be an EDGE instruction set microprocessor, or other processor implementing an instruction set that supports explicit consumer naming for communicating produced values resulting from execution of producer instructions.
- the processor 802 and the system memory 808 are coupled to the system bus 810, which can intercouple peripheral devices included in the processor-based system 800. As is well known, the processor 802 communicates with these other devices by exchanging address, control, and data information over the system bus 810. For example, the processor 802 can communicate bus transaction requests to a memory controller 812 in the system memory 808 as an example of a slave device. Although not illustrated in FIG. 8, multiple system buses 810 could be provided, wherein each system bus 810 constitutes a different fabric. In this example, the memory controller 812 is configured to provide memory access requests to a memory array 814 in the system memory 808.
- the memory array 814 comprises an array of storage bit cells for storing data.
- the system memory 808 may be a read-only memory (ROM), flash memory, dynamic random-access memory (DRAM), such as synchronous DRAM (SDRAM), etc., and a static memory (e.g., flash memory, static random access memory (SRAM), etc.), as non-limiting examples.
- Other devices can be connected to the system bus 810 . As illustrated in FIG. 8 , these devices can include the system memory 808 , one or more input device(s) 816 , one or more output device(s) 818 , a modem 820 , and one or more display controllers 822 , as examples.
- the input device(s) 816 can include any type of input device, including but not limited to input keys, switches, voice processors, etc.
- the output device(s) 818 can include any type of output device, including but not limited to audio, video, other visual indicators, etc.
- the modem 820 can be any device configured to allow exchange of data to and from a network 824 .
- the network 824 can be any type of network, including but not limited to a wired or wireless network, a private or public network, a local area network (LAN), a wireless local area network (WLAN), a wide area network (WAN), a BLUETOOTH™ network, and the Internet.
- the modem 820 can be configured to support any type of communications protocol desired.
- the processor 802 may also be configured to access the display controller(s) 822 over the system bus 810 to control information sent to one or more displays 826 .
- the display(s) 826 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
- the processor-based system 800 in FIG. 8 may include a set of instructions 828 to be executed by the processor 802 for any application desired according to the instructions.
- the instructions 828 may be stored in the system memory 808 , processor 802 , and/or instruction cache 804 as examples of non-transitory computer-readable medium 830 .
- the instructions 828 may also reside, completely or at least partially, within the system memory 808 and/or within the processor 802 during their execution.
- the instructions 828 may further be transmitted or received over the network 824 via the modem 820 , such that the network 824 includes the computer-readable medium 830 .
- FIG. 9 is a flowchart illustrating details regarding what can happen if a wide_immediate operand is not found in the HCILT 226 (i.e., an HCILT miss) according to one embodiment of the present disclosure.
- the process begins at block 312 of the process discussed above with respect to FIG. 3 , where the wide immediate operand is not found in the HCILT 226 (the NO path from block 312 in FIG. 3 ). If the wide_immediate operand is not found in the HCILT 226 , a determination is made whether the processor element 104 has backend support for an HCILT miss (block 900 ).
- the processor-based system 100 may include a policy for determining when wide_immediate operands that were not found in the HCILT 226 should be copied from the CILT into the HCILT 226 .
- the size of the HCILT 226 is smaller than the number of entries in the CILT.
- policy rules such as a certain number of HCILT misses for a wide_immediate operand, a frequency of HCILT misses, or any number of different events may dictate that a wide_immediate operand be added to the HCILT 226 . If the policy dictates that the wide_immediate operand should be inserted in the HCILT 226 , a victim entry in the HCILT 226 is chosen (block 918 ), and the victim entry is replaced with the wide_immediate operand (block 920 ). The victim entry may similarly be chosen by any number of policy rules, such as frequency of use, for example.
- the pipeline is flushed (block 908 ), the instruction is re-fetched (block 910 ) and transformed such that the immediate operand is replaced with the wide_immediate operand from the CILT (block 912 ), and the transformed instruction is processed (block 914 ).
- the process can proceed to block 916 , where a determination is made whether the wide_immediate should be added to the HCILT 226 and can be added or not added based thereon.
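Taken together, blocks 908-920 amount to a small caching state machine. The sketch below models it in Python (illustrative only: the miss-count threshold and least-recently-used victim choice are assumed policies, and the flush/re-fetch/transform steps are reduced to a plain table fetch):

```python
# Simplified model of the HCILT-miss path of FIG. 9. The policy (insert
# after MISS_THRESHOLD misses, evict the least-recently-used victim) is
# an illustrative assumption; the disclosure allows any policy rules.

MISS_THRESHOLD = 2  # assumed: cache an entry after this many misses

class Hcilt:
    def __init__(self, capacity, cilt):
        self.capacity = capacity
        self.cilt = cilt      # full CILT in memory (list of wide values)
        self.entries = {}     # cilt_index -> wide value currently cached
        self.order = []       # LRU order of cached indices
        self.misses = {}      # per-index miss counts

    def lookup(self, index):
        if index in self.entries:                  # HCILT hit
            self.order.remove(index)
            self.order.append(index)               # refresh LRU position
            return self.entries[index]
        # HCILT miss: fetch from the CILT (blocks 908-914 would flush the
        # pipeline, re-fetch, and transform the instruction here).
        value = self.cilt[index]
        self.misses[index] = self.misses.get(index, 0) + 1
        if self.misses[index] >= MISS_THRESHOLD:   # block 916: policy check
            if len(self.entries) >= self.capacity: # block 918: choose victim
                victim = self.order.pop(0)         # least recently used
                del self.entries[victim]
            self.entries[index] = value            # block 920: replace victim
            self.order.append(index)
        return value

hcilt = Hcilt(capacity=2, cilt=[0x11, 0x22, 0x33])
first = hcilt.lookup(0)   # miss: fetched from the CILT, not yet cached
second = hcilt.lookup(0)  # second miss reaches the threshold and caches it
```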
- a processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
- the embodiments disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in RAM, flash memory, ROM, Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable medium known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
- the storage medium may be integral to the processor.
- the processor and the storage medium may reside in an ASIC.
- the ASIC may reside in a remote station.
- the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
Abstract
Description
-
- movi r0, 0xBADDF00DDEADCAFE
may be altered such that the wide immediate operand is stored in the program binary (at memory location 0xF9 when the binary is loaded into memory in the present example) and the move immediate instruction becomes: - ldr r0, [0xF9]
This can be done either explicitly by a developer of the program or by a compiler at compile time. Notably, any instructions that are dependent on the move immediate instruction must wait for the wide immediate operand to be loaded from memory before they can be processed. This may take several processing cycles and thus increase the execution time of a program binary.
-
- movi r0, 0xBADDF00DDEADCAFE
may be altered to become: - movi r0, 0xBADDF00D
- shl r0, 32
- addi r0, 0xDEADCAFE
Again, this can be done either explicitly by a developer of the program or by a compiler at compile time.
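The shift-and-add sequence reconstructs the original 64-bit constant, which is easy to confirm arithmetically (note the hex digits are zeros, not letter O's):

```python
# Verify that the movi/shl/addi sequence rebuilds the wide immediate.
r0 = 0xBADDF00D          # movi r0, 0xBADDF00D  (upper 32 bits)
r0 = r0 << 32            # shl  r0, 32
r0 = r0 + 0xDEADCAFE     # addi r0, 0xDEADCAFE  (lower 32 bits)
assert r0 == 0xBADDF00DDEADCAFE
```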
A = load X // X is a wide immediate operand
B = Y + A // dependent on preceding load instruction
can be reconfigured as:
B = Y + A′ // A′ is an immediate operand with dual semantics
As shown, two instructions used to process an instruction having a wide immediate operand can be condensed into a single instruction, where the loading of the wide immediate value is handled by the processor according to a dedicated ISA specification. This not only reduces the static code size of the program binary but also the instruction fetch bandwidth, which is likely to improve the performance of the processor-based system.
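A decode-time view of the dual-semantics operand can be sketched as follows (hypothetical: the tag-bit encoding that distinguishes an inline immediate from a CILT reference is an assumption for illustration; the disclosure leaves the encoding to the ISA):

```python
# Hypothetical decode sketch: an immediate field whose high bit is set
# is treated as a CILT index rather than a literal value, so
# "B = Y + A'" resolves A' without a separate load instruction.
# The tag-bit encoding is an illustrative assumption.

CILT_REF_BIT = 1 << 15  # assumed tag: high bit of a 16-bit immediate field

def resolve_immediate(imm_field, cilt):
    """Return the operand value, expanding CILT references at decode time."""
    if imm_field & CILT_REF_BIT:              # dual semantics: reference form
        return cilt[imm_field & ~CILT_REF_BIT]
    return imm_field                          # ordinary inline immediate

cilt = [0xBADDF00DDEADCAFE]
literal = resolve_immediate(0x10, cilt)            # plain immediate
wide = resolve_immediate(CILT_REF_BIT | 0, cilt)   # reference to CILT entry 0
```

Because the expansion happens at decode, the dependent instruction never waits on a memory load, which is the performance benefit the paragraph above describes.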
Register Name | Function
---|---
HCILT_present | Indicates if the hardware implements an HCILT. If this bit is not set, the OS must not attempt to load the CILT into the HCILT. Read-only register.
CILT_base_address | Contains the virtual address at which the CILT is loaded in the program's address space.
HCILT_active_entry | Contains the current active HCILT entry, which is the entry in the HCILT table that is implicitly written/read when accessing the HCILT through a system register read/write.
HCILT_table | A system register such that the instruction "write system register (wsr) HCILT_table, constant" will write CILT_base_address[immediate operand * size of CILT entry in bytes] into the HCILT array entry number pointed to by HCILT_active_entry. The register can also be read to retrieve stored wide immediate operands therein.
Notably, these registers are only one exemplary implementation of ISA support for a CILT and HCILT for improving processing of instructions having wide immediate operands. In one or more alternative embodiments, dedicated instructions in the ISA are provided to load wide immediates from the CILT such that one or more of the registers discussed above may be unnecessary and thus not included.
-
- wsr HCILT_active_entry, 0
- wsr HCILT_table, wide_immediate_0
- wsr HCILT_active_entry, 1
- wsr HCILT_table, wide_immediate_1
- wsr HCILT_active_entry, 2
- wsr HCILT_table, wide_immediate_2
- . . .
- wsr HCILT_active_entry, 31
- wsr HCILT_table, wide_immediate_31
where “wsr register, immediate” is a write system register instruction that writes “immediate” to “register,” and “wide_immediate_x” is wide immediate operand “x” stored in the CILT. As shown, HCILT_active_entry is written to update the index of the HCILT_table before every write to the HCILT_table. However, in some embodiments the handling of the HCILT_table index may be opaque such that it is automatically incremented and decremented (e.g., similar to a stack). The context is then switched in (block 506).
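The restore sequence above could be produced by OS context-switch code along these lines (a sketch: the instructions are modeled as strings purely for illustration, since a real kernel would issue the register writes directly):

```python
# Sketch of OS context-switch code emitting the wsr sequence shown above.
# Generating instruction text is only for illustration; actual code would
# perform the system register writes rather than emit strings.

def emit_hcilt_restore(cilt_entries):
    """Return the wsr sequence that loads cilt_entries into the HCILT."""
    seq = []
    for i, name in enumerate(cilt_entries):
        seq.append(f"wsr HCILT_active_entry, {i}")  # select the HCILT slot
        seq.append(f"wsr HCILT_table, {name}")      # write the wide immediate
    return seq

seq = emit_hcilt_restore(["wide_immediate_0", "wide_immediate_1"])
```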
-
- wsr HCILT_active_entry, 0
- wsr HCILT_table, wide_immediate_0
- wsr HCILT_active_entry, 1
- wsr HCILT_table, wide_immediate_4
- wsr HCILT_active_entry, 2
- wsr HCILT_table, wide_immediate_12
- wsr HCILT_active_entry, 3
- wsr HCILT_table, wide_immediate_29
such that entries 0, 4, 12, and 29 of the CILT are copied into the HCILT 226. The context is then switched in (block 506). Any number of different policies can be provided to determine which entries from the CILT are copied into the HCILT 226 when the number of entries in the HCILT 226 is not sufficient to store all of the entries in the CILT. Further, a caching policy can be implemented as discussed above such that when a wide_immediate operand is not found in the HCILT 226 (i.e., an HCILT 226 miss) and the wide_immediate operand must be fetched from the CILT, the wide_immediate operand is copied into the HCILT 226 at that time.
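One plausible policy for choosing which CILT entries to preload is "hottest first" by profiled use count, sketched below (the use counts and their source are illustrative assumptions; the disclosure permits any selection rules):

```python
# Sketch of a "most frequently used first" preload policy for an HCILT
# smaller than the CILT. The use counts would come from profiling or
# compiler hints; here they are illustrative data.

def select_preload(use_counts, hcilt_capacity):
    """Pick CILT indices to preload, hottest first, capped at capacity."""
    ranked = sorted(use_counts, key=use_counts.get, reverse=True)
    return sorted(ranked[:hcilt_capacity])  # return indices in table order

counts = {0: 40, 4: 25, 12: 18, 29: 9, 7: 2}  # CILT index -> profiled uses
preload = select_preload(counts, 4)
```

With the assumed counts, the four hottest entries are indices 0, 4, 12, and 29, matching the restore sequence shown above.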
Claims (20)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/579,161 US11036512B2 (en) | 2019-09-23 | 2019-09-23 | Systems and methods for processing instructions having wide immediate operands |
KR1020227012866A KR20220065017A (en) | 2019-09-23 | 2020-06-19 | Systems and methods for processing instructions with wide immediate operands |
PCT/US2020/038570 WO2021061234A1 (en) | 2019-09-23 | 2020-06-19 | Systems and methods for processing instructions having wide immediate operands |
JP2022518185A JP2022548392A (en) | 2019-09-23 | 2020-06-19 | System and method for processing instructions containing wide immediate operands |
CN202080066451.9A CN114430822A (en) | 2019-09-23 | 2020-06-19 | System and method for processing instructions with wide immediate operands |
EP20737732.6A EP4034990A1 (en) | 2019-09-23 | 2020-06-19 | Systems and methods for processing instructions having wide immediate operands |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/579,161 US11036512B2 (en) | 2019-09-23 | 2019-09-23 | Systems and methods for processing instructions having wide immediate operands |
Publications (2)
Publication Number | Publication Date |
---|---|
US20210089308A1 (en) | 2021-03-25
US11036512B2 (en) | 2021-06-15
Family
ID=71528002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/579,161 Active US11036512B2 (en) | 2019-09-23 | 2019-09-23 | Systems and methods for processing instructions having wide immediate operands |
Country Status (6)
Country | Link |
---|---|
US (1) | US11036512B2 (en) |
EP (1) | EP4034990A1 (en) |
JP (1) | JP2022548392A (en) |
KR (1) | KR20220065017A (en) |
CN (1) | CN114430822A (en) |
WO (1) | WO2021061234A1 (en) |
-
2019
- 2019-09-23 US US16/579,161 patent/US11036512B2/en active Active
-
2020
- 2020-06-19 WO PCT/US2020/038570 patent/WO2021061234A1/en unknown
- 2020-06-19 KR KR1020227012866A patent/KR20220065017A/en unknown
- 2020-06-19 EP EP20737732.6A patent/EP4034990A1/en active Pending
- 2020-06-19 JP JP2022518185A patent/JP2022548392A/en active Pending
- 2020-06-19 CN CN202080066451.9A patent/CN114430822A/en active Pending
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5881259A (en) * | 1996-09-23 | 1999-03-09 | Arm Limited | Input operand size and hi/low word selection control in data processing systems |
US6012125A (en) * | 1997-06-20 | 2000-01-04 | Advanced Micro Devices, Inc. | Superscalar microprocessor including a decoded instruction cache configured to receive partially decoded instructions |
US7730281B2 (en) | 1998-12-30 | 2010-06-01 | Intel Corporation | System and method for storing immediate data |
US6725360B1 (en) * | 2000-03-31 | 2004-04-20 | Intel Corporation | Selectively processing different size data in multiplier and ALU paths in parallel |
US7389408B1 (en) | 2003-08-05 | 2008-06-17 | Sun Microsystems, Inc. | Microarchitecture for compact storage of embedded constants |
US20090182992A1 (en) * | 2008-01-11 | 2009-07-16 | International Business Machines Corporation | Load Relative and Store Relative Facility and Instructions Therefore |
US20100312991A1 (en) | 2008-05-08 | 2010-12-09 | Mips Technologies, Inc. | Microprocessor with Compact Instruction Set Architecture |
US20100169621A1 (en) * | 2008-12-26 | 2010-07-01 | Fujitsu Limited | Processor test apparatus, processor test method, and processor test program |
US20120151189A1 (en) * | 2009-03-31 | 2012-06-14 | Freescale Semiconductor, Inc. | Data processing with variable operand size |
US9063749B2 (en) | 2011-05-27 | 2015-06-23 | Qualcomm Incorporated | Hardware support for hashtables in dynamic languages |
US9207880B2 (en) | 2013-12-27 | 2015-12-08 | Intel Corporation | Processor with architecturally-visible programmable on-die storage to store data that is accessible by instruction |
US20170286110A1 (en) * | 2016-03-31 | 2017-10-05 | Intel Corporation | Auxiliary Cache for Reducing Instruction Fetch and Decode Bandwidth Requirements |
US10261791B2 (en) | 2017-02-24 | 2019-04-16 | International Business Machines Corporation | Bypassing memory access for a load instruction using instruction address mapping |
Non-Patent Citations (3)
Title |
---|
"International Search Report and Written Opinion Issued in PCT Application No. PCT/US20/038570", dated Sep. 29, 2020, 14 Pages. |
Glokler, et al., "Power Efficient Semi-Automatic Instruction Encoding for Application Specific Instruction Set Processors", In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, May 7, 2001, pp. 1169-1172. |
Wilcox, et al., "Tool support for software lookup table optimization", In Journal of Scientific Programming, vol. 19, Issue 4, Dec. 5, 2011, 36 Pages. |
Also Published As
Publication number | Publication date |
---|---|
EP4034990A1 (en) | 2022-08-03 |
WO2021061234A1 (en) | 2021-04-01 |
JP2022548392A (en) | 2022-11-18 |
KR20220065017A (en) | 2022-05-19 |
US20210089308A1 (en) | 2021-03-25 |
CN114430822A (en) | 2022-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6744423B2 (en) | Implementation of load address prediction using address prediction table based on load path history in processor-based system | |
US6832296B2 (en) | Microprocessor with repeat prefetch instruction | |
US20080065809A1 (en) | Optimized software cache lookup for simd architectures | |
US11068273B2 (en) | Swapping and restoring context-specific branch predictor states on context switches in a processor | |
CN117421259A (en) | Servicing CPU demand requests with in-flight prefetching | |
CN114546485A (en) | Instruction fetch unit for predicting the target of a subroutine return instruction | |
US11360773B2 (en) | Reusing fetched, flushed instructions after an instruction pipeline flush in response to a hazard in a processor to reduce instruction re-fetching | |
JP2022545848A (en) | Deferring cache state updates in non-speculative cache memory in a processor-based system in response to a speculative data request until the speculative data request becomes non-speculative | |
US11036512B2 (en) | Systems and methods for processing instructions having wide immediate operands | |
US20190065060A1 (en) | Caching instruction block header data in block architecture processor-based systems | |
CN112395000B (en) | Data preloading method and instruction processing device | |
WO2021055056A1 (en) | Dynamic hammock branch training for branch hammock detection in an instruction stream executing in a processor | |
US11915002B2 (en) | Providing extended branch target buffer (BTB) entries for storing trunk branch metadata and leaf branch metadata | |
US10896041B1 (en) | Enabling early execution of move-immediate instructions having variable immediate value sizes in processor-based devices | |
CN115080464B (en) | Data processing method and data processing device | |
US11755327B2 (en) | Delivering immediate values by using program counter (PC)-relative load instructions to fetch literal data in processor-based devices | |
US11487545B2 (en) | Processor branch prediction circuit employing back-invalidation of prediction cache entries based on decoded branch instructions and related methods | |
CN116627505A (en) | Instruction cache and operation method, processor core and instruction processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PERAIS, ARTHUR;SMITH, RODNEY WAYNE;PRIYADARSHI, SHIVAM;AND OTHERS;SIGNING DATES FROM 20190920 TO 20190923;REEL/FRAME:050463/0343 |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |