EP3320428A1 - Processor with efficient memory access - Google Patents

Processor with efficient memory access

Info

Publication number
EP3320428A1
Authority
EP
European Patent Office
Prior art keywords
memory
outcome
instructions
instruction
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP16820923.7A
Other languages
German (de)
English (en)
Other versions
EP3320428A4 (fr)
Inventor
Noam Mizrahi
Jonathan Friedmann
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Centipede Semi Ltd
Original Assignee
Centipede Semi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US14/794,853 (US20170010973A1)
Priority claimed from US14/794,841 (US20170010972A1)
Priority claimed from US14/794,835 (US10185561B2)
Priority claimed from US14/794,837 (US9575897B2)
Application filed by Centipede Semi Ltd
Publication of EP3320428A1
Publication of EP3320428A4

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • G06F9/3832 Value prediction for operands; operand history buffers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/35 Indirect addressing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3826 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3834 Maintaining memory consistency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/384 Register renaming

Definitions

  • the present invention relates generally to microprocessor design, and particularly to methods and systems for efficient memory access in microprocessors.
  • An embodiment of the present invention that is described herein provides a method including, in a processor, processing program code that includes memory-access instructions, wherein at least some of the memory-access instructions include symbolic expressions that specify memory addresses in an external memory in terms of one or more register names.
  • a relationship between the memory addresses accessed by two or more of the memory-access instructions is identified, based on respective formats of the memory addresses specified in the symbolic expressions.
  • An outcome of at least one of the memory-access instructions is assigned to be served from an internal memory in the processor, based on the identified relationship.
  • identifying the relationship between the memory addresses is independent of actual numerical values of the memory addresses. In an embodiment, identifying the relationship between the memory addresses is performed at a point in time at which the actual numerical values of the memory addresses are undefined. In a disclosed embodiment, identifying the relationship is performed by a given pipeline stage in a pipeline of the processor, and the actual numerical values of the memory addresses are calculated in another pipeline stage that is located later in the pipeline than the given pipeline stage. In some embodiments, identifying the relationship includes searching the program code for memory-access instructions that specify the memory addresses using identical symbolic expressions. In an embodiment, identifying the relationship includes searching the program code for memory-access instructions that specify the memory addresses using different symbolic expressions that refer to the same memory address. In another embodiment, assigning the outcome of at least one of the memory-access instructions is performed by a decoding unit or a renaming unit in a pipeline of the processor.
  • assigning the outcome to be served from the internal memory further includes executing a memory-access instruction in the external memory, and verifying that the outcome of the memory-access instruction executed in the external memory matches the outcome assigned to the memory-access instruction from the internal memory.
  • verifying the outcome includes comparing the outcome of the memory-access instruction executed in the external memory with the outcome assigned to the memory-access instruction from the internal memory.
  • verifying the outcome includes verifying that no intervening event causes a mismatch between the outcome in the external memory and the outcome assigned from the internal memory.
  • verifying the outcome includes adding to the program code one or more instructions or micro-ops that verify the outcome, or modifying one or more existing instructions or micro-ops to the instructions or micro-ops that verify the outcome.
  • the method further includes flushing subsequent code upon finding that the outcome executed in the external memory does not match the outcome served from the internal memory.
  • the method further includes inhibiting the at least one of the memory-access instructions from being executed in the external memory. In other embodiments, the method further includes parallelizing execution of the program code, including assignment of the outcome from the internal memory, over multiple hardware threads. In another embodiment, processing the program code includes executing the program code, including assignment of the outcome from the internal memory, in a single hardware thread.
  • identifying the relationship includes identifying the memory-access instructions in a loop or a function. In another embodiment, identifying the relationship is performed at runtime. In an embodiment, identifying the relationship is performed, at least partly, based on indications embedded in the program code.
  • a processor including an internal memory and processing circuitry.
  • the processing circuitry is configured to process program code that includes memory-access instructions, wherein at least some of the memory-access instructions include symbolic expressions that specify memory addresses in an external memory in terms of one or more register names, to identify a relationship between the memory addresses accessed by two or more of the memory-access instructions, based on respective formats of the memory addresses specified in the symbolic expressions, and to assign an outcome of at least one of the memory-access instructions to be served from the internal memory, based on the identified relationship.
  • Fig. 1 is a block diagram that schematically illustrates a processor, in accordance with an embodiment of the present invention
  • Fig. 2 is a flow chart that schematically illustrates a method for processing code that contains memory-access instructions, in accordance with an embodiment of the present invention
  • Fig. 3 is a flow chart that schematically illustrates a method for processing code that contains recurring load instructions, in accordance with an embodiment of the present invention
  • Fig. 4 is a flow chart that schematically illustrates a method for processing code that contains load-store instruction pairs, in accordance with an embodiment of the present invention
  • Fig. 5 is a flow chart that schematically illustrates a method for processing code that contains repetitive load-store instruction pairs with intervening data manipulation, in accordance with an embodiment of the present invention.
  • Fig. 6 is a flow chart that schematically illustrates a method for processing code that contains recurring load instructions from nearby memory addresses, in accordance with an embodiment of the present invention.
  • Embodiments of the present invention that are described herein provide improved methods and systems for processing software code that includes memory-access instructions.
  • a processor monitors the code instructions, and finds relationships between memory-access instructions. Relationships may comprise, for example, multiple load instructions that access the same memory address, load and store instruction pairs that access the same memory address, or multiple load instructions that access a predictable pattern of memory addresses.
  • the processor is able to serve the outcomes of some memory-access instructions, to subsequent code that depends on the outcomes, from internal memory (e.g., internal registers, local buffer) instead of from external memory.
  • reading from the external memory via a cache, which is possibly internal to the processor, is also regarded as serving an instruction from the external memory.
  • When multiple load instructions read from the same memory address, the processor reads a value from this memory address on the first load instruction, and saves the value to an internal register.
  • the processor serves the value to subsequent code from the internal register, without waiting for the load instruction to retrieve the value from the memory address.
  • next load instructions are still carried out in the external memory, e.g., in order to verify that the value served from the internal memory is still valid, but execution progress does not have to wait for them to complete. This feature improves performance since the dependencies of subsequent code on the load instructions are broken, and instruction parallelization can be improved.
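  • The recurring-load behavior described above can be sketched as a small software model. This is an illustrative simulation only, not the hardware mechanism: the class `RecurringLoad`, the exception `FlushRequired` and the dictionary standing in for external memory are all hypothetical names introduced here. The flush on mismatch follows the verification and flushing embodiments described earlier.

```python
class FlushRequired(Exception):
    """Raised when a speculatively served value turns out to be stale."""


class RecurringLoad:
    """Hypothetical model of a load that recurs at one external address."""

    def __init__(self, external_memory, address):
        self.external_memory = external_memory  # dict: address -> value
        self.address = address
        self.saved = None    # contents of the (assumed) internal register
        self.valid = False   # True once a value has been saved

    def serve(self):
        """Return the value that dependent code should use for this load."""
        if not self.valid:
            # First load: go to external memory and keep a copy internally.
            self.saved = self.external_memory[self.address]
            self.valid = True
        # Later loads: served from the internal register without waiting.
        return self.saved

    def verify(self):
        """Model the load that is still carried out in external memory."""
        actual = self.external_memory[self.address]
        if actual != self.saved:
            self.valid = False  # force a reload on the next serve()
            raise FlushRequired(
                f"served {self.saved}, external memory holds {actual}"
            )
```

  • In this sketch `serve()` returns immediately after the first load, while `verify()` stands for the load that completes later in external memory; a caller that catches `FlushRequired` would discard the dependent work and call `serve()` again.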
  • the processor identifies the relationships between memory-access instructions based on the formats of the symbolic expressions that specify the memory addresses in the instructions, and not based on the actual numerical values of the addresses.
  • the symbolic expressions are available early in the pipeline, as soon as the instructions are decoded.
  • the disclosed techniques identify and act upon interrelated memory-access instructions with small latency, thus enabling fast operation and a high degree of parallelization.
  • the disclosed techniques provide considerable performance improvements and are suitable for implementation in a wide variety of processor architectures, including both multi-thread and single-thread architectures.
  • Fig. 1 is a block diagram that schematically illustrates a processor 20, in accordance with an embodiment of the present invention.
  • Processor 20 runs pre-compiled software code, while parallelizing the code execution. Instruction parallelization is performed by the processor at run-time, by analyzing the program instructions as they are fetched from memory and processed.
  • processor 20 comprises multiple hardware threads 24 that are configured to operate in parallel. Each thread 24 is configured to process a respective segment of the code. Certain aspects of thread parallelization, including definitions and examples of partially repetitive segments, are addressed, for example, in U.S. Patent Applications 14/578,516, 14/578,518, 14/583,119, 14/637,418, 14/673,884, 14/673,889 and 14/690,424, which are all assigned to the assignee of the present patent application and whose disclosures are incorporated herein by reference.
  • each thread 24 comprises a fetching unit 28, a decoding unit 32 and a renaming unit 36.
  • Fetching units 28 fetch the program instructions of their respective code segments from a memory, e.g., from a multi-level instruction cache.
  • the multi-level instruction cache comprises a Level-1 (L1) instruction cache 40 and a Level-2 (L2) cache 42 that cache instructions stored in a memory 43.
  • Decoding units 32 decode the fetched instructions (and possibly transform them into micro-ops), and renaming units 36 carry out register renaming.
  • OOO buffer 44 comprises a register file 48.
  • the processor further comprises a dedicated register file 50, also referred to herein as an internal memory.
  • Register file 50 comprises one or more dedicated registers that are used for expediting memory-access instructions, as will be explained in detail below.
  • execution units 52 comprise two Arithmetic Logic Units (ALU) denoted ALU0 and ALU1, a Multiply-Accumulate (MAC) unit, two Load-Store Units (LSU) denoted LSU0 and LSU1, a Branch execution Unit (BRU) and a Floating-Point Unit (FPU).
  • execution units 52 may comprise any other suitable types of execution units, and/or any other suitable number of execution units of each type.
  • the cascaded structure of threads 24, OOO buffer 44 and execution units 52 is referred to herein as the pipeline of processor 20.
  • the results produced by execution units 52 are saved in register file 48 and/or register file 50, and/or stored in memory 43.
  • a multi-level data cache mediates between execution units 52 and memory 43.
  • the multi-level data cache comprises a Level-1 (L1) data cache 56 and L2 cache 42.
  • the Load-Store Units (LSU) of processor 20 store data in memory 43 when executing store instructions, and retrieve data from memory 43 when executing load instructions.
  • the data storage and/or retrieval operations may use the data cache (e.g., L1 cache 56 and L2 cache 42) for reducing memory access latency.
  • high-level cache (e.g., L2 cache) may be implemented, for example, as separate memory areas in the same physical memory, or simply share the same memory without fixed pre-allocation.
  • memory 43, L1 caches 40 and 56, and L2 cache 42 are referred to collectively as an external memory 41. Any access to memory 43, cache 40, cache 56 or cache 42 is regarded as an access to the external memory. References to "addresses in the external memory" or "addresses in external memory 41" refer to the addresses of data in memory 43, even though the data may be physically retrieved by reading cached copies of the data in cache 56 or 42. By contrast, access to register file 50, for example, is regarded as access to internal memory.
  • a branch prediction unit 60 predicts branches or flow-control traces (multiple branches in a single prediction), referred to herein as "traces" for brevity, that are expected to be traversed by the program code during execution.
  • the code may be executed in a single-thread processor or a single thread within a multi-thread processor, or by the various threads 24 as described in U.S. Patent Applications 14/578,516, 14/578,518, 14/583,119, 14/637,418, 14/673,884, 14/673,889 and 14/690,424, cited above.
  • branch prediction unit 60 instructs fetching units 28 which new instructions are to be fetched from memory.
  • Branch prediction in this context may predict entire traces for segments or for portions of segments, or predict the outcome of individual branch instructions.
  • a state machine unit 64 manages the states of the various threads 24, and invokes threads to execute segments of code as appropriate.
  • processor 20 parallelizes the processing of program code among threads 24.
  • processor 20 performs efficient processing of memory-access instructions using methods that are described in detail below.
  • Parallelization tasks are typically performed by various units of the processor. For example, branch prediction unit 60 typically predicts the control-flow traces for the various threads, state machine unit 64 invokes threads to execute appropriate segments at least partially in parallel, and renaming units 36 handle memory-access parallelization.
  • memory-access parallelization may alternatively be performed by decoding units 32, and/or jointly by decoding units 32 and renaming units 36.
  • units 60, 64, 32 and 36 are referred to collectively as thread parallelization circuitry (or simply parallelization circuitry for brevity).
  • the parallelization circuitry may comprise any other suitable subset of the units in processor 20.
  • some or even all of the functionality of the parallelization circuitry may be carried out using run-time software.
  • run-time software is typically separate from the software code that is executed by the processor and may run, for example, on a separate processing core.
  • register file 50 is referred to as internal memory, and the terms “internal memory” and “internal register” are sometimes used interchangeably.
  • the remaining processor elements are referred to herein collectively as processing circuitry that carries out the disclosed techniques using the internal memory.
  • other suitable types of internal memory can also be used for carrying out the disclosed techniques.
  • the processor pipeline may comprise, for example, a single fetching unit 28, a single decoding unit 32, a single renaming unit 36, and no state machine 64.
  • the disclosed techniques accelerate memory access in single-thread processing.
  • Although the examples below refer to memory-access acceleration functions being performed by the parallelization circuitry, these functions may generally be carried out by the processing circuitry of the processor.
  • processor 20 shown in Fig. 1 is an example configuration that is chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable processor configuration can be used.
  • multi-threading is implemented using multiple fetching, decoding and renaming units. Additionally or alternatively, multi-threading may be implemented in many other ways, such as using multiple OOO buffers, separate execution units per thread and/or separate register files per thread. In another embodiment, different threads may comprise different respective processing cores.
  • the processor may be implemented without cache or with a different cache structure, without branch prediction or with a separate branch prediction per thread.
  • the processor may comprise additional elements not shown in the figure.
  • the disclosed techniques can be carried out with processors having any other suitable micro-architecture.
  • the disclosed techniques can be used to improve the processor performance, e.g., replace (and reduce) memory access time with register access time, reduce the number of external memory access operations, regardless of thread parallelization.
  • Such techniques can be applied in single-thread configurations or other configurations that do not necessarily involve thread parallelization.
  • Processor 20 can be implemented using any suitable hardware, such as using one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other device types. Additionally or alternatively, certain elements of processor 20 can be implemented using software, or using a combination of hardware and software elements.
  • the instruction and data cache memories can be implemented using any suitable type of memory, such as Random Access Memory (RAM).
  • Processor 20 may be programmed in software to carry out the functions described herein.
  • the software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • the parallelization circuitry of processor 20 monitors the code processed by one or more threads 24, identifies code segments that are at least partially repetitive, and parallelizes execution of these code segments. Certain aspects of parallelization functions performed by the parallelization circuitry, including definitions and examples of partially repetitive segments, are addressed, for example, in U.S. Patent Applications 14/578,516, 14/578,518, 14/583,119, 14/637,418, 14/673,884, 14/673,889 and 14/690,424, cited above.
  • the program code that is processed by processor 20 contains memory-access instructions such as load and store instructions.
  • different memory-access instructions in the code are inter-related, and these relationships can be exploited for improving performance.
  • different memory-access instructions may access the same memory address, or a predictable pattern of memory addresses.
  • one memory-access instruction may read or write a certain value, subsequent instructions may manipulate that value in a predictable way, and a later memory-access instruction may then write the manipulated value to memory.
  • the parallelization circuitry in processor 20 identifies such relationships between memory-access instructions, and uses the relationships to improve parallelization performance.
  • the parallelization circuitry identifies the relationships by analyzing the formats of the symbolic expressions that specify the addresses accessed by the memory-access instructions (as opposed to the numerical values of the addresses).
  • the operand of a memory-access instruction comprises a symbolic expression, i.e., an expression defined in terms of one or more register names, specifying the memory-access operation to be performed.
  • the symbolic expression of a memory-access instruction may specify, for example, the memory address to be accessed, a register whose value is to be written, or a register into which a value is to be read.
  • the symbolic expressions may have a wide variety of formats. Different symbolic formats may relate to different addressing modes (e.g., direct vs. indirect addressing), or to pre-incrementing or post-incrementing of indices, to name just a few examples.
  • decoding units 32 decode the instructions, including the symbolic expressions.
  • At this stage, the actual numerical values of the expressions (e.g., numerical memory addresses to be accessed and/or numerical values to be written) are not yet known.
  • the symbolic expressions are typically evaluated later, by renaming units 36, just before the instructions are written to OOO buffer 44. Only at the execution stage, the LSUs and/or ALUs evaluate the symbolic expressions and assign the memory-access instructions actual numerical values.
  • In one example embodiment, the numerical memory addresses to be accessed are evaluated in the LSU and the numerical values to be written are evaluated in the ALU. In another example embodiment, both the numerical memory addresses to be accessed, and the numerical values to be written, are evaluated in the LSU.
  • the time delay between decoding an instruction (making the symbolic expression available) and evaluating the numerical values in the symbolic expression is not only due to the pipeline delay.
  • a symbolic expression of a given memory-access instruction cannot be evaluated (assigned numerical values) until the outcome of a previous instruction is available. Because of such dependencies, the symbolic expression may be available, in symbolic form, long before (possibly several tens of cycles before) it can be evaluated.
  • the parallelization circuitry identifies and exploits the relationships between memory-access instructions by analyzing the formats of the symbolic expressions. As explained above, the relationships may be identified and exploited at a point in time at which the actual numerical values are still undefined and cannot be evaluated (e.g., because they depend on other instructions that were not yet executed). Since this process does not wait for the actual numerical values to be assigned, it can be performed early in the pipeline. As a result, subsequent code that depends on the outcomes of the memory-access instructions can be executed sooner, dependencies between instructions can be relaxed, and parallelization can thus be improved.
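  • A minimal sketch of why format-based comparison can run before any address is known. The `AddressExpr` type, its base-plus-offset shape and both helper functions are assumptions made for illustration; the source does not prescribe a representation. The sketch also assumes the base register is not written between the two instructions being compared.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AddressExpr:
    """Hypothetical symbolic address: a base register name plus an offset."""
    base: str
    offset: int = 0


def same_address_symbolically(a: AddressExpr, b: AddressExpr) -> bool:
    # Compares the formats only; no register value is consulted, so this
    # can be decided as soon as the instructions have been decoded.
    return a == b


def numeric_address(expr: AddressExpr,
                    registers: Dict[str, Optional[int]]) -> Optional[int]:
    # Returns None while the base register value is still undefined,
    # e.g. because the instruction producing it has not executed yet.
    value = registers.get(expr.base)
    return None if value is None else value + expr.offset
```

  • With `registers = {"r6": None}`, `numeric_address(AddressExpr("r6"), registers)` yields `None`, yet `same_address_symbolically(AddressExpr("r6"), AddressExpr("r6"))` is already `True`, which is the situation the paragraph above describes.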
  • the disclosed techniques are applied in regions of the code containing one or more code segments that are at least partially repetitive, e.g., loops or functions.
  • the disclosed techniques can be applied in any other suitable region of the code, e.g., sections of loop iterations, sequential code and/or any other suitable instruction sequence, with a single or multi-threaded processor.
  • Fig. 2 is a flow chart that schematically illustrates a method for processing code that contains memory-access instructions, in accordance with an embodiment of the present invention.
  • the method begins with the parallelization circuitry in processor 20 monitoring code instructions, at a monitoring step 70.
  • the parallelization circuitry analyzes the formats of the symbolic expressions of the monitored memory-access instructions, at a symbolic analysis step 74.
  • the parallelization circuitry analyzes the parts of the symbolic expressions that specify the addresses to be accessed.
  • Based on the analyzed symbolic expressions, the parallelization circuitry identifies relationships between different memory-access instructions, at a relationship identification step 78. Based on the identified relationships, at a serving step 82, the parallelization circuitry serves the outcomes of at least some of the memory-access instructions from internal memory (e.g., internal registers of processor 20) instead of from external memory 41.
  • serving a memory-access instruction from external memory 41 covers the cases of serving a value that is stored in memory 43, or cached in cache 56 or 42.
  • serving a memory-access instruction from internal memory refers to serving the value either directly or indirectly.
  • One example of serving the value indirectly is copying the value to an internal register, and then serving the value from that internal register.
  • Serving from the internal memory may be assigned, for example, by decoding unit 32 or renaming unit 36 of the relevant thread 24 and later performed by one of execution units 52.
  • the parallelization circuitry identifies multiple load instructions (e.g., ldr instructions) that read from the same memory address in the external memory.
  • the identification typically also includes verifying that no store instruction writes to this same memory address between the load instructions.
  • Consider, for example, a load instruction of the form ldr r1,[r6] that is found inside a loop, wherein r6 is a global register.
  • the term "global register" refers to a register that is not written to between the various loads within the loop iterations (i.e., the register value does not change between loop iterations).
  • the instruction above loads from memory the value that resides at the address held in r6, and puts it in r1.
  • the parallelization circuitry analyzes the format of the symbolic expression of the address "[r6]", identifies that r6 is global, recognizes that the symbolic expression is defined in terms of one or more global registers, and concludes that the load instructions in the various loop iterations all read from the same address in the external memory.
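  • The global-register reasoning can be sketched as a scan over a loop body. The `Instr` record and the functions `global_registers` and `recurring_loads` are hypothetical and exist only for this illustration. The store check here compares address expressions textually, so a store that reaches the same address through a different expression is not detected; the source handles such cases by verifying the served outcome.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction."""
    op: str                            # e.g. "ldr", "str", "add"
    writes: Tuple[str, ...] = ()       # registers written by the instruction
    addr_regs: Tuple[str, ...] = ()    # registers named in the address
    addr_text: Optional[str] = None    # address as written, e.g. "[r6]"


def global_registers(loop_body: List[Instr], candidates: Set[str]) -> Set[str]:
    """Registers among `candidates` that no instruction in the loop writes."""
    written = {reg for ins in loop_body for reg in ins.writes}
    return set(candidates) - written


def recurring_loads(loop_body: List[Instr]) -> List[Instr]:
    """Loads addressed only via global registers, with no matching store."""
    addr_regs = {reg for ins in loop_body for reg in ins.addr_regs}
    globals_ = global_registers(loop_body, addr_regs)
    stored = {ins.addr_text for ins in loop_body if ins.op == "str"}
    return [
        ins for ins in loop_body
        if ins.op == "ldr"
        and ins.addr_regs
        and set(ins.addr_regs) <= globals_
        and ins.addr_text not in stored
    ]
```

  • For a body containing ldr r1,[r6] and no write to r6, the load is reported as recurring; adding any instruction that writes r6, or a store to [r6], removes it from the result.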
  • all the identified load instructions specify the address using the same symbolic expression.
  • the parallelization circuitry identifies load instructions that read from the same memory address, even though different load instructions may specify the memory address using different symbolic expressions. For example, the load instructions
  • ldr r4,[r6] all access the same memory address (in the first load the register r6 is first updated by adding 4 to its value).
  • Another example for accessing the same memory address is repetitive load instructions such as:
  • the parallelization circuitry may recognize that these symbolic expressions all refer to the same address in various ways, e.g., by holding a predefined list of equivalent formats of symbolic expressions that specify the same address.
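  • One way to picture a predefined list of equivalent formats is a table of pattern pairs. The table entries below are assumptions chosen for illustration, not the list used by the source: the second entry models a pre-indexed load with write-back (written here as [rX,#imm]!) followed by a plain [rX] load, which reads the same address provided rX is not written in between. The function name `formats_equivalent` is likewise hypothetical.

```python
import re

# Hypothetical table of (earlier load format, later load format) pairs that
# are treated as naming the same address when the register names agree.
EQUIVALENT_FORMATS = [
    # [rX] followed by [rX]
    (r"\[(?P<reg>r\d+)\]", r"\[(?P<reg>r\d+)\]"),
    # [rX,#imm]! updates rX first, so a following [rX] reads the same address
    (r"\[(?P<reg>r\d+),#\d+\]!", r"\[(?P<reg>r\d+)\]"),
]


def formats_equivalent(earlier: str, later: str) -> bool:
    """True if the two address expressions match a listed pair of formats."""
    for earlier_pattern, later_pattern in EQUIVALENT_FORMATS:
        m_earlier = re.fullmatch(earlier_pattern, earlier)
        m_later = re.fullmatch(later_pattern, later)
        if m_earlier and m_later and m_earlier.group("reg") == m_later.group("reg"):
            return True
    return False
```

  • The order of the pair matters in this sketch: `formats_equivalent("[r6,#4]!", "[r6]")` holds, whereas the reverse order does not, because the later instruction would then update r6 before accessing memory.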
  • Upon identifying such a relationship, the parallelization circuitry saves the value read from the external memory by one of the load instructions in an internal register, e.g., in one of the dedicated registers in register file 50. For example, the parallelization circuitry may save the value read by the load instruction in the first loop iteration. When executing a subsequent load instruction, the parallelization circuitry may serve the outcome of the load instruction from the internal memory, without waiting for the value to be retrieved from the external memory. The value may be served from the internal memory to any subsequent code instructions that depend on this value.
  • the parallelization circuitry may identify recurring load instructions not only in loops, but also in functions, in sections of loop iterations, in sequential code, and/or in any other suitable instruction sequence.
  • processor 20 may implement the above mechanism in various ways.
  • the parallelization circuitry (typically decoding unit 32 or renaming unit 36 of the relevant thread) implements this mechanism by adding instructions or micro-ops to the code.
  • the parallelization circuitry, upon identifying the relationship between the recurring ldr instructions, adds an instruction of the form mov MSG,r1 after the ldr instruction in the first loop iteration, wherein MSG denotes a dedicated internal register. This instruction copies the value that was loaded from memory into an additional register. The first loop iteration thus becomes
  • the parallelization circuitry adds, after the ldr instruction, an instruction of the form mov r1,MSG, which assigns the value that was saved in the additional register to r1.
  • the subsequent loop iterations thus become
  • the value of register MSG will be loaded into register r1 without having to wait for the ldr instruction to retrieve the value from external memory 41.
  • Because the mov instruction is an ALU instruction and does not involve accessing the external memory, it is considerably faster than the ldr instruction (typically a single cycle instead of four cycles). Furthermore, the add instruction no longer depends on the ldr instruction but only on the mov instruction, and thus the subsequent code benefits from the reduction in processing time.
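The micro-op insertion described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the circuitry itself: the function name `insert_msg_movs`, the representation of each loop iteration as a list of instruction strings, and the two-instruction loop body are hypothetical. Only the `mov MSG,r1` and `mov r1,MSG` forms are taken from the text.

```python
def insert_msg_movs(iterations, load="ldr r1,[r6]"):
    """Return a copy of the loop iterations with the mov micro-ops added.

    iterations: list of loop iterations, each a list of instruction strings
    (a hypothetical representation chosen for this sketch).
    """
    rewritten = []
    for index, body in enumerate(iterations):
        new_body = []
        for instruction in body:
            new_body.append(instruction)
            if instruction == load:
                if index == 0:
                    # First iteration: save the loaded value in MSG.
                    new_body.append("mov MSG,r1")
                else:
                    # Later iterations: serve r1 from MSG after the ldr.
                    new_body.append("mov r1,MSG")
        rewritten.append(new_body)
    return rewritten


loop_body = ["ldr r1,[r6]", "add r7,r6,r1"]  # assumed loop body
result = insert_msg_movs([loop_body, loop_body, loop_body])
```

In this sketch the `ldr` stays in every iteration, since the text keeps executing the load so that the served value can later be verified against external memory.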
  • the parallelization circuitry implements the above mechanism without adding instructions or micro-ops to the code, but rather by configuring the way registers are renamed in renaming units 36.
  • When processing the ldr instruction in the first loop iteration, renaming unit 36 performs conventional renaming, i.e., renames destination register r1 to some physical register (denoted p8 in this example), and serves the operand r1 in the add instruction from p8.
  • When processing the mov instruction, r1 is renamed to a new physical register (e.g., p9).
  • p8 is not released when p9 is committed. The processor thus maintains the value of register p8 that holds the value loaded from memory.
  • When executing the subsequent loop iterations, on the other hand, renaming unit 36 applies a different renaming scheme.
  • the operands r1 in the add instructions of all subsequent loop iterations read the value from the same physical register p8, eliminating the need to wait for the result of the load instruction. Register p8 is released only after the last loop iteration.
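The renaming-based variant can be illustrated with a toy rename table. This is a sketch under assumptions: the class `RenameTable`, its method names, and the idea of "pinning" a physical register are hypothetical modelling choices. The text only states that p8 is not released when p9 is committed and that p8 is released after the last loop iteration.

```python
class RenameTable:
    """Toy rename table; physical registers are named p8, p9, ..."""

    def __init__(self, first_free=8):
        self.mapping = {}        # architectural register -> physical register
        self.free_counter = first_free
        self.pinned = set()      # physical registers held across iterations
        self.released = []       # physical registers returned to the free pool

    def rename_dest(self, arch_reg, pin_previous=False):
        previous = self.mapping.get(arch_reg)
        new_phys = f"p{self.free_counter}"
        self.free_counter += 1
        self.mapping[arch_reg] = new_phys
        if previous is not None:
            if pin_previous:
                self.pinned.add(previous)       # keep the loaded value alive
            else:
                self.released.append(previous)  # conventional behaviour
        return new_phys

    def unpin(self, phys):
        self.pinned.discard(phys)
        self.released.append(phys)


table = RenameTable()
p_load = table.rename_dest("r1")                    # ldr destination r1 -> p8
add_sources = [p_load]                              # add reads p8 in iteration 1
p_mov = table.rename_dest("r1", pin_previous=True)  # mov destination r1 -> p9
for _ in range(2):                                  # two further iterations
    add_sources.append(p_load)                      # add operand served from p8
table.unpin(p_load)                                 # released after the last iteration
```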
  • the parallelization circuitry may serve the read value from the internal register in any other suitable way.
  • the internal register is dedicated for this purpose only.
  • the internal register may comprise one of the processor's architectural registers in register file 48 which is not exposed to the user.
  • the internal register may comprise a register in register file 50, which is not one of the processor's architectural registers in register file 48 (like r6) or physical registers (like p8).
  • any other suitable internal memory of the processor can be used for this purpose.
  • Serving the outcome of an ldr instruction from an internal register involves a small but non-negligible probability of error. For example, if a different value were to be written to the memory address in question at any time after the first load instruction, then the actual read value will be different from the value saved in the internal register. As another example, if the value of register r6 were to be changed (even though it is assumed to be global), then the next load instruction will read from a different memory address. In this case, too, the actual read value will be different from the value saved in the internal register.
  • the parallelization circuitry verifies, after serving an outcome of a load instruction from an internal register, that the served value indeed matches the actual value retrieved by the load instruction from external memory 41. If a mismatch is found, the parallelization circuitry may flush subsequent instructions and results. Flushing typically comprises discarding all subsequent instructions from the pipeline such that all processing that was performed with a wrong operand value is discarded. In other words, the processor executes the subsequent load instructions in the external memory and retrieves the value from the memory address in question, for the purpose of verification, even though the value is served from the internal register.
  • the above verification may be performed, for example, by verifying that no store (e.g., str) instruction writes to the memory address between the recurring load instructions. Additionally or alternatively, the verification may ascertain that no fence instructions limit the possibility of serving subsequent code from the internal memory.
  • the memory address in question may be written to by another entity, e.g., by another processor or processor core, or by a debugger. In such cases it may not be sufficient to verify that the monitored program code does not contain an intervening store instruction that writes to the memory address. In an embodiment, the verification may use an indication from a memory management subsystem, indicative of whether the content of the memory address was modified.
  • In the present context, intervening store instructions, intervening fence instructions, and/or indications from a memory management subsystem are all regarded as intervening events that create a mismatch between the value in the external memory and the value served from the internal memory.
  • the verification process may consider any of these events, and/or any other suitable intervening event.
  • the parallelization circuitry may initially assume that no intervening event affects the memory address in question. If, during execution, some verification mechanism fails, the parallelization circuitry may deduce that an intervening event possibly exists, and refrain from serving the outcome from the internal memory.
  • the parallelization circuitry may add to the code an instruction or micro-op that retrieves the correct value from the external memory and compares it with the value of the internal register. The actual comparison may be performed, for example, by one of the ALUs or LSUs in execution units 52. Note that no instruction depends on the added micro-op, as it does not exist in the original code and is used only for verification. Further alternatively, the parallelization circuitry may perform the verification in any other suitable way. Note that this verification does not affect the performance benefit gained by the fast loading to register r1 when it is correct, but rather flushes this fast loading in cases where it was wrong.
  • Fig. 3 is a flow chart that schematically illustrates a method for processing code that contains recurring load instructions, in accordance with an embodiment of the present invention. The method begins with the parallelization circuitry of processor 20 identifying a recurring plurality of load instructions that access the same memory address (with no intervening event), at a recurring load identification step 90.
  • this identification is made based on the formats of the symbolic expressions of the load instructions, and not based on the numerical values of the memory addresses.
  • the identification may also consider and make use of factors such as the Program-Counter (PC) values, program addresses, instruction-indices and address-operands of the load instructions in the program code.
  • processor 20 dispatches the next load instruction for execution in external memory 41.
  • the parallelization circuitry checks whether the load instruction just executed is the first occurrence in the recurring load instructions, at a first occurrence checking step 98.
  • the parallelization circuitry saves the value read from the external memory in an internal register, at a saving step 102.
  • the parallelization circuitry serves this value to subsequent code, at a serving step 106.
  • the parallelization circuitry then proceeds to the next occurrence in the recurring load instructions, at an iteration incrementing step 110.
  • the method then loops back to step 94, for executing the next load instruction. (Other instructions in the code are omitted from this flow for the sake of clarity.)
  • the parallelization circuitry serves the outcome of the load instruction (or rather assigns the outcome to be served) from the internal register, at an internal serving step 114. Note that although step 114 appears after step 94 in the flow chart, the actual execution which relates to step 114 ends before the execution which is related to step 94.
  • the parallelization circuitry verifies whether the served value (the value saved in the internal register at step 102) is equal to the value retrieved from the external memory (retrieved at step 94 of the present iteration). If so, the method proceeds to step 110. If a mismatch is found, the parallelization circuitry flushes the subsequent instructions and/or results, at a flushing step 122.
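The flow of Fig. 3 can be sketched as a small simulation. This is a hedged illustration: the function name `run_recurring_loads`, the list of memory read results used as input, and the choice to re-save the fresh value after a flush are assumptions that the text does not specify. The step numbers in the comments follow the text.

```python
def run_recurring_loads(memory_reads):
    """Simulate one group of recurring loads that share an address.

    memory_reads: the value external memory returns for each occurrence of
    the load, in program order (assumed input for this sketch).
    Returns a list of (value_seen_by_dependent_code, flushed) tuples.
    """
    internal_register = None
    results = []
    for occurrence, actual in enumerate(memory_reads):  # load dispatched (step 94)
        if occurrence == 0:                              # first occurrence (step 98)
            internal_register = actual                   # saving step 102
            results.append((actual, False))              # serving step 106
            continue
        served = internal_register                       # internal serving step 114
        if served == actual:                             # verification
            results.append((served, False))
        else:
            # Flushing step 122: work done with the wrong operand is discarded
            # and dependent code re-runs with the value actually read.
            results.append((actual, True))
            internal_register = actual                   # assumption: re-save
    return results


trace = run_recurring_loads([7, 7, 9, 9])
```

Here the third occurrence models an intervening event, such as a store by another core, that changed the memory content between loads.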
  • the recurring load instructions all recur in respective code segments having the same flow-control. For example, if a loop does not contain any conditional branch instructions, then all loop iterations, including load instructions, will traverse the same flow-control trace. If, on the other hand, the loop does contain one or more conditional branch instructions, then different loop iterations may traverse different flow-control traces. In such a case, a recurring load instruction may not necessarily recur in all possible traces.
  • the parallelization circuitry serves the outcome of a recurring load instruction from the internal register only to subsequent code that is associated with the same flow-control trace as the initial load instruction (whose outcome was saved in the internal register).
  • the traces considered by the parallelization circuitry may be actual traces traversed by the code, or predicted traces that are expected to be traversed. In the latter case, if the prediction fails, the subsequent code may be flushed.
  • the parallelization circuitry serves the outcome of a recurring load instruction from the internal register to subsequent code regardless of whether it is associated with the same trace or not.
  • the parallelization circuitry may handle two or more groups of recurring read instructions, each reading from a respective common address. Such groups may be identified and handled in the same region of the code containing segments that are at least partially repetitive.
  • the parallelization circuitry may handle multiple dedicated registers (like the MSG register described above) for this purpose.
  • the recurring load instruction is located at or near the end of a loop iteration, and the subsequent code that depends on the read value is located at or near the beginning of a loop iteration.
  • the parallelization circuitry may serve a value obtained in one loop iteration to a subsequent loop iteration.
  • the iteration in which the value was initially read and the iteration to which the value is served may be processed by different threads 24 or by the same thread.
  • the parallelization circuitry is able to recognize that multiple load instructions read from the same address even when the address is specified indirectly using a pointer value stored in memory.
  • r4 is global.
  • the address [r4] holds a pointer. Nevertheless, the value of all loads to r1 (and r3) is the same in all iterations.
  • the parallelization circuitry saves the information relating to the recurring load instructions as part of a data structure (referred to as a "scoreboard") produced by monitoring the relevant region of the code.
  • the parallelization circuitry may save, for example, the address format or PC value.
  • the parallelization circuitry identifies, based on the formats of the symbolic expressions, a store instruction and a subsequent load instruction that both access the same memory address in the external memory. Such a pair is referred to herein as a "load-store pair."
  • the parallelization circuitry saves the value stored by the store instruction in an internal register, and serves (or at least assigns for serving) the outcome of the load instruction from the internal register, without waiting for the value to be retrieved from external memory 41.
  • the value may be served from the internal register to any subsequent code instructions that depend on the outcome of the load instruction in the pair.
  • the internal register may comprise, for example, one of the dedicated registers in register file 50.
  • the identification of load-store pairs and the decision whether to serve the outcome from an internal register may be performed, for example, by the relevant decoding unit 32 or renaming unit 36.
  • both the load instruction and the store instruction specify the address using the same symbolic format, such as in the code str r1,[r2]
  • the load instruction and the store instruction specify the address using different symbolic formats that nevertheless refer to the same memory address.
  • load-store pairs may comprise, for example
  • the value of r2 is updated to increase by 4 before the store address is calculated.
  • the store and load refer to the same address.
  • the value of r2 is updated to increase by 4 after the store address is calculated, while the load address is then calculated from the new value of r2 subtracted by 4.
  • the store and load refer to the same address.
  • the store and load instructions of a given load-store pair are processed by the same hardware thread 24. In alternative embodiments, the store and load instructions of a given load-store pair may be processed by different hardware threads.
  • the parallelization circuitry may serve the outcome of the load instruction from an internal register by adding an instruction or micro-op to the code.
  • This instruction or micro-op may be added at any suitable location in the code in which the data for the store instruction is ready (not necessarily after the store instruction - possibly before the store instruction). Adding the instruction or micro-op may be performed, for example, by the relevant decoding unit 32 or renaming unit 36.
  • the parallelization circuitry may add the micro-op mov MSGL,r8 that assigns the value of r8 into another register (which is referred to as MSGL) at a suitable location in which the value of r8 is available. Following the ldr instruction the parallelization circuitry may add the micro-op mov r1,MSGL that assigns the value of MSGL into register r1.
  • the parallelization circuitry may serve the outcome of the load instruction from an internal register by configuring the renaming scheme so that the outcome is served from the same physical register mapped by the store instruction.
  • This operation may be performed at any suitable time in which the data for the store instruction is already assigned to the final physical register, e.g., once the micro-op that assigns the value to r8 has passed the renaming unit.
  • renaming unit 36 may assign the value stored by the store instruction to a certain physical register, and rename the instructions that depend on the outcome of the corresponding load instruction to receive the outcome from this physical register.
  • the parallelization circuitry verifies that the registers participating in the symbolic expression of the address in the store instruction are not updated between the store instruction and the load instruction of the pair.
  • the store instruction stores a word of a certain width (e.g., a 32-bit word), and the corresponding load instruction loads a word of a different width (e.g., an 8-bit byte) that is contained within the stored word.
  • the store instruction may store a 32-bit word in a certain address, and the load instruction in the pair may load some 8-bit byte within the 32-bit word. This scenario is also regarded as a load-store pair that accesses the same memory address.
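The sub-word case can be illustrated with a short Python sketch that extracts the byte a narrower load would see from the saved 32-bit word. The function name `load_byte_from_stored_word` is hypothetical, and little-endian byte order is an assumption, since the text does not state the byte order of the memory system.

```python
import struct


def load_byte_from_stored_word(word, byte_offset):
    """Return the byte at byte_offset inside a stored 32-bit word.

    Assumes little-endian byte order (not specified in the text).
    """
    if not 0 <= byte_offset < 4:
        raise ValueError("byte_offset must be in the range 0..3")
    # Pack the word the way a little-endian memory would hold it, then
    # index the byte that the narrower load addresses.
    return struct.pack("<I", word)[byte_offset]


stored = 0x11223344                                   # value saved by the store
low_byte = load_byte_from_stored_word(stored, 0)
high_byte = load_byte_from_stored_word(stored, 3)
```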
  • the parallelization circuitry may pair a store instruction and a load instruction together, for example, even if their symbolic expressions use different registers but are known to have the same values.
  • the registers in the symbolic expressions of the addresses in the store and load instructions are indices, i.e., their values increment with a certain stride or other fixed calculation so as to address an array in the external memory.
  • the load instruction and corresponding store instruction may be located inside a loop, such that each pair accesses an incrementally-increasing memory address.
  • the parallelization circuitry verifies, when serving the outcome of the load instruction in a load-store pair from an internal register, that the served value indeed matches the actual value retrieved by the load instruction from external memory 41. If a mismatch is found, the parallelization circuitry may flush subsequent instructions and results.
  • the parallelization circuitry may add an instruction or micro-op that performs the verification.
  • the actual comparison may be performed by the ALU or alternatively in the LSU.
  • the parallelization circuitry may verify that the registers appearing in the symbolic expression of the address in the store instruction are not written to between the store instruction and the corresponding load instruction.
  • the parallelization circuitry may check for various other intervening events (e.g., fence instructions, or memory access by other entities) as explained above.
  • the parallelization circuitry (e.g., the renaming unit) may inhibit the load instruction from being executed in the external memory.
  • the parallelization circuitry serves the outcome of the load instruction in a load-store pair from the internal register only to subsequent code that is associated with a specific flow-control trace or traces in which the load-store pair was identified. For other traces, which may not comprise the load-store pair in question, the parallelization circuitry may execute the load instructions conventionally in the external memory.
  • the traces considered by the parallelization circuitry may be actual traces traversed by the code, or predicted traces that are expected to be traversed. In the latter case, if the prediction fails, the subsequent code may be flushed.
  • the parallelization circuitry serves the outcome of a load instruction from the internal register to subsequent code associated with any flow-control trace.
  • the identification of the store or load instruction in the pair and the location for inserting micro-ops may also be based on factors such as the Program-Counter (PC) values, program addresses, instruction-indices and address-operands of the load and store instructions in the program code.
  • the parallelization circuitry may save the PC value of the load instruction. This information indicates to the parallelization circuitry exactly where to insert the additional micro-op whenever the processor traverses this PC.
  • Fig. 4 is a flow chart that schematically illustrates a method for processing code that contains load-store instruction pairs, in accordance with an embodiment of the present invention.
  • the method begins with the parallelization circuitry identifying one or more load-store pairs that, based on the address format, access the same memory address, at a pair identification step 130.
  • the parallelization circuitry saves the value that is stored (or to be stored) by the store instruction in an internal register, at an internal saving step 134.
  • the parallelization circuitry does not wait for the load instruction in the pair to retrieve the value from external memory. Instead, the parallelization circuitry serves the outcome of the load instruction, to any subsequent instructions that depend on this value, from the internal register.
  • the examples above refer to a single load-store pair in a given repetitive region of the code (e.g., loop). Generally, however, the parallelization circuitry may identify and handle two or more different load-store pairs in the same code region. Furthermore, multiple load instructions may be paired to the same store instruction. The parallelization circuitry may regard this scenario as multiple load store pairs, but assign the stored value to an internal register only once.
  • the parallelization circuitry may store the information on identification of load-store pairs in the scoreboard relating to the code region in question.
  • the renaming unit may use the physical name of the register being stored as the operand of the registers to be loaded when the mov micro-op is added.
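The method of Fig. 4 can be sketched as a toy interpreter that forwards a stored value to the paired load. This is a simplified illustration under assumptions: the function name `forward_store_to_load`, the tuple encoding of instructions, and the use of the symbolic address string as the memory key are hypothetical. Pairs are matched only when the two symbolic expressions are identical, which is the simplest of the cases the text describes.

```python
def forward_store_to_load(instructions, registers, memory):
    """Run a toy instruction list, serving paired loads from an internal register.

    instructions: tuples ("str", src_reg, addr_expr) or ("ldr", dst_reg, addr_expr),
    where addr_expr is a symbolic string such as "[r2]".
    Returns the destination registers that were served internally.
    """
    internal = {}             # addr_expr -> saved value (stands in for MSGL)
    served_internally = []
    for op, reg, addr_expr in instructions:
        if op == "str":
            internal[addr_expr] = registers[reg]   # internal saving step 134
            memory[addr_expr] = registers[reg]     # the store still executes
        elif op == "ldr":
            if addr_expr in internal:
                # Paired load: do not wait for external memory.
                registers[reg] = internal[addr_expr]
                served_internally.append(reg)
            else:
                registers[reg] = memory[addr_expr]  # conventional load
    return served_internally


regs = {"r1": 5, "r8": 0, "r9": 0}
mem = {"[r2]": 0, "[r3]": 42}
program = [("str", "r1", "[r2]"), ("ldr", "r8", "[r2]"), ("ldr", "r9", "[r3]")]
served = forward_store_to_load(program, regs, mem)
```

The sketch omits verification and the check that the address registers are unchanged between the store and the load, both of which the text requires in practice.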
  • the parallelization circuitry identifies a region of the code containing one or more code segments that are at least partially repetitive, wherein the code in this region comprises repetitive load-store pairs. In some embodiments, the parallelization circuitry further identifies that the value loaded from external memory is manipulated using some predictable calculation between the load instructions of successive iterations (or, similarly, between the load instruction and the following store instruction in a given iteration).
  • the parallelization circuitry saves the loaded value in an internal register or other internal memory, and manipulates the value using the same predictable calculation.
  • the manipulated value is then assigned to be served to subsequent code that depends on the outcome of the next load instruction, without having to wait for the actual load instruction to retrieve the value from the external memory.
  • r6 is a global register. Instructions E-G increment a counter value that is stored in memory address "[r6]". Instructions A and B make use of the counter value that was set in the previous loop iteration. Between the load instruction and the store instruction, the program code manipulates the read value by some predictable manipulation (in the present example, incrementing by 1 in instruction F).
  • instruction A depends on the value stored into "[r6]" by instruction G in the previous iteration.
  • the parallelization circuitry assigns the outcome of the load instruction (instruction A) to be served to subsequent code from an internal register (or other internal memory), without waiting for the value to be retrieved from external memory.
  • the parallelization circuitry performs the same predictable manipulation on the internal register, so that the served value will be correct.
  • instruction A still depends on instruction G in the previous iteration, but instructions that depend on the value read by instruction A can be processed earlier.
  • the parallelization circuitry adds the micro-op
  • MSI denotes an internal register, such as one of the dedicated registers in register file 50.
  • the parallelization circuitry adds the micro-op
  • this micro-op increments the internal register MSI by 1, i.e., performs the same predictable manipulation of instruction F in the previous iteration.
  • the parallelization circuitry adds the micro-op
  • any instruction that depends on these load instructions will be served from the internal register MSI instead of from the external memory. Adding the instructions or micro-ops above may be performed, for example, by the relevant decoding unit 32 or renaming unit 36.
  • the parallelization circuitry performs the predictable manipulation once in each iteration, so as to serve the correct value to the code of the next iteration.
  • the parallelization circuitry may perform the predictable manipulation multiple times in a given iteration, and serve different predicted values to code of different subsequent iterations.
  • the parallelization circuitry may calculate the next n values of the counter, and provide the code of each iteration with the correct counter value. Any of these operations may be performed without waiting for the load instruction to retrieve the counter value from external memory. This advance calculation may be repeated every n iterations.
  • the parallelization circuitry in the first iteration, renames the destination register rl (in instruction A) to a physical register denoted p8.
  • the parallelization circuitry then adds one or more micro-ops or instructions (or modifies an existing micro-op, e.g., instruction A) to calculate a vector of n r8,r8,#1 values.
  • the vector is saved in a set of dedicated registers m_1 ... m_n, e.g., in register file 50.
  • the parallelization circuitry renames the operands of the add instructions (instruction D) to read from respective registers m_1 ... m_n (according to the iteration number).
  • the parallelization circuitry may comprise suitable vector-processing hardware for calculating these vectors in a small number of cycles.
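The advance calculation of counter values can be sketched as follows. This is a minimal illustration under assumptions: the function names `predict_counter_values` and `first_mismatch` are hypothetical, and the manipulation is taken to be the increment by 1 from the example in the text, generalized here to a constant `step`.

```python
def predict_counter_values(loaded_value, n, step=1):
    """Return the values the next n iterations are expected to load.

    loaded_value: the counter value actually read in the current iteration.
    Each iteration adds `step` before storing, so iteration k ahead
    is predicted to load loaded_value + k * step.
    """
    return [loaded_value + step * k for k in range(1, n + 1)]


def first_mismatch(predicted, actual):
    """Return the index of the first wrong prediction, or None if all match."""
    for index, (p, a) in enumerate(zip(predicted, actual)):
        if p != a:
            return index
    return None


vector = predict_counter_values(10, 4)           # stands in for m_1 ... m_n
check = first_mismatch(vector, [11, 12, 20, 14]) # third load saw another writer
```

A non-`None` result from `first_mismatch` corresponds to the case in which the code from that iteration onward would be flushed.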
  • Fig. 5 is a flow chart that schematically illustrates a method for processing code that contains repetitive load-store instruction pairs with intervening data manipulation, in accordance with an embodiment of the present invention.
  • the method begins with the parallelization circuitry identifying a code region containing repetitive load-store pairs having intervening data manipulation, at an identification step 140.
  • the parallelization circuitry analyzes the code so as to identify both the load-store pairs and the data manipulation.
  • the data manipulation typically comprises an operation performed by the ALU, or by another execution unit such as an FPU or MAC unit. Typically, although not necessarily, the manipulation is performed by a single instruction.
  • each load-store pair typically comprises a store instruction in a given loop iteration and a load instruction in the next iteration that reads from the same memory address.
  • the parallelization circuitry saves the value that was loaded by a first load instruction in an internal register, at an internal saving step 144.
  • the parallelization circuitry applies the same data manipulation (identified at step 140) to the internal register. The manipulation may be applied, for example, using the ALU, FPU or MAC unit.
  • the parallelization circuitry does not wait for the next load instruction to retrieve the manipulated value from external memory. Instead, the parallelization circuitry assigns the manipulated value (calculated at step 148) to any subsequent instructions that depend on the next load instruction, from the internal register.
  • the counter value is always stored in (and retrieved from) the same memory address ("[r6]", wherein r6 is a global register).
  • each iteration may store the counter value in a different (e.g., incrementally increasing) address in external memory 41.
  • the value may be loaded from a given address, manipulated and then stored in a different address.
  • a relationship still exists between the memory addresses accessed by the load and store instructions of different iterations: The load instruction in a given iteration accesses the same address as the store instruction of the previous iteration.
  • the store instruction stores a word of a certain width (e.g., a 32-bit word), and the corresponding load instruction loads a word of a different width (e.g., an 8-bit byte) that is contained within the stored word.
  • the store instruction may store a 32-bit word in a certain address, and the load instruction in the pair may load some 8-bit byte within the 32-bit word.
  • This scenario is also regarded as a load-store pair that accesses the same memory address. In such embodiments, the predictable manipulation should be applied to the smaller-size word loaded by the load instruction.
  • the parallelization circuitry typically verifies, when serving the manipulated value from the internal register, that the served value indeed matches the actual value after retrieval by the load instruction and manipulation. If a mismatch is found, the parallelization circuitry may flush subsequent instructions and results. Any suitable verification scheme can be used for this purpose, such as by adding one or more instructions or micro-ops, or by verifying that the address in the store instruction is not written to between the store instruction and the corresponding load instruction.
  • the parallelization circuitry may check for various other intervening events (e.g., fence instructions, or memory access by other entities) as explained above. Addition of instructions or micro-ops can be performed, for example, by the renaming unit. The actual comparison between the served value and the actual value may be performed by the ALU or LSU.
  • the parallelization circuitry (e.g., the renaming unit) may inhibit the load instruction from being executed in the external memory.
  • the parallelization circuitry serves the manipulated value from the internal register only to subsequent code that is associated with a specific flow-control trace or group of traces, e.g., only if the subsequent load-store pair is associated with the same flow-control trace as the current pair.
  • the traces considered by the parallelization circuitry may be actual traces traversed by the code, or predicted traces that are expected to be traversed. In the latter case, if the prediction fails, the subsequent code may be flushed.
  • the parallelization circuitry serves the manipulated value from the internal register to subsequent code associated with any flow-control trace.
  • the decision to serve the manipulated value from an internal register, and/or the identification of the location in the code for adding or manipulating micro-ops, may also consider factors such as the Program-Counter (PC) values, program addresses, instruction-indices and address-operands of the load and store instructions in the program code.
  • the decision to serve the manipulated value from an internal register, and/or the identification of the code to which the manipulated value should be served may be carried out, for example, by the relevant renaming or decoding unit.
  • the parallelization circuitry may identify and handle two or more different predictable manipulations, and/or two or more sequences of repetitive load-store pairs, in the same code region.
  • multiple load instructions may be paired to the same store instruction. This scenario may be considered by the parallelization circuitry as multiple load-store pairs, wherein the stored value is assigned to an internal register only once.
  • the parallelization circuitry may store the information on identification of load-store pairs and predictable manipulations in the scoreboard relating to the code region in question.
  • EXAMPLE RELATIONSHIP: RECURRING LOAD INSTRUCTIONS THAT ACCESS A PATTERN OF NEARBY MEMORY ADDRESSES
  • the parallelization circuitry identifies a region of the program code, which comprises a repetitive sequence of load instructions that access different but nearby memory addresses in external memory 41.
  • Such a scenario occurs, for example, in a program loop that reads values from a vector or other array stored in the external memory, in accessing the stack, or in image processing or filtering applications.
  • the load instructions in the sequence access incrementing adjacent memory addresses, e.g., in a loop that reads respective elements of a vector stored in the external memory.
  • the load instructions in the sequence access addresses that are not adjacent but differ from one another by a constant offset (sometimes referred to as "stride"). Such a case occurs, for example, in a loop that reads a particular column of an array.
  • the load instructions in the sequence may access addresses that increment or decrement in accordance with any other suitable predictable pattern.
  • the pattern is periodic.
  • the parallelization circuitry may identify any other region of code that comprises such repetitive load instructions, e.g., in sections of loop iterations, sequential code and/or any other suitable instruction sequence.
  • the parallelization circuitry identifies the sequence of repetitive load instructions, and the predictable pattern of the addresses being read from, based on the formats of the symbolic expressions that specify the addresses in the load instructions. The identification is thus performed early in the pipeline, e.g., by the relevant decoding unit or renaming unit.
  • the parallelization circuitry may access a plurality of the addresses in response to a given read instruction in the sequence, before the subsequent read instructions are processed.
  • the parallelization circuitry uses the identified pattern to read a plurality of future addresses in the sequence into internal registers (or other internal memory).
  • the parallelization circuitry may then assign any of the read values from the internal memory to one or more future instructions that depend on the corresponding read instruction, without waiting for that read instruction to read the value from the external memory.
  • the basic read operation performed by the LSUs reads a plurality of data values from a contiguous block of addresses in memory 43 (possibly via cache 56 or 42).
  • a cache line may comprise, for example, sixty-four bytes, and a single data value may comprise, for example, four or eight bytes, although any other suitable cache-line size can be used.
  • the LSU or cache reads an entire cache line regardless of the actual number of values that were requested, even when requested to read a single data value from a single address.
  • the LSU or cache reads a cache line in response to a given read instruction in the above-described sequence.
  • the cache line may also contain one or more data values that will be accessed by one or more subsequent read instructions in the sequence (in addition to the data value requested by the given read instruction).
  • the parallelization circuitry extracts the multiple data values from the cache line based on the pattern of addresses, saves them in internal registers, and serves them to the appropriate future instructions.
  • the term “nearby addresses” means addresses that are close to one another relative to the cache-line size. If, for example, each cache line comprises n data values, the parallelization circuitry may repeat the above process every n read instructions in the sequence.
  • when the parallelization circuitry, LSU or cache identifies that another cache line is needed in order to load n data values from memory, it may initiate a read from memory of the relevant cache line.
  • This technique is especially effective when a single cache line comprises many data values that will be requested by future read instructions in the sequence (e.g., when a single cache line comprises many periods of the pattern).
  • the performance benefit is also considerable when the read instructions in the sequence arrive in execution units 52 at large intervals, e.g., when they are separated by many other instructions.
  • Fig. 6 is a flow chart that schematically illustrates a method for processing code that contains recurring load instructions from nearby memory addresses, in accordance with an embodiment of the present invention.
  • the method begins at a sequence identification step 160, with the parallelization circuitry identifying a repetitive sequence of read instructions that access respective memory addresses in memory 43 in accordance with a predictable pattern.
  • an LSU in execution units 52 (or the cache) reads one or several cache lines from memory 43, possibly via cache 56 or 42, in response to a given read instruction in the sequence.
  • the parallelization circuitry extracts the data value requested by the given read instruction from the cache line.
  • the parallelization circuitry uses the identified pattern of addresses to extract from the cache lines one or more data values that will be requested by one or more subsequent read instructions in the sequence. For example, if the pattern indicates that the read instructions access every fourth address starting from some base address, the parallelization circuitry may extract every fourth data value from the cache lines.
  • the parallelization circuitry saves the extracted data values in internal memory.
  • the extracted data values may be saved, for example, in a set of internal registers in register file 50.
  • the other data in the cache lines may be discarded.
  • the parallelization circuitry may copy the entire cache lines to the internal memory, and later assign the appropriate values from the internal memory in accordance with the pattern.
  • the parallelization circuitry serves the data values from the internal registers to the subsequent code instructions that depend on them.
  • the k-th extracted data value may be served to any instruction that depends on the outcome of the k-th read instruction following the given read instruction.
  • the k-th extracted data value may be served from the internal memory without waiting for the k-th read instruction to retrieve the data value from external memory.
  • r6 is a global register.
  • This loop reads data values from every fourth address, starting from some base address that is initialized at the beginning of the loop.
  • the parallelization circuitry may identify the code region containing this loop, identify the predictable pattern of addresses, and then extract and serve multiple data values from a retrieved cache line.
  • this mechanism is implemented by adding one or more instructions or micro-ops to the code, or by modifying one or more existing instructions or micro-ops, e.g., by the relevant renaming unit 36.
  • the parallelization circuitry modifies the load (ldr) instruction to
  • MA denotes a set of internal registers, e.g., in register file 50.
  • the parallelization circuitry adds the following instruction after the ldr instruction:
  • the vec ldr instruction in the first loop iteration saves multiple retrieved values to the MA registers, and the mov instruction in the subsequent iterations assigns the values from the MA registers to register r1 with no direct relationship to the ldr instruction. This allows the subsequent add instruction to be issued/executed without waiting for the ldr instruction to complete.
  • the parallelization circuitry (e.g., renaming unit 36) implements the above mechanism by proper setting of the renaming scheme.
  • the parallelization circuitry modifies the load (ldr) instruction to
  • the parallelization circuitry renames the operands of the add instructions to read from MA(iteration_num) even though the new ldr destination is renamed to a different physical register.
  • the parallelization circuitry does not release the mapping of the MA registers in a conventional manner, i.e., on the next time the write to r1 is committed. Instead, the mapping is retained until all data values extracted from the current cache line have been served.
  • the parallelization circuitry may use a series of ldr micro-ops instead of the ldr vec instruction.
  • each cache line contains a given number of data values. If the number of loop iterations is larger than the number of data values per cache line, or if one of the loads crosses the cache-line boundary (e.g., because the loads are not necessarily aligned with the beginning of a cache line), then a new cache line should be read when the current cache line is exhausted. In some embodiments, the parallelization circuitry automatically instructs the LSU to read a next cache line.
  • repetitive load instructions that access predictable nearby address patterns may comprise:
  • all the load instructions in the sequence are processed by the same hardware thread 24 (e.g., when processing an unrolled loop, or when the processor is a single-thread processor).
  • the load instructions in the sequence may be processed by at least two different hardware threads.
  • the parallelization circuitry verifies, when serving the outcome of a load instruction in the sequence from the internal memory, that the served value indeed matches the actual value retrieved by the load instruction from external memory. If a mismatch is found, the parallelization circuitry may flush subsequent instructions and results. Any suitable verification scheme can be used for this purpose. For example, as explained above, the parallelization circuitry (e.g., the renaming unit) may add an instruction or micro-op that performs the verification. The actual comparison may be performed by the ALU or alternatively in the LSU.
  • the parallelization circuitry may also verify, e.g., based on the formats of the symbolic expressions of the instructions, that no intervening event causes a mismatch between the served values and the actual values in the external memory.
  • the parallelization circuitry may initially assume that no intervening event affects the memory address in question. If, during execution, some verification mechanism fails, the parallelization circuitry may deduce that an intervening event possibly exists, and refrain from serving the outcome from the internal memory.
  • the parallelization unit may inhibit the load instruction from being executed in the external memory.
  • the parallelization circuitry serves the outcome of a load instruction from the internal memory only to subsequent code that is associated with one or more specific flow-control traces (e.g., traces that contain the load instruction).
  • the traces considered by the parallelization circuitry may be actual traces traversed by the code, or predicted traces that are expected to be traversed. In the latter case, if the prediction fails, the subsequent code may be flushed.
  • the parallelization circuitry serves the outcome of a load instruction from the internal register to subsequent code associated with any flow-control trace.
  • the decision to assign the outcome from an internal register, and/or the identification of the locations in the code for adding or modifying instructions or micro-ops may also consider factors such as the Program-Counter (PC) values, program addresses, instruction-indices and address-operands of the load instructions in the program code.
  • the MA registers may reside in a register file having characteristics and requirements that differ from other registers of the processor.
  • this register file may have a dedicated write port buffer from the LSU, and only read ports from the other execution units 52.
  • the parallelization circuitry may identify and handle in the same code region two or more different sequences of load instructions, which access two or more respective patterns of memory addresses.
  • the parallelization circuitry may store the information on identification of the sequence of load instructions, and on the predictable pattern of memory addresses, in the scoreboard relating to the code region in question.
  • processor 20 identifies and acts upon the relationships between memory-access instructions, at least partially based on hints or other indications embedded in the program code by the compiler.
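
The flow described above for recurring loads from nearby addresses (identify the address pattern, read a cache line, extract the values the pattern will touch, save them internally, serve the k-th value, and verify it) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism of the embodiments: the function names (`detect_stride`, `read_cache_line`, `extract_values`, `serve_and_verify`) are hypothetical, the 64-byte line and 4-byte value sizes are the example figures from the text, little-endian byte order is assumed, and external memory is modeled as a plain byte array.

```python
CACHE_LINE_BYTES = 64  # example cache-line size mentioned in the text
VALUE_BYTES = 4        # example data-value size mentioned in the text


def detect_stride(addresses):
    """Return the constant byte stride of an address sequence, or None."""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    if stride == 0:
        return None
    for prev, cur in zip(addresses[1:], addresses[2:]):
        if cur - prev != stride:
            return None  # not a constant-stride pattern
    return stride


def read_cache_line(memory, address):
    """Model of the LSU read: return (line_base, line_bytes) for the
    whole cache line that contains `address`."""
    base = address - address % CACHE_LINE_BYTES
    return base, memory[base:base + CACHE_LINE_BYTES]


def extract_values(line_base, line, first_address, stride):
    """Extract every value the pattern touches inside this cache line.
    Entry k corresponds to the k-th read starting at `first_address`."""
    values = []
    addr = first_address
    while line_base <= addr and addr + VALUE_BYTES <= line_base + len(line):
        offset = addr - line_base
        word = line[offset:offset + VALUE_BYTES]
        values.append(int.from_bytes(word, "little"))
        addr += stride
    return values


def serve_and_verify(memory, first_address, stride, internal_regs, k):
    """Serve the k-th saved value and compare it with external memory.
    Returns (served_value, ok); ok == False models a mismatch that
    would trigger a flush of the dependent instructions."""
    served = internal_regs[k]
    addr = first_address + k * stride
    actual = int.from_bytes(memory[addr:addr + VALUE_BYTES], "little")
    return served, served == actual
```

As a usage illustration, a loop that reads every fourth 4-byte value (a 16-byte stride) starting at a line-aligned base address touches four values per 64-byte line, so one modeled cache-line read can serve four loop iterations before the next line is needed. If the memory is modified between the extraction and the serve, `serve_and_verify` reports a mismatch.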

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Advance Control (AREA)

Abstract

A method includes, in a processor (20), processing program code that includes memory-access instructions, wherein at least some of the memory-access instructions include symbolic expressions that specify memory addresses in an external memory (41) in terms of one or more register names. A relationship between the memory addresses accessed by two or more of the memory-access instructions is identified, based on the respective formats of the memory addresses specified in the symbolic expressions. An outcome of at least one of the memory-access instructions is assigned to be served from an internal memory (50) in the processor, based on the identified relationship.
EP16820923.7A 2015-07-09 2016-07-04 Processor with efficient memory access Withdrawn EP3320428A4 (fr)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US14/794,853 US20170010973A1 (en) 2015-07-09 2015-07-09 Processor with efficient processing of load-store instruction pairs
US14/794,841 US20170010972A1 (en) 2015-07-09 2015-07-09 Processor with efficient processing of recurring load instructions
US14/794,835 US10185561B2 (en) 2015-07-09 2015-07-09 Processor with efficient memory access
US14/794,837 US9575897B2 (en) 2015-07-09 2015-07-09 Processor with efficient processing of recurring load instructions from nearby memory addresses
PCT/IB2016/053999 WO2017006235A1 (fr) 2015-07-09 2016-07-04 Processor with efficient memory access

Publications (2)

Publication Number Publication Date
EP3320428A1 (fr) 2018-05-16
EP3320428A4 (fr) 2019-07-17

Family

ID=57685264

Family Applications (1)

Application Number Title Priority Date Filing Date
EP16820923.7A Withdrawn EP3320428A4 (fr) 2015-07-09 2016-07-04 Processor with efficient memory access

Country Status (3)

Country Link
EP (1) EP3320428A4 (fr)
CN (1) CN107710153B (fr)
WO (1) WO2017006235A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11249723B2 (en) * 2020-04-02 2022-02-15 Micron Technology, Inc. Posit tensor processing

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5911057A (en) * 1995-12-19 1999-06-08 Texas Instruments Incorporated Superscalar microprocessor having combined register and memory renaming circuits, systems, and methods
US5926832A (en) * 1996-09-26 1999-07-20 Transmeta Corporation Method and apparatus for aliasing memory data in an advanced microprocessor
US7415576B2 (en) * 2002-09-30 2008-08-19 Renesas Technology Corp. Data processor with block transfer control
US7024537B2 (en) * 2003-01-21 2006-04-04 Advanced Micro Devices, Inc. Data speculation based on addressing patterns identifying dual-purpose register
US7263600B2 (en) * 2004-05-05 2007-08-28 Advanced Micro Devices, Inc. System and method for validating a memory file that links speculative results of load operations to register values
US8452946B2 (en) * 2009-12-17 2013-05-28 Intel Corporation Methods and apparatuses for efficient load processing using buffers
US8850166B2 (en) * 2010-02-18 2014-09-30 International Business Machines Corporation Load pair disjoint facility and instruction therefore
US20140189249A1 (en) * 2012-12-28 2014-07-03 Futurewei Technologies, Inc. Software and Hardware Coordinated Prefetch
US9519586B2 (en) * 2013-01-21 2016-12-13 Qualcomm Incorporated Methods and apparatus to reduce cache pollution caused by data prefetching
CN104252425B (zh) * 2013-06-28 2017-07-28 Huawei Technologies Co., Ltd. Instruction cache management method and processor
US11494188B2 (en) * 2013-10-24 2022-11-08 Arm Limited Prefetch strategy control for parallel execution of threads based on one or more characteristics of a stream of program instructions indicative that a data access instruction within a program is scheduled to be executed a plurality of times
US20150134933A1 (en) * 2013-11-14 2015-05-14 Arm Limited Adaptive prefetching in a data processing apparatus

Also Published As

Publication number Publication date
EP3320428A4 (fr) 2019-07-17
CN107710153A (zh) 2018-02-16
CN107710153B (zh) 2022-03-01
WO2017006235A1 (fr) 2017-01-12

Similar Documents

Publication Publication Date Title
US20170010973A1 (en) Processor with efficient processing of load-store instruction pairs
KR101511837B1 (ko) Improving performance of vector partitioning loops
US9400651B2 (en) Early issue of null-predicated operations
US9256427B2 (en) Tracking multiple conditions in a general purpose register and instruction therefor
TWI758319B (zh) Apparatus and data processing method for handling inter-element address hazards for vector instructions
US9632775B2 (en) Completion time prediction for vector instructions
US20180293078A1 (en) Handling exceptional conditions for vector arithmetic instruction
US20100058034A1 (en) Creating register dependencies to model hazardous memory dependencies
US9715390B2 (en) Run-time parallelization of code execution based on an approximate register-access specification
TW201349111A (zh) Branch misprediction behavior suppression on zero predicate branch mispredict
WO2017203442A1 (fr) Processor with efficient reorder buffer (ROB) management
US11036511B2 (en) Processing of a temporary-register-using instruction including determining whether to process a register move micro-operation for transferring data from a first register file to a second register file based on whether a temporary variable is still available in the second register file
US9575897B2 (en) Processor with efficient processing of recurring load instructions from nearby memory addresses
US10185561B2 (en) Processor with efficient memory access
US7051193B2 (en) Register rotation prediction and precomputation
US9442734B2 (en) Completion time determination for vector instructions
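CN108027736B (zh) Run-time code parallelization using out-of-order renaming with pre-allocation of physical registers
CN110806898B (zh) Processor and instruction operation method
KR20070108936A (ko) Method of ceasing to wait for source operands when a conditional instruction will not be executed
CN107710153B (zh) Processor with efficient memory access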
US20170010972A1 (en) Processor with efficient processing of recurring load instructions
WO2018100456A1 (fr) Memory access control for parallelized processing
US20130151818A1 (en) Micro architecture for indirect access to a register file in a processor
EP3238040A1 (fr) Run-time code parallelization with continuous monitoring of repetitive instruction sequences
US11347506B1 (en) Memory copy size determining instruction and data transfer instruction

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20171121

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20190617

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 9/34 20180101ALI20190611BHEP

Ipc: G06F 9/38 20180101AFI20190611BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20200115