CN111133413A - Load-store unit with partitioned reorder queue using a single CAM port


Info

Publication number
CN111133413A
Authority
CN
China
Prior art keywords
instruction
load
store
entry
processing unit
Prior art date
Legal status
Granted
Application number
CN201880061955.4A
Other languages
Chinese (zh)
Other versions
CN111133413B (en)
Inventor
B. Sinharoy
B. Lloyd
C. Gonzalez
Current Assignee
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date
Filing date
Publication date
Priority claimed from US15/726,596 (US10606591B2)
Priority claimed from US15/726,627 (US11175924B2)
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN111133413A
Application granted
Publication of CN111133413B
Current legal status: Active

Classifications

    • G06F9/30189 Instruction operation extension or modification according to execution mode, e.g. mode flag
    • G06F9/3824 Operand accessing
    • G06F9/3832 Value prediction for operands; operand history buffers
    • G06F9/3834 Maintaining memory consistency
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838 Dependency mechanisms, e.g. register scoreboarding
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3856 Reordering of instructions, e.g. using queues or age tags
    • G06F12/1063 Address translation using a TLB [translation look-aside buffer] associated with a data cache, the data cache being concurrently virtually addressed
    • G06F2212/1008 Correctness of operation, e.g. memory ordering
    • G06F2212/652 Page size control
    • G06F2212/655 Same page detection
    • G06F2212/657 Virtual address space management
    • G06F2212/681 Multi-level TLB, e.g. microTLB and main TLB

Abstract

A technical solution is described for a load-store unit (LSU) that executes a plurality of instructions in an out-of-order (OoO) window using a plurality of LSU pipelines. The executing includes selecting an instruction from the OoO window, the instruction using an effective address. If the instruction is a load instruction and the processing unit is operating in a single-threaded mode, an entry is created in a first partition of a Load Reorder Queue (LRQ) if the instruction is issued on a first load pipeline, and in a second partition of the LRQ if the instruction is issued on a second load pipeline. Further, if the processing unit is operating in a multi-threaded mode, an entry is created in a first predetermined portion of the first partition of the LRQ if the instruction is issued on the first load pipeline by a first thread of the processing unit.

Description

Load-store unit with partitioned reorder queue using a single CAM port
Technical Field
Embodiments of the invention relate generally to out-of-order (OoO) processors and, more particularly, to a load-store unit (LSU) with partitioned load and store reorder queues that uses a single Content Addressable Memory (CAM) port to efficiently support out-of-order execution of instructions in an OoO processor.
Background
In an OoO processor, an Instruction Sequencing Unit (ISU) dispatches instructions to respective issue queues, renames registers in support of OoO execution, issues instructions from the respective issue queues to execution pipelines, completes executed instructions, and handles exception conditions. Register renaming is typically performed by mapper logic in the ISU before the instructions are placed in their respective issue queues. The ISU includes one or more issue queues containing dependency matrices for tracking dependencies between instructions. A dependency matrix typically includes one row and one column for each instruction in the issue queue.
Disclosure of Invention
Embodiments of the present invention include methods, systems, and computer program products for implementing an effective-address-based load-store unit in an out-of-order processor. A non-limiting example of a processing unit for executing one or more instructions includes a load-store unit (LSU) that executes a plurality of instructions in an out-of-order (OoO) window using a plurality of LSU pipelines. The executing includes selecting an instruction from the OoO window, the instruction using an effective address; and in response to the instruction being a load instruction: in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline. The executing also includes, in response to the processing unit operating in a multi-threaded mode in which multiple threads are processed concurrently, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued by a first thread of the processing unit on the first load pipeline.
According to one or more embodiments, a computer-implemented method for out-of-order execution of one or more instructions by a processing unit includes: receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) instruction window comprising a plurality of instructions to be executed out of order; and issuing, by the LSU, an instruction from the OoO window. Issuing the instruction includes selecting the instruction from the OoO window, the instruction using an effective address; and in response to the instruction being a load instruction: in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline. The issuing also includes, in response to the processing unit operating in a multi-threaded mode in which multiple threads are processed concurrently, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued by a first thread of the processing unit on the first load pipeline.
In accordance with one or more embodiments, a computer program product includes a computer-readable storage medium having program instructions embodied therewith, where the program instructions are executable by a processing unit to cause the processing unit to perform operations. The operations include receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) instruction window including a plurality of instructions to be executed out of order; and issuing, by the LSU, an instruction from the OoO window. Issuing the instruction includes selecting the instruction from the OoO window, the instruction using an effective address; and in response to the instruction being a load instruction: in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline. The issuing also includes, in response to the processing unit operating in a multi-threaded mode in which multiple threads are processed concurrently, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued by a first thread of the processing unit on the first load pipeline.
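To make the claimed entry-placement policy concrete, the following is a minimal Python sketch, not the patent's implementation: the class name, the mt_mode flag, and the even half-partition split are illustrative assumptions; only the pipe-selects-partition and thread-selects-portion behavior comes from the text above.

    class LoadReorderQueue:
        def __init__(self, entries_per_partition=40):
            # Partition 0 is fed by load pipe LD0, partition 1 by LD1.
            self.partitions = [[None] * entries_per_partition,
                               [None] * entries_per_partition]

        def create_entry(self, instr, pipe, thread, mt_mode):
            part = self.partitions[pipe]      # partition chosen by issue pipe
            if not mt_mode:
                slots = range(len(part))      # ST mode: whole partition usable
            else:
                # MT mode: each partition is split into predetermined
                # per-thread portions (e.g., T0/T1 use the low half of
                # their partition, T2/T3 the high half).
                half = len(part) // 2
                slots = (range(0, half) if thread in (0, 1)
                         else range(half, len(part)))
            for i in slots:
                if part[i] is None:
                    part[i] = instr           # allocate the entry
                    return (pipe, i)
            raise RuntimeError("LRQ partition full; dispatch must stall")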
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
Drawings
The details of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of embodiments of the invention will become further apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a block diagram of a system including an effective address based load store unit in an out-of-order processor, according to one or more embodiments of the invention;
FIG. 2 is an exemplary block diagram of a processor architecture of an OoO processor in which an Effective Address Directory (EAD) and associated mechanisms for utilizing the EAD are implemented in accordance with one or more embodiments of the invention;
FIG. 3 illustrates a load-store unit (LSU) of a processing core in accordance with one or more embodiments of the invention;
FIG. 4 is an exemplary block diagram of an Effective Address Directory (EAD) structure (L1 cache) in accordance with one illustrative embodiment;
FIG. 5 is an exemplary block diagram of an Effective Real Table (ERT) structure in accordance with one illustrative embodiment;
FIG. 6 depicts a flow diagram of an exemplary method for accessing memory for execution of instructions by an LSU in accordance with one or more embodiments of the invention;
FIG. 7 depicts a flow diagram of a method for reloading an ERT in accordance with one or more embodiments of the invention;
FIG. 8 illustrates an example structure of a Synonym Detection Table (SDT) in accordance with one or more embodiments of the invention;
FIG. 9 shows a flow diagram of a method for performing ERT and SDT EA interchange in accordance with one or more embodiments of the invention;
FIG. 10 illustrates an ERT eviction (ERTE) table in accordance with one or more embodiments of the invention;
FIG. 11 depicts a flowchart of an exemplary method for adding an entry to an ERTE table in accordance with one or more embodiments of the invention;
FIG. 12 illustrates an example sequence diagram of an example set of instructions initiated in accordance with one or more embodiments of the invention;
FIG. 13 depicts a flowchart of an example method for an LSU issuing instructions OoO on multiple pipelines, depending on whether the processor is operating in Single Thread (ST) mode or Multithreading (MT) mode, in accordance with one or more embodiments of the invention; and
FIG. 14 depicts a block diagram of a computer system for implementing some or all aspects of one or more embodiments of the invention.
The drawings described herein are illustrative. Many changes may be made to the drawings or operations described therein without departing from the spirit of the invention. For instance, the acts may be performed in a differing order, or acts may be added, deleted or modified. Furthermore, the term "coupled" and variations thereof describe having a communication path between two elements and does not imply a direct connection between the elements without intermediate elements/connections therebetween. All such variations are considered a part of the specification.
Detailed Description
One or more embodiments of the invention described herein provide an Effective Address (EA) based Load Store Unit (LSU) for an out-of-order (OoO) processor by dynamically removing effective-real address table entries in the OoO processor. The solutions described herein use an Effective Address Directory (EAD) along with an Effective Real Table (ERT) and a Synonym Detection Table (SDT), among other components, to reduce chip area and further improve OoO processor timing. Furthermore, the technical solutions described herein facilitate the LSU executing load/store instructions out of order. The LSU uses multiple pipelines to execute load/store instructions to provide improved performance. The multi-pipe implementation of the LSU is based on a partitioned ERT, Load Reorder Queue (LRQ), and Store Reorder Queue (SRQ), as described herein.
Most modern computing devices provide support for virtual memory. Virtual memory is a technology by which applications are given the impression that they have a contiguous working memory, or address space, while in reality physical memory may be fragmented and may even overflow onto disk storage. In essence, the application is given a view of the memory of the computing device in which it accesses seemingly contiguous memory in an effective address space visible to the application, using effective addresses that are then translated into physical addresses of the actual physical memory or storage device(s) to actually perform the access operation. An effective address is a value that is used to specify a memory location to be accessed by an operation, from the perspective of the entity issuing the operation (e.g., an application, process, thread, interrupt handler, kernel component, etc.).
That is, if the computing device does not support the concept of virtual memory, the effective address and the physical address are the same. However, if the computing device supports virtual memory, the effective address of a particular operation submitted by an application is translated by the computing device's memory mapping unit to a physical address that specifies the location in physical memory or storage device(s) where the operation is to be performed.
Furthermore, in modern computing devices, processors of the computing device process instructions (operations) submitted by entities (e.g., applications, processes, etc.) using a processor instruction pipeline comprising a series of data-processing elements. Instruction pipelining is a technique for increasing instruction throughput by dividing the processing of computer instructions into a series of steps, with storage of the result at the end of each step. Instruction pipelining facilitates the computing device's control circuitry issuing instructions to the processor instruction pipeline at the processing rate of the slowest step, which is much faster than the time needed to perform all of the steps at once. Processors with instruction pipelines, i.e., pipelined processors, are internally organized into stages that can work on separate jobs semi-independently. Each stage is organized and linked to the next stage in the chain, so that each stage's output is fed to the next stage until the final stage of the pipeline.
Such a pipelined processor may take the form of an in-order or an out-of-order pipelined processor. For an in-order pipelined processor, instructions are executed in order, such that if data is not available for an instruction to be processed at a particular stage of the pipeline, execution of instructions through the pipeline may be stalled until the data is available. An out-of-order pipelined processor, on the other hand, allows the processor to avoid the stalls that occur when the data needed to perform an operation is unavailable. The out-of-order processor instruction pipeline avoids these stalls by filling the "gaps" in time with other instructions that are ready to be processed, and then reordering the results at the end of the pipeline so that it appears the instructions were processed in order. The way the instructions are ordered in the original computer code is known as program order, whereas in the processor the instructions are processed in data order, i.e., in the order in which the data and operands become available in the processor's registers.
Modern processor instruction pipelines track the effective address of instructions as they flow through the instruction pipeline. Tracking the effective address of an instruction is important because it is utilized whenever processing of the instruction causes an exception to occur, the instruction flushes to a previous state, the instruction branches to a new memory location relative to its current memory location, or the instruction completes its execution.
Tracking the effective address of an instruction is costly in terms of processor chip area, power consumption, and the like. This is because effective addresses have a large size (e.g., 64 bits) and modern processor instruction pipelines are deep, i.e., have many stages, so that the lifetime of an instruction from the fetch stage of the processor instruction pipeline to the completion stage is very long. This cost may be further increased in highly multithreaded, out-of-order processors, i.e., processors that execute instructions from multiple threads out of order, because a large number of instructions from different address ranges may be processed, i.e., be "in flight," at the same time.
In one or more examples, a computing device tracks effective addresses of instructions using a combination of pipeline latches, Branch Information Queues (BIQs), and Global Completion Tables (GCTs). A latch is used to transfer the base Effective Address (EA) of an instruction group from the front end of the pipeline until it can be deposited and tracked in the GCT of an Instruction Sequencer Unit (ISU). The number of latches required to store this data is approximately the number of pipeline stages between the fetch and dispatch stages of the pipeline. This is wasteful because the EA is not typically needed during these phases. Instead, as the group of instructions flows through the pipeline, it is simply the payload data that "follows" the group of instructions. In addition, this approach results in duplicate stores because branch instructions have their EA in both the BIQ and GCT.
Accordingly, computing devices have been developed that eliminate these inefficiencies by tracking the EA only in the GCT. For example, in these newer computing devices, the instruction sequencer unit creates an entry in the GCT at fetch time. The EA is loaded into the GCT and then removed when the instruction completes. This eliminates many pipeline latches throughout the machine. Instead of a complete EA as wide as the number of address lines, e.g., a 64-bit EA, a small tag is carried through the pipeline along with the instruction group. The tag points back to the entry in the GCT that holds the base EA for the instruction group. Address storage in the BIQ is no longer needed because branches can retrieve their EA directly from the GCT as they issue. These techniques improve area efficiency, but they cannot be applied as-is to out-of-order processors. They lack sufficient information to handle address requests that arrive out of program order. Furthermore, they cannot support the dispatch and completion bandwidth required for out-of-order execution because they lack the ability to track groups of instructions that may have been formed from multiple disjoint address ranges. Historically, this mechanism only supported instruction groups from a single address range, which can significantly reduce the number of instructions available for out-of-order execution. In addition, to look up a corresponding address, e.g., the RA corresponding to an EA (or vice versa), a Content Addressable Memory (CAM) is used. A CAM implements a lookup-table function in a single clock cycle using dedicated comparison circuitry; its overall function is to take a search word and return the matching memory location. However, such CAMs occupy chip area and consume power for these lookups.
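As a functional illustration of the CAM lookup just described, the following Python sketch models a CAM as parallel key/value arrays holding, for instance, (EA, RA) pairs. The class shape is an assumption for exposition only; a hardware CAM compares all entries concurrently in one cycle, which is precisely what costs chip area and power.

    class CAM:
        def __init__(self, size):
            self.keys = [None] * size     # search words (e.g., EAs)
            self.values = [None] * size   # associated data (e.g., RAs)

        def write(self, slot, key, value):
            self.keys[slot], self.values[slot] = key, value

        def lookup(self, search_word):
            # Hardware evaluates every comparison in parallel; this loop
            # models the same match function sequentially.
            for slot, key in enumerate(self.keys):
                if key == search_word:
                    return slot, self.values[slot]
            return None, None             # miss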
Illustrative embodiments of the solutions described herein improve upon these techniques by providing an Effective Address Directory (EAD), an Effective Real Table (ERT), and a Synonym Detection Table (SDT) that have the area efficiency of the GCT solution described above, but can also support wide-issue out-of-order pipelines without inhibiting performance. The solutions described herein also facilitate the processor operating with the EA alone, as long as the processor can avoid EA synonyms within an out-of-order (OoO) window. An OoO window is a set of instructions in an instruction pipeline of a processor. By avoiding EA synonyms in the OoO window, the processor reduces the chip area and power consumption devoted to address translation, because the processor can avoid translating EAs within the OoO window. This is because load-hit-store (LHS), store-hit-load (SHL), and load-hit-load (LHL) conditions no longer need to be detected using translated addresses for in-flight instructions, since EA synonyms no longer exist in the OoO window.
In other words, the technical solutions described herein solve this technical problem by policing EA aliasing within the OoO window, thereby shrinking the translation data structures and the load/store-port hardware. The solutions described herein thus facilitate a reduction in chip area by tracking only one address, the EA. In addition, the technical solutions facilitate the OoO processor operating in a 2-load-2-store mode with partitioned load/store queues, further reducing the CAM ports normally used for address translation.
Furthermore, if the OoO processor supports Multithreading (MT) operation, then for each thread operating out of order, the OoO processor must typically provide multiple CAM ports for each load/store queue in the load-store unit in order to convert an EA to an RA, and vice versa. For example, consider an OoO processor executing four threads in MT mode, each thread executing independent instructions simultaneously. In this case, the load-store unit (LSU) of the OoO processor typically uses four or more CAM ports for each load/store queue to translate effective addresses to real addresses and vice versa. Such multiple CAM ports for address translation occupy significant chip area and further consume power. The solutions described herein address this technical challenge of multiple CAM ports for multithreading.
One or more exemplary embodiments of the invention described herein address aspects of the technical challenges described herein by using a single CAM port for each load/store queue, thereby reducing the chip area and power devoted to address translation. For example, the exemplary embodiments of the invention described herein may facilitate an LSU being a multi-load and multi-store LSU with partitioned load/store queues, which facilitates reducing the number of CAM ports used for address translation. A "multi-load LSU" is an LSU that issues multiple load instructions simultaneously, on a separate pipeline for each load instruction. For example, a "2-load LSU" is an LSU that issues two load instructions simultaneously on two separate pipelines, LD0 and LD1. Similarly, a "multi-store LSU" is an LSU that issues multiple store instructions simultaneously, on a separate pipeline for each store instruction. For example, a "2-store LSU" is an LSU that issues two store instructions simultaneously on two separate pipelines, ST0 and ST1.
Turning now to FIG. 1, a block diagram of a system 100 that includes an Instruction Sequencing Unit (ISU) of an out-of-order (OoO) processor for implementing the technique of avoiding EA synonyms in an OoO instruction window is generally shown, in accordance with one or more embodiments of the present invention. The system 100 shown in FIG. 1 includes an instruction fetch unit/instruction decode unit (IFU/IDU) 106 that fetches and decodes instructions for input to a setup block 108, which prepares the decoded instructions for input to a mapper 110 of the ISU. According to one or more embodiments of the invention, the IFU/IDU 106 may fetch and decode six instructions from a thread at a time. According to one or more embodiments of the present disclosure, the six instructions sent to the setup block 108 may include six non-branch instructions; five non-branch instructions and one branch instruction; or four non-branch instructions and two branch instructions. In accordance with one or more embodiments of the invention, the setup block 108 checks whether sufficient resources, such as entries in the issue queues, completion table, mappers, and register files, exist before sending the fetched instructions to these blocks in the ISU.
The mapper 110 shown in FIG. 1 maps programmer instructions (e.g., logical register names) to physical resources (e.g., physical register addresses) of the processor. Various mappers 110 are shown in FIG. 1, including a Condition Register (CR) mapper; a link/count (LNK/CNT) register mapper; an integer exception register (XER) mapper; a unified mapper (UMapper) for mapping General Purpose Registers (GPRs) and Vector Scalar Registers (VSRs); an architected mapper (ARCH mapper) for mapping GPRs and VSRs; and a Floating Point Status and Control Register (FPSCR) mapper.
The output from the setup block 108 is also input to a Global Completion Table (GCT) 112 for tracking all instructions currently in the ISU. The output from the setup block 108 is also input to a dispatch unit 114 for dispatching instructions to an issue queue. The embodiment of the ISU shown in FIG. 1 includes a CR issue queue, CR ISQ 116, which receives and tracks instructions from the CR mapper and issues 120 them to an instruction fetch unit (IFU) 124 to execute CR logical instructions and move instructions. Also shown in FIG. 1 is a branch issue queue, branch ISQ 118, which receives and tracks branch instructions and LNK/CNT physical addresses from the LNK/CNT mapper. If a predicted branch address and/or direction is incorrect, branch ISQ 118 may issue an instruction to the IFU 124 to redirect instruction fetching.
Instructions output from the dispatch logic, along with renamed registers from the LNK/CNT mapper, XER mapper, UMapper (GPR/VSR), ARCH mapper (GPR/VSR), and FPSCR mapper, are input to issue queue 102. As shown in FIG. 1, issue queue 102 tracks dispatched fixed-point instructions (Fx), load instructions (L), store instructions (S), and Vector and Scalar Unit (VSU) instructions. As shown in the embodiment of FIG. 1, issue queue 102 is divided into two portions, ISQ0 1020 and ISQ1 1021, each holding N/2 instructions. When the processor executes in single-threaded (ST) mode, issue queue 102 may serve as a single logical issue queue containing ISQ0 1020 and ISQ1 1021 to process all instructions of a single thread (in this example, all N instructions).
When the processor is executing in Multithreading (MT) mode, ISQ0 1020 may be used to process N/2 instructions from a first thread, and ISQ1 1021 may be used to process N/2 instructions from a second thread.
As shown in FIG. 1, issue queue 102 issues instructions to execution units 104, which are divided into two groups of execution units, 1040 and 1041. The two groups of execution units 1040 and 1041 shown in FIG. 1 include full fixed-point execution units (Full FX0, Full FX1); load execution units (LU0, LU1); simple fixed-point, store-data, and store-address execution units (Simple FX0/STD0/STA0, Simple FX1/STD1/STA1); and floating-point, vector multimedia extension, decimal floating-point, and store-data execution units (FP/VMX/DFP/STD0, FP/VMX/DFP/STD1). LU0, Simple FX0/STD0/STA0, and FP/VMX/DFP/STD0 collectively form a Load Store Unit (LSU) 1042. Similarly, LU1, Simple FX1/STD1/STA1, and FP/VMX/DFP/STD1 form Load Store Unit (LSU) 1043. Together, the two LSUs 1042 and 1043 are referred to as the LSU of the system 100.
As shown in FIG. 1, when the processor executes in ST mode, the first group of execution units 1040 executes instructions issued from ISQ0 1020, and the second group of execution units 1041 executes instructions issued from ISQ1 1021. In an alternative embodiment of the invention, when the processor is executing in ST mode, instructions issued from ISQ0 1020 and ISQ1 1021 in issue queue 102 may be issued to execution units in either the first group of execution units 1040 or the second group of execution units 1041.
According to one or more embodiments of the invention, when the processor is executing in MT mode, the first group of execution units 1040 executes instructions of a first thread issued from ISQ0 1020, and the second group of execution units 1041 executes instructions of a second thread issued from ISQ1 1021.
The number of entries in issue queue 102 and the size of other elements (e.g., bus width, queue size) shown in FIG. 1 are intended to be exemplary in nature, as embodiments of the invention may be implemented for a variety of different sizes of issue queues and other elements. According to one or more embodiments of the invention, the size is selectable or programmable.
In one or more examples, the system 100 is an OoO processor in accordance with the illustrative embodiments. FIG. 2 is an example block diagram of a processor architecture of an OoO processor in which an Effective Address Directory (EAD) and the associated mechanisms for utilizing the EAD are implemented in accordance with one or more embodiments of the invention. As shown in FIG. 2, the processor architecture includes an instruction cache 202, an instruction fetch buffer 204, an instruction decode unit 206, and an instruction dispatch unit 208. Instructions are fetched from the instruction cache 202 by the instruction fetch buffer 204 and provided to the instruction decode unit 206. The instruction decode unit 206 decodes instructions and provides the decoded instructions to the instruction dispatch unit 208. Depending on the instruction type, the output of the instruction dispatch unit 208 is provided to the global completion table 210 and to one or more of the branch issue queue 212, the condition register issue queue 214, the unified issue queue 216, the load reorder queue 218, and/or the store reorder queue 220. The instruction type is determined through the decoding and mapping of the instruction decode unit 206. The issue queues 212-220 provide input to the various execution units 222-240. The data cache 250, and the register files contained within each respective unit, provide data for use with the instructions.
The instruction cache 202 receives instructions from the L2 cache 260 via the second-level translation unit 262 and the pre-decode unit 270. The second-level translation unit 262 uses its associated segment lookaside buffer 264 and translation lookaside buffer 266 to translate the address of the fetched instruction from an effective address to a system memory address. The pre-decode unit partially decodes instructions arriving from the L2 cache and augments them with unique identification information that simplifies the work of the downstream instruction decoders.
If the instruction is a branch instruction, the instruction fetched into instruction fetch buffer 204 is also provided to branch prediction unit 280. The branch prediction unit 280 includes a branch history table 282, a return stack 284, and a count cache 286. These elements predict the next Effective Address (EA) to be fetched from the instruction cache. A branch instruction is a point in a computer program where control flow is changed. It is a low-level machine instruction generated from a control construct in a computer program, such as an if-then-else or do-while statement. A branch may not be taken where the control flow is unchanged and the next instruction to be executed is the instruction immediately following in memory, or a branch may be taken where the next instruction to be executed is an instruction somewhere else in memory. If the branch is taken, a new EA needs to be provided to the instruction cache.
The EA and associated prediction information from the branch prediction unit are written to the Effective Address Directory 290. This EA is later confirmed by the branch execution unit 222. If correct, the EA remains in the directory until all instructions from this address region have completed their execution. If incorrect, the branch execution unit flushes the address and the corrected address is written in its place. The EAD 290 also includes logic to facilitate using the directory as a CAM.
Instructions that read from or write to memory (such as load or store instructions) are issued to the LS/EX execution units 238, 240. The LS/EX execution units retrieve data from the data cache 250 using a memory address specified by the instruction. This address is an effective address and needs to first be translated to a system memory address via the second-level translation unit before being used. If the address is not found in the data cache, the load miss queue is used to manage the miss request to the L2 cache. To reduce the penalty of such cache misses, the advanced data prefetch engine predicts the addresses that are likely to be used by instructions in the near future. In this manner, when an instruction needs the data, the data will likely already be in the data cache, thereby preventing a long-latency miss request to the L2 cache.
The LS/EX execution units 238, 240 execute instructions out of program order by tracking instruction ages and memory dependencies in the load reorder queue 218 and the store reorder queue 220. These queues are used to detect when out-of-order execution produces a result that is not consistent with an in-order execution of the same program. In such cases, the current program flow is flushed and performed again.
The processor architecture also includes the Effective Address Directory (EAD) 290, which maintains the effective addresses of a group of instructions in a centralized manner such that the effective addresses are available when needed, but are not required to be passed through the pipeline. In addition, the EAD 290 includes circuitry and/or logic to support out-of-order processing. FIG. 2 shows the EAD 290 being accessed via the branch prediction unit 280; however, it should be appreciated that circuitry may be provided for allowing various ones of the units shown in FIG. 2 to access the EAD 290 without having to go through the branch prediction unit 280.
Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, optical disk drives, and the like, may be used in addition to or in place of the hardware depicted in FIGS. 1-2. Also, the processes of the illustrative embodiments may be applied to a multiprocessor data processing system, other than the SMP system mentioned previously, without departing from the spirit and scope of the present invention.
Moreover, data processing system 100 may take the form of any of a number of different data processing systems including client computing devices, server computing devices, a tablet computer, laptop computer, telephone or other communication device, a Personal Digital Assistant (PDA), or the like. In some illustrative examples, data processing system 100 may be a portable computing device configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data, for example. Essentially, data processing system 100 may be any known or later developed data processing system without architectural limitation.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, apparatus or method. In an illustrative embodiment, the mechanisms are provided entirely in hardware, such as circuitry, hardware modules or units of a processor, and so forth. However, in other illustrative embodiments, the features and mechanisms of the illustrative embodiments may be provided or implemented using a combination of software and hardware. For example, the software may be provided in firmware, resident software, microcode, etc. The various flow diagrams set forth hereinafter provide a general description of the operations that may be performed by the hardware and/or a combination of hardware and software.
In the illustrative embodiments, where the mechanisms of the illustrative embodiments are implemented at least in part in software, any combination of one or more computer-usable or computer-readable media storing the software may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: random Access Memory (RAM), Read Only Memory (ROM), erasable programmable read only memory (EPROM or flash memory), and the like.
In general, for every load and every store instruction, the EA is translated into a corresponding RA. This EA-to-RA translation is also performed for instruction fetches (I-fetches). Such translation typically requires an effective-to-real address table (ERAT), for example when fetching instructions from lower-level memory. In the solutions described herein, the EA-to-RA translation is not performed for every load and store instruction, but only in the case of a load miss, an I-fetch miss, and for all stores.
By operating with the EA alone, this technique facilitates removing the RA bits (e.g., bits 8:51) from one or more data structures, such as the EA directory (also referred to as the L1 directory), LRQF entries, and LMQ entries. Furthermore, if only the EA is used, the SRQ LHS RA comparison logic is not needed. Removing these elements reduces the chip area used, thereby facilitating a smaller chip area than a typical processor.
Furthermore, by using only the EA, the solutions herein eliminate the ERAT content-addressable lookup (ERAT CAMing) on each load and store address generation. The solutions further eliminate switching of the RA bus throughout the unit and also avoid the fast SRQ LHS RA content-addressable lookups. Thus, by not performing the above operations, the solutions facilitate the processor consuming less power than a typical processor.
In addition, the technical solutions herein facilitate improving L1 latency. For example, by eliminating address translation, the solutions herein are at least one cycle faster than a typical processor that performs an EA-to-RA translation when determining the "final dval". Latency is also improved because "bad dval" conditions, such as a set-table multi-hit or a set-table hit/RA miss, are eliminated by using only the EA (no RA translation). In a similar manner, the present techniques help improve L2 latency.
A technical challenge of an EA-only LSU is handling snoops from the L2. For example, the LSU must be able to reverse-translate an RA into an EA. Thus, the technical solutions herein facilitate converting an RA-based snoop from the L2 into an EA-based snoop destined for the LSU subunits.
Furthermore, an EA-only LSU faces the technical challenge of handling same-thread synonyms (i.e., two different EAs from the same thread that map to the same RA). The technical solutions address this challenge using the Synonym Detection Table (SDT) or the ERT Eviction (ERTE) table described herein. For example, across the LHS, SHL, and LHL checks, L1-access synonyms are defined as follows:
Tid = w, EA(0:51) = x => RA(8:51) = z; and
Tid = w, EA(0:51) = y => RA(8:51) = z.
Thus, two different EAs correspond to the same RA. The solution described herein facilitates rejecting the synonym EA and restarting with the corresponding primary EA.
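A minimal sketch of this synonym handling follows, assuming a Python dictionary keyed by (thread, RA) that records the first ("primary") EA seen for each real page; the table shape and method names are illustrative assumptions, not the SDT's actual organization.

    class SynonymDetector:
        def __init__(self):
            self.by_ra = {}            # (tid, RA) -> primary EA

        def access(self, tid, ea, ra):
            # Record the first EA seen for this (tid, RA) as the primary.
            primary = self.by_ra.setdefault((tid, ra), ea)
            if primary != ea:
                # Two different EAs (x and y) map to the same RA(8:51)=z:
                # reject the synonym and restart using the primary EA.
                return ("reject_and_restart", primary)
            return ("proceed", ea)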
Referring again to the drawings, FIG. 3 illustrates a load-store unit (LSU) 104 of a processing core in accordance with one or more embodiments of the present invention. The depicted LSU 104 facilitates execution in a 2-load-2-store mode; it should be noted, however, that the solutions described herein are not limited to such an LSU. The execution flow of the LSU is described below. An EA (effective address, as used by a programmer in a computer program) is generated from a load or store instruction. Similarly, an EA is also generated for instruction fetches. Typically, for each instruction, the EA is translated into an RA (real address, as used by the hardware after EA-to-RA translation), which requires more chip area and frequent translations, among other technical challenges. The solutions described herein address such technical challenges by using only the EA (without converting to an RA) and generating the RA using the Effective Real Table (ERT) 255 only on load misses, I-fetch misses, and stores.
LSU 104 includes a load reorder queue (LRQF) 218, in which all load operations are tracked from dispatch to completion, similar to the LRQ 218 in a typical LSU design. LSU 104 also includes a second load reorder queue, LRQE 225. When a load is rejected (for a cache miss, a translation miss, or because a previous instruction it depends on was rejected), the load is taken out of the issue queue and placed in an LRQE entry, from which it is reissued. For the 2-load mode, the depicted LRQE 225 is divided into two instances, LRQE0 and LRQE1, each with 12 entries (24 entries total). In ST mode, there is no thread/pipe-based partitioning. In MT mode, T0 and T2 operations are relaunched on pipe LD0, and T1 and T3 operations are relaunched on pipe LD1. Here, Tx denotes thread x; e.g., T0 is thread 0, T1 is thread 1, T2 is thread 2, and T3 is thread 3. It should be noted that although the examples herein use four threads in MT mode, in other examples MT mode may include executing a different number of threads simultaneously, such as 8, 16, or any other number. In one or more examples, the number of threads in MT mode is configurable. Further, in the examples herein, LSU 104 uses two load pipes, LD0 and LD1; however, in other examples the number of pipes may be different, e.g., 3, 4, 8, etc. In one or more examples, LRQF 218 is divided into as many partitions as there are pipes.
As described, LRQF 218 is divided into two instances, LRQF0 and LRQF1, for the 2-load mode, each instance having 40 entries. LRQF 218 is a circular queue with in-order entry allocation, in-order entry drain, and in-order entry deallocation. Further, in MT mode, T0 and T2 operations are launched on pipes LD0 and ST0, and T1 and T3 operations are launched on pipes LD1 and ST1. In ST mode, LRQF 218 has no pipe/thread-based partitioning.
In one or more examples, for SMT4 mode, LRQF 218 (and the other structures described herein) is divided as follows: T0: LRQF0[0:19] circular queue; T1: LRQF1[0:19] circular queue; T2: LRQF0[20:39] circular queue; T3: LRQF1[20:39] circular queue.
In one or more examples, for SMT2 mode, LRQF 218 (and the other structures described herein) is divided as follows: T0: LRQF0[0:39] circular queue; T1: LRQF1[0:39] circular queue. Further, in one or more examples, for ST mode there is a single LRQF0[0:39] circular queue, with LRQF1 being a copy of LRQF0. For the other data structures, a similar partitioning scheme is used, in which the second instance is a copy of the first instance in ST mode.
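The mode-dependent partitioning described in the last two paragraphs can be summarized by a small helper; the Python sketch below is an assumption-laden rendering (the function name and mode strings are invented), with the entry ranges taken directly from the text.

    def lrqf_region(mode, thread=0):
        # Return the (LRQF instance, entry range) tracking a thread's loads.
        if mode == "ST":
            return ("LRQF0", range(0, 40))    # LRQF1 holds a copy of LRQF0
        if mode == "SMT2":
            return ("LRQF0" if thread == 0 else "LRQF1", range(0, 40))
        if mode == "SMT4":
            instance = "LRQF0" if thread in (0, 2) else "LRQF1"
            rows = range(0, 20) if thread in (0, 1) else range(20, 40)
            return (instance, rows)
        raise ValueError("unknown mode")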
In the case of a cross-invalidate flush (XI flush), for the LRQF, an NTC+1 flush is performed when an XI from another thread, or a store drain from any thread, hits the LRQF, so that LSU 104 does not perform an explicit L/L ordering flush for synonyms in the XI flush case.
All stores check against LRQF 218 for SHL detection, upon which LRQF 218 initiates a flush of everything (any instructions/operations) after the offending load or store. Furthermore, DCB instructions check against LRQF 218 for the SHL case, in which LRQF 218 causes a flush of everything after the load or the DCB. Furthermore, all loads check against LRQF 218 for LHL detection (sequential load consistency); when an LHL is detected, LRQF 218 causes a flush of the newer load, or of everything after the older load. In one or more examples, LRQF 218 provides quadword atomicity: an LQ instruction checks against LRQF 218 for quadword atomicity and the LQ is flushed if it is not atomic. Further, in the case of a LARX instruction, LSU 104 checks against LRQF 218 for the LARX-hit-LARX case and, in response, flushes the newer LARX instruction, or everything after the older LARX instruction.
Thus, LRQF 218 facilitates tracking all load operations from issue to completion. Entries in LRQF 218 are indexed by a Real_Ltag (rltag), which is the physical location in the queue structure. The age of a load operation/entry in LRQF 218 is determined using an in-order Virtual_Ltag (vltag). The LRQF flushes loads using a GMASK, and performs partial group flushes using a GTAG and an IMASK. The LRQF logic can flush from the current iTag or iTag+1, or from the exact load iTag.
Furthermore, the LRQF does not include the commonly used RA(8:51) field; rather, it is EA-based and includes ERT ID(0:6) and EA(40:51) fields (a 24-bit saving). LRQF page matching for SHL and LHL detection is based on the ERT ID. In addition, each LRQ entry has a Force Page Match bit. The force page match bit is set when an ERT ID matching the LRQ entry's ERT ID is invalidated. The LRQ will detect LHL, SHL, and store-ordering flushes involving any entry with a force page match of 1.
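The Force Page Match mechanism might be modeled as below, assuming dictionary-shaped LRQ entries; when an ERT ID is invalidated, every entry tagged with it conservatively reports a page match so the SHL/LHL/store-ordering checks still fire. This is an illustrative sketch, not the hardware logic.

    def invalidate_ert_id(lrq_entries, ert_id):
        # On ERT ID invalidation, mark every entry that carried that ID.
        for entry in lrq_entries:
            if entry["ert_id"] == ert_id:
                entry["force_page_match"] = True

    def pages_match(entry, ert_id, ea_40_51):
        if entry["force_page_match"]:
            return True                  # conservatively assume a match
        return entry["ert_id"] == ert_id and entry["ea_40_51"] == ea_40_51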
Thus, LRQF 218 addresses the technical challenge of multiple CAM ports occupying chip area and consuming power for address translation, by maintaining a load reorder queue that is partitioned for a predetermined number of OoO instructions and a predetermined number of threads that the processor can execute simultaneously.
The SRQ 220 of LSU 104 has a structure similar to that of LRQF 218, with two instances, SRQR0 and SRQR1, of 40 entries each, which are circular queues with in-order entry allocation, in-order entry drain, and in-order entry deallocation. Furthermore, SRQ 220 is partitioned similarly to LRQF 218: T0 and T2 ops are launched on pipes LD0 and ST0; T1 and T3 ops are launched on pipes LD1 and ST1; and there is no pipe/thread-based partitioning in ST mode. In ST mode, the two copies hold the same values, whereas the copies differ in MT mode. In SMT4 mode, the two instances are further partitioned, with each thread allocated 20 entries of the SRQ 220 (see the example LRQF partitioning described herein). In one or more examples, for store drain arbitration, intra-SRQ read-pointer multiplexing is performed in SMT4 mode. Alternatively, or in addition, inter-SRQ0/1 multiplexing is performed in SMT2 and SMT4 modes. In ST mode, draining is performed only from SRQ0.
Here, Tx denotes thread x; e.g., T0 is thread 0, T1 is thread 1, T2 is thread 2, and T3 is thread 3. It should be noted that although the examples herein use four threads in MT mode, in other examples MT mode may include executing a different number of threads simultaneously, such as 8, 16, or any other number. In one or more examples, the number of threads in MT mode is configurable. Further, in the examples herein, LSU 104 uses two store pipes, ST0 and ST1; however, in other examples the number of store pipes may be different, e.g., 3, 4, 8, etc. In one or more examples, SRQ 220 is divided into as many partitions as there are store pipes.
Each entry of SRQ 220 contains a store Tid(0:1), ERT ID(0:6), EA(44:63), and RA(8:51). To detect an LHS, the LSU uses {store Tid, EA(44:63)}, thereby eliminating the RA-based LHS alias check. The ERT ID is used to "catch" EA(44:63) partial-match mispredictions. The SRQ entry holds an RA(8:51) that is translated at store address generation and is used only when the store request is sent to the L2 (i.e., when the store instruction drains, not when it issues). Each SRQ entry also has a Force Page Match bit. The force page match bit is set when an ERT ID matching the SRQ entry's ERT ID is invalidated. The SRQ can detect an LHS against any entry with a force page match of 1; such an LHS causes the load instruction to be rejected. Further, a store drain forces a miss in the L1 cache if the SRQ entry's force page match is 1. This works in conjunction with the "extended store hit reload" LMQ action.
For example, for the LMQ, an LMQ address match is a match on {ERT ID, EA page offset(xx:51), EA(52:56)}. Further, the Force Page Match bit of each LMQ entry is set (= 1) when an ERT ID matching the LMQ entry's ERT ID is invalidated. The LMQ rejects a load miss if a valid LMQEntry[X].ForcePageMatch = 1 and the load-miss EA(52:56) = LMQEntry[X].EA(52:56). In addition, the LMQ has an extended store-hit-reload check. For example, the LMQ suppresses the reload enable if the reload EA(52:56) = SRQEntry[X].EA(52:56) and SRQEntry[X].ForcePageMatch = 1. Alternatively or additionally, the LMQ suppresses the reload enable if LMQEntry[X].EA(52:56) = StDrain.EA(52:56) and StDrain.ForcePageMatch = 1.
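The LMQ conditions above translate into the following hedged Python sketch; all field names are invented placeholders for the bits named in the text.

    def lmq_rejects_load_miss(lmq_entry, ld_miss_ea_52_56):
        # Reject a load miss against a valid entry with Force Page Match = 1
        # whose EA(52:56) matches.
        return (lmq_entry["valid"] and lmq_entry["force_page_match"]
                and ld_miss_ea_52_56 == lmq_entry["ea_52_56"])

    def lmq_suppresses_reload(reload_ea_52_56, srq_entry,
                              lmq_entry=None, st_drain=None):
        # Extended store-hit-reload: suppress the reload enable on a partial
        # EA(52:56) match when the matching entry has Force Page Match = 1.
        if (reload_ea_52_56 == srq_entry["ea_52_56"]
                and srq_entry["force_page_match"]):
            return True
        if lmq_entry is not None and st_drain is not None:
            return (lmq_entry["ea_52_56"] == st_drain["ea_52_56"]
                    and st_drain["force_page_match"])
        return False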
The depicted LSU 104 folds the Store Data Queue (SDQ) into the SRQ 220 itself to further save chip area. If the operand size is no larger than the SRQ entry size, e.g., 8 bytes, the operand is stored in the SRQ entry itself. For wider operands, such as vector operands that are 16 bytes wide, the SRQ uses two consecutive entries in the SRQ 220 to store the operand in MT mode. In ST mode, a wider operand is stored across SRQ0 and SRQ1, e.g., 8 bytes in each.
The SRQ 220 queues operations of store, barrier, DCB, ICBI, or TLB type. A single s-tag is used for both store_agen and store_data. The SRQ 220 handles load-hit-store (LHS) cases (same thread only). For example, all issued loads are checked by the SRQ 220 to ensure that there are no older stores with a data conflict. A data conflict is detected by comparing the load's EA and data byte flags against those of the older stores in the SRQ EA array.
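A sketch of that EA-only LHS check follows, assuming the caller supplies the SRQ entries older than the load and per-entry data-byte flags as a bitmask; everything here is an illustrative model of the comparison, not the actual SRQ circuit.

    def load_hits_store(load, older_stores):
        # older_stores: SRQ entries older than the load (same-thread check).
        for store in older_stores:
            if (store["tid"] == load["tid"]
                    and store["ea_44_63"] == load["ea_44_63"]
                    and store["byte_flags"] & load["byte_flags"]):
                return True          # LHS detected: reject the load
        return False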
An SRQ entry is allocated at dispatch, where the dispatched instruction tag (iTag) is written into the appropriate row, and deallocated when the store drains. In one or more examples, an iTag array holds "overflow" dispatches. For example, if the desired row in the SRQ, say SRQ entry x, is still in use at dispatch time, the information is written into the overflow iTag array instead. When SRQ entry x is deallocated, its corresponding row in the SRQ overflow iTag structure is read out and copied into the main SRQ iTag array structure (the read of the overflow iTag structure is gated by whether there are any valid entries in the overflow iTag array for a given thread/region). The main SRQ0/1 iTag arrays are CAMmed (or 1/2 CAMmed in SMT4) to determine which physical row to write when a store issues, since the ISU issues stores by iTag. When the store is drained and deallocated, the SRQ 220 sends the iTag to the ISU.
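The overflow mechanism can be sketched as follows (Python; the two dictionaries model the main and overflow iTag arrays, and all names are illustrative assumptions):

    # Sketch of SRQ iTag bookkeeping with an overflow array.
    main_itag = {}      # physical SRQ row -> iTag currently owning the row
    overflow_itag = {}  # physical SRQ row -> iTag of a parked "overflow" dispatch

    def dispatch_store(row, itag):
        if row in main_itag:
            overflow_itag[row] = itag    # desired row still in use: park it
        else:
            main_itag[row] = itag

    def deallocate_store(row):
        # On store drain, free the row; if an overflow dispatch was parked,
        # copy it down into the main iTag array.
        done = main_itag.pop(row)
        if row in overflow_itag:
            main_itag[row] = overflow_itag.pop(row)
        return done                      # iTag reported back to the ISU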
FIG. 4 is an exemplary block diagram of an effective address directory (EAD) structure (L1 cache) 290 in accordance with one illustrative embodiment. In one or more examples, the EAD is part of the LSU 104. The EAD 290 includes one or more entries, e.g., entry 0 through entry N, where each entry includes a plurality of information fields regarding a group of one or more instructions. For example, in one illustrative embodiment, each entry in the EAD 290 may represent between 1 and 32 instructions. An entry in the EAD 290 is created in response to a fetched instruction that falls in a new cache line of the processor cache (e.g., the L2 cache 260 of FIG. 2). The entry in the EAD 290 is updated as additional instructions are fetched from the cache line. Each entry of the EAD 290 is terminated on a taken branch (i.e., a branch instruction fetched from the cache that resolves as "taken"), a cache line crossing (i.e., the next fetched instruction is in a different cache line than the current cache line), or a flush of the processor pipeline (e.g., when a branch misprediction occurs).
The fields of an EAD 290 entry include a base effective address 310, a first instruction identifier 320, a last instruction identifier 330, a close identifier 340, a global history vector field 350, a link stack pointer field 360, a branch taken identifier 370, and a branch information field 380. The EAD 290 is organized like the L1 data cache, with a set-associative organization. For example, in one or more examples, it has 32 indices selected by EA(52:56), is 8-way set associative, and is tagged by EA(0:51).
The base effective address 310 is the starting effective address (EA) of the instruction group. Each instruction in the group shares the same base EA plus an offset from it. For example, in one illustrative embodiment, the EA is a 64-bit address comprising bits 0:63. In one illustrative embodiment, the base EA may include bits 0:56 of the EA, with bits 57:61 representing the offset from the base EA for a particular instruction within the group. Bits 62 and 63 point to a specific byte of each instruction. In the illustrative embodiment, each address references an instruction that is 32 bits (i.e., 4 bytes) long, where each byte in memory is addressable. An instruction cannot be further divided into addressable subparts, so an instruction address always has bits 62 and 63 set to zero. Thus, bits 62 and 63 need not be stored, and the EAD may always assume them to be zero.
The first instruction identifier field 320 stores the effective address offset bits, e.g., bits 57:61 of the EA, of the first instruction in the instruction group to which the EAD 290 entry corresponds. The combination of the base EA from field 310 and the effective address offset bits in field 320 provides the EA of the first instruction in the group represented by the EAD 290 entry. As discussed below, this field 320 may be used, for example, to recover the refetch address and branch prediction information if the pipeline is flushed.
The last instruction identifier field 330 stores the effective address offset bits, e.g., bits 57:61 of the EA, of the last instruction in the instruction group to which the EAD 290 entry corresponds. The EAD logic updates this field as additional instructions in the group represented by the EAD 290 entry are fetched. The EAD logic stops updating field 330 of a particular EAD 290 entry once the entry is closed, e.g., when a cache line crossing or taken branch is found. This field remains intact unless a pipeline flush clears a portion of the EAD entry, in which case the EAD logic updates the field to store the effective address offset bits of the instruction that, as a result of the flush, is now the last instruction in the entry. This field is ultimately used at completion, as described below, to release entries in the EAD 290.
The close identifier field 340 is used to indicate that the EAD 290 entry has been closed and that no further instruction fetches will fetch instructions of the instruction group corresponding to the EAD 290 entry. An EAD 290 entry may be closed for a variety of reasons, including a cache line crossing, a taken branch, or a flush of the pipeline. Any of these conditions may result in the value in the close field 340 being set to indicate that the EAD entry is closed, e.g., set to "1". This field 340 is used to release entries in the EAD 290 at completion, as discussed in greater detail below.
Global history vector field 350 identifies the global history vector of the first instruction fetch group that created the entry in EAD 290. The global history vector is used to identify the history of whether a branch is taken or not taken, as discussed in more detail below. The global history vector is used for branch prediction purposes to help determine whether a current branch is likely to be taken based on the recent history of branches being taken or not taken.
The link stack pointer field 360 identifies the link stack pointer of the first instruction fetch group that created the entry in the EAD 290. The link stack pointer is another branch prediction mechanism, which will be described in more detail below.
The branch taken field 370 indicates whether the group of instructions corresponding to the EAD 290 entry includes a branch instruction whose branch is taken. The value in the branch taken field 370 is updated in response to a branch instruction of the group being predicted taken. In addition, once a branch among the instructions of the EAD 290 entry is taken, the entry is also closed by writing the appropriate value to the close field 340. Since the branch taken field is written speculatively at prediction time, it may need to be corrected when the branch actually executes. For example, a branch may be predicted not taken, in which case "0" is written to the branch taken field; if the branch later turns out to be taken, the field must be corrected by writing the value "1". This second write occurs only if the branch was mispredicted.
The branch information field 380 stores various branch information used to update the branch prediction structure when a branch is resolved or the architectural EA state when a branch instruction completes.
The ERT_ID field 385 stores an index into the ERT table (described further below) that identifies the corresponding ERT entry. When an ERT entry is invalidated, the associated ERT_ID is invalidated, which also invalidates all related entries in the L1 I-cache and L1 D-cache.
Entries in the EAD 290 are accessed using an effective address tag (eatag) that comprises at least two parts: a base eatag and an eatag offset. In one illustrative embodiment, the eatag is a 10-bit value, considerably smaller than the 64-bit effective address. In one exemplary implementation, for a 14-entry EAD 290, the eatag consists of a first 5 bits (referred to as the base eatag) for identifying an entry within the EAD 290 and a second 5 bits (referred to as the eatag offset) for providing the offset of a particular instruction within the group of instructions represented by that entry. The first of the 5 bits identifying an entry within the EAD 290 may be used as a wrap bit indicating whether a wrap occurred when going from the topmost entry to the bottommost entry of the EAD 290; this may be used for age detection. The second through fifth of those 5 bits may be used to index into the EAD 290 to identify the base EA of the instruction, i.e., EA(0:56). The 5-bit offset value may be used to provide, for example, bits 57:61 of the effective address of the particular instruction. This example eatag is illustrated below:
eatag(0:9) = line(0:4) || offset(0:4)
line(0): a wrap bit for the EAD, indicating whether a wrap occurred when going from the topmost entry to the bottommost entry of the EAD.
line(1:4): an index into the 14-entry EAD, used to determine the EA(0:56) of the instruction.
offset(0:4): bits 57:61 of the EA of the instruction.
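For illustration, the composition and decomposition of this eatag can be sketched as bit manipulation (Python; the function names are illustrative assumptions):

    # Sketch: pack and unpack the 10-bit eatag = line(0:4) || offset(0:4).
    def make_eatag(wrap, index, offset):
        line = ((wrap & 0x1) << 4) | (index & 0xF)  # line(0) = wrap, line(1:4) = index
        return (line << 5) | (offset & 0x1F)        # low 5 bits = offset(0:4)

    def split_eatag(eatag):
        offset = eatag & 0x1F            # EA(57:61) of the instruction
        line = (eatag >> 5) & 0x1F
        wrap = (line >> 4) & 0x1         # wrap bit for age detection
        index = line & 0xF               # selects the EAD entry, i.e., EA(0:56)
        return wrap, index, offset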
FIG. 5 illustrates an exemplary effective real table (ERT) structure in accordance with one or more embodiments of the invention. In one or more examples, the ERT 255 includes a total of 128 entries; it should be noted that in other examples the total number of entries may be different, and may be selectable or programmable. Further, where the LSU 104 uses multiple pipelines, each pipeline has a separate partition in the ERT 255. In one or more examples, the predetermined maximum number of entries in the ERT 255 is split equally between the pipelines. For example, with two pipelines (i.e., two instructions handled in parallel), the LSU maintains two partitions of the ERT 255, e.g., ERT0 and ERT1, each with 64 (half of the) entries. For example, LD0 and ST0 use ERT0, while LD1 and ST1 use ERT1. In ST mode, the first partition of the ERT 255 is used and the other partition is a copy of it, e.g., ERT0 is used and ERT1 is a copy of ERT0. Alternatively, in one or more examples, if the LSU uses a single load pipe and a single store pipe, the entire ERT 255 is used as a single partition. The description below applies to any partition of the ERT 255 unless otherwise noted.
The ERT 255 holds a valid entry for, in general, any active page present in the L1 I-cache or D-cache directory (EAD 290), or referenced by SRQ, LRQF, or LMQ entries. In other words, the ERT 255 is a table of all RPNs active in the LSU and IFU (L1 D-cache, SRQ, LRQE, LRQF, LMQ, I-cache). In one or more examples, if the processor 106 is operating in ST mode, all entries in the ERT 255 are used by the single executing thread. Alternatively, in one or more examples, the entries in the ERT 255 are divided into sets, and in ST mode each set has the same contents. For example, if the ERT 255 has 128 entries in total and supports a maximum of two threads, then when the processor operates in ST mode the ERT 255 includes two sets of 64 entries each, with identical contents.
Alternatively, if the processor 106 is operating in MT mode, the ERT entries are divided among the executing threads. For example, with two threads, the ERT entries are divided into two equal sets, a first set associated with the first thread and a second set associated with the second thread. For example, one copy, ERT0, serves LD0-pipe L1 misses, ST0-pipe launches, and T0/T2 I-fetches, handling T0 in SMT2 mode and T0/T2 in SMT4 mode; the other copy, ERT1, serves LD1-pipe L1 misses, ST1-pipe launches, and T1/T3 I-fetches, handling T1 in SMT2 mode and T1/T3 in SMT4 mode.
In one or more examples, each ERT entry includes at least the following fields: ERT_ID(0:6), Tid_en(0:1), page size(0:1), EA(0:51), and RA(8:51). The ERT_ID field is a unique index for each ERT entry; for example, the ERT_ID may be a sequence number that identifies the ERT entry. The ERT_ID is stored in the ERT_ID field 385 of the EAD 290, as well as in other data structures used by the LSU. The Tid_en field indicates whether the entry is enabled for use in MT mode and, in one or more examples, indicates the thread identifier(s) of the instruction(s) using the ERT entry. Further, the page size indicates the size of the memory page referenced by the ERT entry, and the RA holds the real address associated with the ERT entry.
The LSU references the ERT 255 only when an RA is needed to complete execution of an instruction. As described herein, the ERT 255 is queried by the LSU for the following four functions: 1. an I-fetch, load, or store miss in the L1 cache; 2. a store from another thread in the core; 3. a snoop (XI) from another core; and 4. TLB and SLB invalidations.
In the first case, an I-fetch, load, or store miss in the L1 cache, the EA and thread_id are used to index into the ERT 255, and if there is a valid ERT entry, the RA from that entry is sent to the L2 cache. On an ERT miss, i.e., when there is no valid ERT entry for the EA and thread_id, the SLB/TLB is used.
In the second case, a store from another thread in the core, the store drained from the SRQ checks the ERT 255 and the ERTE table (described further below) for a hit from another thread. If there is no hit from a different thread, then no load from another thread uses the same RA. In the rare case that there is a hit from a different thread using the same RA, the LSU checks the LRQ. The LSU looks up the ERT table 400 to find the EAs associated with the common RA, and the EA is then used to look for a match in the LRQ (any store issue in this cycle is rejected). The LRQs are partitioned per thread, so the LSU examines only the LRQ of the relevant thread. If there is a matching load in the LRQ, the LSU flushes the oldest matching load.
In the third case, a snoop from another core of the processor, the LSU operates similarly to the second case and checks for a hit from any other executing thread. In the fourth case, when the TLB/SLB is invalidated, the ERT 255 is invalidated as well.
FIG. 6 illustrates a flow diagram of an exemplary method for accessing memory for execution of instructions by the LSU in accordance with one or more embodiments of the invention. The instruction may be a load, a store, or an instruction fetch for the OoO processor 106. Upon receiving the instruction, the LSU uses the parameters of the instruction to check whether the EAD 290 has an entry corresponding to the instruction, as shown at 505 and 510. In one or more examples, the parameters checked include a thread identifier, a page size, an EA, and so on.
If the LSU gets an EAD hit in the EAD 290, i.e., the EA of the instruction matches an entry in the EAD table 300, the LSU reads the contents of the matching EAD entry to determine the corresponding ERT entry, as shown at 520. Each EAD entry contains an ERT_ID(0:6) field 385. As previously described, when an ERT entry is invalidated, the associated ERT_ID is invalidated, which also invalidates all associated entries in the EAD table 300. Thus, an EAD hit implies an ERT hit, since the ERT_ID field 385 can be used to find the ERT entry for the load/store instruction. Accordingly, on an EAD hit, after identifying the corresponding EAD entry, the LSU reads the ERT_ID out of the EAD entry and sends it to the SRQ, LMQ, and/or LRQF, as shown at 530; the SRQ, LMQ, and/or LRQF use the EA from the identified EAD entry. In the case of a store instruction, which uses the RA, the RA from the ERT entry is read out for the L2 access, as shown at 540 and 545. Because the RA is not used anywhere else except for store instructions, a core implementing the subject technology is referred to as an EA-only core.
Now consider the case where an instruction misses the EAD 290, i.e., the EA of the instruction has no matching entry in the EAD table 300. The thread_id and EA are compared against each entry of the ERT 255, as shown at 550. If an ERT hit occurs, i.e., an ERT entry matches the parameters, the LSU reads the RA(8:51) from the ERT entry, as shown at 555 and 530. For load requests, the LSU sends the RA to the L2 cache for the access, at 530. For store instructions, the LSU stores the RA in the SRQ and then sends the RA to the L2 cache when the store drains to the L2 cache, as shown at 540 and 545.
If an ERT miss occurs, the LSU initiates a reload of the ERT 255, as shown at 555 and 560, and an ERT entry replacement is initiated. ERT entry replacement is LRU-based, and the LSU ensures that synonyms in the out-of-order window are tracked during this process.
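Condensed into pseudocode, the FIG. 6 flow looks roughly as follows (Python; the dictionary-based ead/ert structures, the 4 KB page granule, and the reload_ert helper are illustrative assumptions):

    # Sketch of the EAD-first, ERT-second translation flow.
    PAGE_MASK = ~0xFFF  # assume a 4 KB page granule for illustration

    def translate(ead, ert, thread_id, ea):
        entry = ead.get((thread_id, ea & PAGE_MASK))
        if entry is not None:
            # EAD hit implies ERT hit: forward the ERT_ID to SRQ/LMQ/LRQF;
            # the RA is read out only for the L2 access (e.g., store drain).
            return ert[entry["ert_id"]]["ra"]
        for e in ert.values():           # EAD miss: CAM the ERT with tid + EA
            if e["tid"] == thread_id and e["ea_page"] == ea & PAGE_MASK:
                return e["ra"]
        return reload_ert(thread_id, ea) # ERT miss: SLB/TLB walk, then reload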
Thus, with the above method, no address translation is performed for a load when there is an EA hit in the EA-based L1 directory. This improves upon typical processors, where the L1 directory is RA-based and a load miss in the L1 directory results in the EA being sent to an ERAT table for translation, with the resulting RA sent to the L2 directory and beyond.
Further, for stores, using the methods described herein, the LSU walks the ERT table to determine the RA, which is then held in the SRQR so that the store can be drained to the cache (L1, L2, memory) when it drains from the SRQ. The SRQR holds the RAs of all stores. The RA is kept only for draining to the nest (i.e., the L2, memory, and other parts of the memory subsystem); it is not used for load-hit-store, store-hit-load, or load-hit-load out-of-order hazard detection, as is done in typical solutions. The RA computation for a store occurs before the store completes, because after completion the LSU cannot process any interrupt for the store (a store may generate address-translation-related interrupts, which must be handled before the store completes). Here, the RA computation is done when the store issues, so the LSU need not perform address translation at drain time (from the SRQR). Thus, stores are issued and executed out of order, completed in order, and then drained from the SRQ in order. Before a store is drained, no other thread or core knows about it (only the current thread). After the store drains from the SRQ, it is written to the L1 (if the line is already present in L1) and to the L2 cache (if caching is enabled), and at that point the store becomes visible to all other threads and cores in the system 100.
For instruction fetches that miss the EA-based L1 I-cache, the EA is translated to an RA using the ERT 255, and the RA is sent to the nest to fetch the I-cache line. Here, LHS (load-hit-store), SHL (store-hit-load), and LHL (load-hit-load) hazards are determined based on the EA and the ERT index, which are stored with the directory entries in the EA-based L1 cache (EAD 290). All entries in the EAD table 300 have their translations in the ERT table 400, which can be relied upon once LHS, SHL, and LHL are determined. If an ERT entry is invalidated, the corresponding L1 cache entries are invalidated.
The LRQF is a load-reorder queue that ensures all load operations are tracked from dispatch to completion. When a load is rejected (due to a cache miss, a translation miss, or a dependency on a previous instruction), the load is removed from the issue queue and placed in the LRQE, from which it is reissued.
FIG. 7 shows a flow diagram of a method for reloading the ERT in accordance with one or more embodiments of the invention. An ERT reload creates or updates an entry in the ERT in response to, and based on, an ERT miss. The ERT receives the RA to be added to the ERT 255 and compares it against each entry in ERT0 and ERT1, as shown at 605. If the RA is not present in the ERT 255, and a new entry can be created, the ERT 255 creates a new entry with a new ERT_ID to hold the RA, as shown at 610 and 615. The new entry is created in ERT0 or ERT1 according to whether the executing thread is the first or the second thread, respectively; if the processor is operating in ST mode, ERT0 is updated. If the ERT 255 has no free slot for the new entry, an existing entry is replaced based on least-recently-used or a similar policy, as shown at 615.
If an existing entry in the ERT 255 is found to have the same RA as the received (reload) RA, the ERT 255 compares the page size(0:1) of the existing entry with that of the received RA, as shown at 620. If the page size of the existing entry is smaller than that of the reload RA, the existing entry is deleted from the ERT 255 and a new entry with a new ERT_ID is added for the RA with the larger page size, as shown at 625. If the existing entry has the same or a larger page size, and the implementation uses an SDT, an entry is created in the SDT for the reload RA, as shown at 627. It should be noted that this operation may not be performed where the LSU uses the ERTE instead.
If the page size of the existing entry is the same as that of the reload RA, the ERT 255 checks whether the existing entry is in the local ERT of the executing thread, as shown at 630. Here, the local ERT is the ERT instance associated with the executing thread, e.g., ERT0 for the first thread and ERT1 for the second thread. If the RA hit is in the other, non-local ERT, the ERT 255 creates a new entry in the local ERT with an ERT_ID matching the ERT_ID in the non-local ERT, as shown at 632. For example, if an instruction executing on thread-0 hits an RA in ERT1, an entry is created in ERT0 with an ERT_ID matching that of the entry in ERT1.
If the RA hits on the local ERT instance and the EA also matches, then, because both the EA and RA match an existing entry yet this thread took an ERT miss that prompted the reload, the ERT concludes that two threads are sharing the same EA-RA mapping (with the same page size). Accordingly, as shown at 634, the Tid_en(0) or Tid_en(1) bit corresponding to the reloading thread is turned ON in the existing matching entry to record this condition.
If the RA hits on the local ERT instance but the EA does not match the existing entry, and the existing entry is for the same thread as the reload RA, the ERT identifies an aliasing case in which two different EAs from the same thread map to the same RA, as shown at 636. If the processor uses an SDT-based implementation, a synonym entry is installed in the SDT, mapping to the ERT ID and EA offset (40:51) of the existing matching entry. If the processor uses an ERTE-based implementation, the LSU rejects the instruction until it is non-speculative, at which point it evicts the existing entry from the ERT (into the ERTE) and adds the new entry to the ERT.
If the RA hits on the local ERT instance, the EA does not match the existing entry, and the existing entry is for a different thread, the ERT identifies an aliasing case in which two EAs from different threads map to the same RA, as shown at 638. If the processor uses an SDT-based implementation, a synonym entry is installed in the SDT, mapping to the ERT ID and EA offset (40:51) of the existing matching entry. If the processor uses an ERTE-based implementation, a new local ERT entry is added with a new ERT ID, with Tid_en set only for the thread that took the ERT miss.
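The decision tree of FIG. 7 can be condensed into the following sketch (Python, SDT-based variant only; it ignores the local/non-local instance distinction, and the install_*/evict helpers are illustrative assumptions):

    # Sketch of the ERT reload decisions on an ERT miss.
    def ert_reload(ert, sdt, tid, ea, ra, page_size):
        match = next((e for e in ert if e["ra"] == ra), None)
        if match is None:
            install_ert(ert, tid, ea, ra, page_size)   # new entry (LRU victim if full)
        elif page_size > match["page_size"]:
            evict(ert, match)                          # keep the larger page size
            install_ert(ert, tid, ea, ra, page_size)
        elif ea == match["ea"]:
            match["tid_en"][tid] = True                # threads share one EA-RA mapping
        else:
            # Synonym: a second EA maps to the same RA; future launches of
            # this EA are redirected to the EA already installed in the ERT.
            install_sdt(sdt, orig_tid=tid, orig_ea=ea,
                        replace_ea=match["ea"], ert_id=match["ert_id"])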
The above approach allows, in an ERTE-based implementation, two threads with the same RA but different EAs to use two different ERT entries; in an SDT-based implementation, when two threads have the same RA but different EAs, one of the translations uses an ERT entry while the other uses an SDT entry. The Tid_en field in an ERT entry allows the same EA and same RA to be shared across different threads; for example, Tid_en(0:1) = {Tid0_en, Tid2_en} on the ERT0 instance and Tid_en(0:1) = {Tid1_en, Tid3_en} on the ERT1 instance. Further, the case where the same EA corresponds to different RAs on different threads is handled by keeping multiple entries in ERT0 and ERT1 with their respective thread identifiers. ERT entries also support different EAs corresponding to the same RA (same- or different-thread cases). Both cases are now described further, according to whether the implementation uses the ERTE or the SDT.
In the case of an LSU implementation that uses the SDT, when a new instruction with a different EA corresponding to the same RA is detected at ERT reload, the LSU installs an entry in the SDT (synonym detection table) rather than in the ERT 255. On an SDT hit, the instruction is restarted with the EA of the original (earlier) ERT entry. If the new synonym's page size is larger than the page size of the existing ERT entry with the matching RA, the existing ERT entry is replaced with the new synonym (with the larger page size) instead of installing the synonym in the SDT; the old ERT entry is then eventually reinstalled in the SDT as the synonym.
Alternatively, in an implementation in which the LSU uses the ERTE, if the instructions with different EAs corresponding to the same RA are from different threads, the LSU installs a new entry in the ERT table with the appropriate Tid_en bit enabled. If the instructions are from the same thread, the LSU rejects the load/store until it is non-speculative. At that point, the LSU removes the existing ERT entry and places it in the ERTE table, tagged with the ITAG of the youngest in-flight instruction of that thread, and installs the new EA-RA pair in the ERT table 400. This ensures that two different EAs from the same thread are never simultaneously mapped to the same RA.
Further, referring back to the ERTE case, consider the case where the LSU receives a snoop from another core of the processor 106. The snoop may come from a different core in the system (indicating that another core or thread changed data at the same real address). The LSU also treats stores from other threads in the core as potential snoops against the remaining threads. All snoops (from other cores) and stores (from other threads in the core) arrive with an RA. The LSU reverse-translates the RA, based on the ERT 255, to determine the corresponding EA, ERT_ID, and page size. It compares this information against the ERT_ID, page size, and EA(40:56) stored in each of the tracking structures to detect snoop hits and take the appropriate action. For example, if a snoop hit is detected in an LRQF entry, the LSU flags a potential load-hit-load out-of-order hazard. If a snoop hit from a different core is detected in the EAD 290, the LSU initiates an L1 invalidation. If the hit is a store from another thread sharing the line, the line simply receives the new store data and is updated.
Thus, the solution described herein reduces the chip area of the LSU by tracking only one address, the EA. In addition, it enables the processor core to operate in a 2-load, 2-store mode with partitioned load-store queues, further reducing the CAM ports needed for translation and the power consumed by translation. Furthermore, by using only the EA, no translation to an RA is performed in the load/store path unless an EAD miss occurs, and timely detection of hazards such as LHL, SHL, and LHS, as well as suppression of DVAL, does not create timing problems. Because the LSU uses only the EA to detect LHS, SHL, and LHL, a hazard could be missed when two different EAs map to the same RA; the solution addresses this technical challenge by using the EA together with the ERT index from the EAD and, upon detecting an EA synonym, by handling the instruction through the SDT or the ERTE table for instructions in the OoO window.
If the LSU uses the SDT (as opposed to the ERTE) and there is a snoop hit in the LMQ, the LSU also updates the LMQ entry so that the reload is not stored into the L1 D-cache. SRQ entries are not used for snooping in the SRQ; they are used only for the LHS EA-miss/RA-hit pattern check. A new SDT entry is created for the snoop hit.
FIG. 8 illustrates an example structure of a synonym detection table (SDT) 700, according to one or more embodiments of the invention. The depicted example shows 16 entries; it should be noted that in other examples the SDT 700 may include a different number of entries. The SDT 700 is common to the multiple threads and pipes of the LSU 104; for example, LD0, LD1, ST0, and ST1 all access entries in the SDT 700, which has no separate per-pipe partitions.
The entries in the SDT 700 include at least the following fields: the original (launch) address {Tid(0:1), EA(0:51)}, the page size(0:1) (e.g., 4 KB, 64 KB, 2 MB, 16 MB), and the restart (replacement) address {EA(40:51), ERT ID(0:6)}. The Tid (thread identifier) field indicates which thread of the OoO processor is executing the instruction associated with the SDT entry. When an instruction that misses the L1 is launched, the LSU compares it against the SDT 700. If the launched instruction gets an SDT hit on the original-address comparison, the LSU rejects the instruction and restarts it with the corresponding replacement address from the SDT entry. For example, the LSU uses the replacement Addr(40:51) for the SRQ LHS check and "forces a match" on the ERT ID in the execution pipeline.
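A minimal sketch of that launch-time check (Python; the entry fields and the return convention are illustrative assumptions):

    # Sketch: compare a launched L1-miss instruction against the SDT.
    def sdt_lookup(sdt, launch_tid, launch_ea):
        for entry in sdt:
            if entry["orig_tid"] == launch_tid and entry["orig_ea"] == launch_ea:
                # SDT hit: reject the launch and restart with the replacement
                # address; the ERT ID is force-matched in the execution pipe.
                return {"replace_ea_40_51": entry["replace_ea_40_51"],
                        "ert_id": entry["ert_id"]}
        return None  # no synonym recorded: proceed with the original address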
Entries are added to the SDT 700 during ERT reload, as described herein. For example, during an ERT reload, the reload RA is compared against the valid ERT entries. If an ERT entry with a matching RA already exists and this is not the EA-hit case (in which only an additional Tid_en bit is set in the original ERT entry), the EA(32:51) of the existing ERT entry is read out and an entry is installed into the SDT 700 instead of adding one to the ERT 255.
Because the SDT 700 has a limited number of entries, entries are replaced, in one or more examples based on a least-recently-used (LRU) policy or another ordering. In one or more examples, if an SDT entry is replaced, a subsequent launch of the synonym address re-triggers the SDT entry installation sequence. In addition, SDT entries whose ERT IDs match an invalidated ERT entry are cleared by CAM.
FIG. 9 shows a flow diagram of a method for performing an ERT and SDT EA interchange in accordance with one or more embodiments of the invention. In one or more examples, the LSU performs the interchange when the ERT and SDT entries have the same page size. The interchange improves the efficiency of the processor 106 when different EAs correspond to the same RA across different instructions on the same or different threads. For example, consider two instructions X and Y such that EAx → RAz and EAy → RAz. If EAx misses the ERT first, i.e., before EAy, the LSU installs an ERT entry mapping EAx to RAz as described herein. When EAy later misses the ERT, the LSU CAMs the ERT with RAz, gets an RA hit, and installs an entry in the SDT 700 with original address EAy and replacement address EAx.
Now, if most subsequent accesses to RAz use EAy, the LSU ends up using the SDT more often than the EAD itself. In one or more examples, to reduce such frequent trips to the SDT, each SDT entry is given an incrementing counter. As shown in FIG. 9, the LSU launches an instruction whose ERT ID matches the ERT ID of an SDT entry, as shown at 810. On an SDT-entry ERT ID match, the LSU also compares the EA of the launched instruction with the original EA in the SDT entry, as shown at 820. If the SDT entry's original address matches the EA of the instruction, the entry's counter is incremented, as shown at 830 and 835. If the launched instruction's EA differs from the SDT entry's original address, the counter is reset, as shown at 840.
In one or more examples, the counter is a 4-bit field, implying a maximum value of 15. It should be appreciated that in other examples the field has a different length and/or a different maximum value, which serves as the threshold. After the instruction has been launched, the counter value is compared with the threshold, as shown at 845 and 850. If the counter is below the threshold, the LSU continues to operate as described above. If the counter exceeds (or, in some cases, equals) the threshold, the LSU invalidates the ERT entry corresponding to the SDT entry, as shown at 860; that is, the ERT entry whose ERT ID appears in the SDT entry is invalidated. Invalidating the ERT entry invalidates the corresponding entries in the EA directory, LRQF, LMQ, and SRQ.
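The counter mechanism of FIG. 9 reduces to a few lines (Python; the threshold value and the invalidate_ert_entry helper are illustrative assumptions):

    # Sketch of the per-SDT-entry counter that detects a dominant synonym.
    THRESHOLD = 15  # illustrative; a 4-bit counter saturates at 15

    def on_launch(sdt_entry, launch_ea, launch_ert_id, ert):
        if launch_ert_id != sdt_entry["ert_id"]:
            return
        if launch_ea == sdt_entry["orig_ea"]:
            sdt_entry["counter"] = min(sdt_entry["counter"] + 1, 15)
        else:
            sdt_entry["counter"] = 0     # another EA used: reset the counter
        if sdt_entry["counter"] >= THRESHOLD:
            # The synonym EA dominates: invalidate the ERT entry so the next
            # reload installs the frequently used EA directly in the ERT.
            invalidate_ert_entry(ert, sdt_entry["ert_id"])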
Furthermore, the LSU addresses, in the following manner, the technical challenge of exceptions that require a launched instruction to finish with its original EA. For example, consider the case where a launched instruction gets an SDT hit and would restart with the replacement address from the SDT entry instead of the original launch address, but an exception occurs that requires finishing with the original EA. This may occur with DAWR/SDAR and the like.
An LSU implementing the solution described herein addresses this challenge by keeping the original address queued in the LRQE. The LRQE also holds, for each entry, an SDT hit flag (bit) and an SDT index(0:3). On a restart, the SDT index is read one cycle ahead to obtain the replacement address, and the LRQE multiplexes between the LRQE entry address (the original address) and the SDT replacement address (read from the SDT) before restarting. For exception cases such as those described above that must finish with the original address, the LRQE additionally holds, for each entry, an SDT hit override flag (bit), set on a DAWR partial match or the like. An LRQE restart that has an SDT hit but must finish with an exception forces the original address to be launched. SRQ restarts work like the LRQE restarts described here, with the SDT hit override flag used when it is determined, before restarting, that the instruction will finish with an exception.
FIG. 10 illustrates an ERT eviction (ERTE) table 900 in accordance with one or more embodiments of the invention. The ERTE table 900 enables the LSU to track lines evicted (or invalidated) from the ERT 255. It further enables checking, when an entry is created in the ERT 255, whether a different EA already exists for the same RA on the same thread. The ERTE table 900 is shared by all concurrent threads. In one or more examples, a portion of the ERTE table 900 is reserved for NTC entries. An entry in the ERTE table 900 includes fields for the thread ID, ITAG, EA, and RA; in one or more examples it may include additional fields, and the thread ID may be a four-bit field.
The ERTE table 900 may be viewed as a combination of two tables, ERT_EA and ERT_RA, with a 1:1 correspondence between them. The ERT_EA table is CAMmed with the EA, while the ERT_RA table is CAMmed with the RA. In one or more examples, each table has 64 entries, but in other examples the number of entries may be varied/programmable. If an EA-RA translation is removed from the ERTE table 900, the associated cache line in the EAD table 300 is also invalidated. Thus, the ERT 255 is a superset of all translations in the processor core (other than the TLB and SLB).
The ERTE table 900 keeps track of all translations that are no longer in the ERT 255 but may still be used by in-flight instructions. Each ERTE entry is tagged with the youngest instruction that may have used the evicted translation. Because of the OoO issue of loads and stores, the latest ITAG of all launched instructions in the OoO window is stored in the ERTE table 900. On a flush, the ITAG of the last surviving instruction is stored into all valid entries. At completion, all entries with the same or an older ITAG are released. When the ERTE table 900 is full, it blocks dispatch and waits for instructions to complete (and/or flush), after which the table eventually empties completely. It should be noted that although the examples described herein use the ITAG to track the age of launched instructions, in other examples another monotonically increasing, wrapping tag (e.g., an EATAG, LSTAG, etc.) may be used.
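A minimal sketch of this lifecycle (Python; it ignores ITAG wrap-around and uses illustrative field names):

    # Sketch of ERTE entry tagging and release.
    def evict_to_erte(erte, tid, ea, ra, latest_launched_itag):
        # Tag the evicted translation with the youngest launched ITAG of
        # the thread: no younger instruction can have used it.
        erte.append({"tid": tid, "ea": ea, "ra": ra,
                     "itag": latest_launched_itag})

    def on_completion(erte, completed_itag):
        # Entries tagged with the same or an older ITAG can no longer be
        # referenced by in-flight instructions, so release them.
        erte[:] = [e for e in erte if e["itag"] > completed_itag]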
When a translation in the ERT 255 or EAD 290 is evicted/invalidated, the EA-RA pair of the evicted entry is added to the ERTE table 900, without a predetermined number of low-order bits, e.g., the last 12 bits. In addition, the new ERTE entry is marked, e.g., with a flag (bit), with the most recent valid ITAG of the thread to which the evicted translation belongs.
When a new address translation (EA to RA) is installed, the LSU compares the RA against the ERT 255 to check whether a different EA-to-RA mapping from the same thread already exists there. If so, the LSU installs the new translation as a synonym in the ERT 255. Thus, the ERT 255 (when the ERTE is used) may hold two entries with different EAs pointing to the same RA for the same thread. Because synonyms among in-flight instructions are not allowed, in one or more examples the LSU initiates an NTC+1 flush to guarantee forward progress.
Balanced flushing is a thread control mechanism that completely flushes a stalled and/or resource-hogging target thread from the system to restore fairness or balance in resource usage. A balanced flush includes a next-to-complete-plus-one (NTC+1) flush, which flushes all instruction groups on the selected thread after the next-to-complete instruction group. An NTC+1 balanced flush clears the execution units, the global completion table, and the EAD of the selected thread. Threads are balance-flushed only when they are stalled at dispatch. A <bf:1> field enables or disables balanced flushing.
In one or more examples, an entry in the ERTE is marked invalid after the OoO-window instructions that could use it have completed. It should be noted that an ERTE entry is marked valid when it is written, i.e., when an EA-RA translation pair is evicted from the ERT 255.
FIG. 11 illustrates a flow diagram of an exemplary method for adding an entry to the ERTE table 900 in accordance with one or more embodiments of the present invention. As described herein, when a new entry is added to the ERT 255, the LSU writes the newly translated EA and RA to a given line, which may be managed by LRU, as shown at 1010. The ERTE table 900 is CAMmed with the RA to check whether the RA is already present in an ERTE entry corresponding to another EA, i.e., to check for a potential multi-hit condition within the same thread at install time, as shown at 1012. If the RA is already present in the ERTE table 900, installation of the EA-RA entry is rejected until the instruction is NTC, and the entry is installed when NTC is detected, as shown at 1015.
Before overwriting an entry in the ERT 255, the LSU reads the EA and RA of the existing entry that is being overwritten by the new entry, as shown at 1020, and stores the read-out entry in the ERTE table 900, as shown at 1030. In addition, when there is a snoop from another core of the processor or a store drain, the ERTE table 900 is CAMmed and the EA is read out, as shown at 1040 and 1050.
FIG. 12 depicts an example sequence diagram of an example set of instructions launched in accordance with one or more embodiments of the invention. The instructions are depicted in program order on the left; they issue OoO, resulting in an operation sequence different from the instruction sequence. For example, consider the following events occurring in chronological order: 1. Instruction M issues OoO, using the translation "ea1, ra1". 2. Instruction K issues OoO, takes an ERT miss, a new entry is installed, and "ra2, ea2" is evicted from the ERT. At this point, the latest in-use ITAG is N (for all lines evicted from the same thread); that is, instructions up to N may have used "ra2, ea2", and no instruction after N can use it. 3. Instruction H issues OoO, takes an ERT miss, and evicts "ra1, ea1" from the ERT; at this point, the latest-use ITAG is Q. 4. The pipeline is flushed and the last surviving instruction has ITAG = E; the next instructions fetched are R, S. 5. Instructions E through R complete in a given cycle, which frees all entries in the ERTE.
Thus, the solution described herein facilitates using only the EA, providing technical advantages such that the ERAT (typically used in processors) is not referenced in the load/store path, and such that timely detection of SHL and suppression of DVAL do not create timing problems. Furthermore, the solution addresses the technical problem that arises from using only EAs, namely that LHS, SHL, and LHL detection can be missed when two different EAs map to the same RA; it does so by using a synonym detection table (SDT) or the ERT eviction table for instructions in the OoO window. The described solution provides various technical advantages, including a reduction in chip area (by not storing the RA), a reduction in power consumption (by not translating EA to RA), and improved latency, among others.
Further, the solution saves power by eliminating the CAM operations that would otherwise determine an RA for an EA every time a load or store address is generated; instead, the EA is used until an EAD miss and an ERT miss occur. Furthermore, since only a single CAM port is now used, the solution also removes RA bus switching throughout the unit.
FIG. 13 shows a flowchart of an example method for the LSU 104 to issue instructions OoO in multi-pipe mode, depending on whether the processor is operating in ST mode or MT mode, in accordance with one or more embodiments of the present invention. For example, the LSU may operate in a 2-load, 2-store (multi-pipe) mode. The LSU 104 selects the instruction to issue from the OoO window, as shown at block 1310. The selected instruction may be a load instruction, a store instruction, or any derivative of such instructions issued by the LSU 104, such as a LARX instruction.
The LSU 104 determines whether the OoO processor is operating in ST mode or MT mode, as shown at block 1320. In ST mode, the processor runs a single thread, and the LSU 104 determines only the LSU pipeline associated with the instruction, as shown at block 1330. For example, if the instruction is a load, the LSU 104 may associate it with the LD0 pipeline, the LD1 pipeline, or any other load pipeline; if it is a store, the LSU 104 may associate it with the ST0 pipeline, the ST1 pipeline, or any other store pipeline.
In addition, the LSU 104 creates/accesses an entry to issue the instruction using the partitions of the LRQF 218, SRQ 220, LRQE 222, and ERT 255 associated with that pipe, as shown at block 1340. For example, if the instruction is a load on pipeline LD0, it uses entries from partitions LRQF0, LRQE0, and ERT0; similarly, a store on pipeline ST0 uses partitions SRQ0 and ERT0. For the LD1 or ST1 pipelines, the LRQF1, LRQE1, ERT1, and SRQ1 partitions are used. Entries are created in a partition on a first-in-first-out basis.
Alternatively, if the processor is operating in MT mode, i.e., multiple threads are executing simultaneously, the LSU 104 determines the thread identifier associated with the selected instruction, as shown at block 1350, and further determines the LSU pipe associated with the instruction, as shown at block 1360. The LSU 104 then identifies the partitions, and the locations within the partitions, of the LRQF 218, SRQ 220, LRQE 222, and ERT 255 in which to create/access an entry to issue the instruction, based on the {thread id, pipe} combination, as shown at block 1370. For example, the LSU restricts certain threads to certain pipelines, e.g., even threads to LD0 and ST0 and odd threads to LD1 and ST1; it should be noted that in other examples the assignment of threads to pipes may differ. The LD0 and ST0 pipelines are associated with the "0"-suffixed partitions, while the LD1 and ST1 pipelines are associated with the "1"-suffixed partitions (or vice versa).
In one or more exemplary embodiments of the invention, each partition is further divided into portions according to the number of threads the processor executes in MT mode. For example, if the processor executes four threads, each of the two partitions in the LSU is further divided into two portions, with each partition associated with a pair of threads. In one or more other exemplary embodiments, where the number of threads in MT mode is other than four, the partitions are divided into a different number of portions based on the number of threads associated with each partition. In the above example, a pair of threads is associated with each partition, and each partition is divided into equal portions, the first thread of the pair using the first portion and the second thread using the second portion. Thus, instructions executing as T0 on LD0/ST0 are associated with the first portion of the LRQF0, LRQE0, SRQ0, and ERT0 partitions, and instructions as T2 on LD0/ST0 with the second portion; likewise, instructions executing as T1 on LD1/ST1 are associated with the first portion of the LRQF1, LRQE1, SRQ1, and ERT1 partitions, and instructions as T3 on LD1/ST1 with the second portion.
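For illustration, the thread-and-pipe partition selection of FIG. 13 can be sketched as follows (Python; the dictionary return shape is an illustrative assumption):

    # Sketch: map (mode, thread) to pipes, partition suffix, and portion.
    def select_partition(mode, thread_id):
        if mode == "ST":
            # Single thread: no pipe/thread partitioning; the "0" and "1"
            # partition copies hold the same values.
            return {"pipes": ("LD0", "LD1", "ST0", "ST1"),
                    "suffix": 0, "portion": 0}
        suffix = thread_id % 2                 # T0/T2 -> "0", T1/T3 -> "1"
        portion = thread_id // 2 if mode == "SMT4" else 0
        return {"pipes": ("LD%d" % suffix, "ST%d" % suffix),
                "suffix": suffix,              # selects LRQFx/LRQEx/SRQx/ERTx
                "portion": portion}            # half of the partition in SMT4

Here select_partition("SMT4", 3) yields the "1"-suffixed structures with portion 1, matching the T3 assignment above.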
Turning now to FIG. 14, a block diagram of a computer system 1400 for implementing some or all aspects of one or more embodiments of the invention is shown. The processing described herein may be implemented in hardware, software (e.g., firmware), or a combination thereof. In an exemplary embodiment, the described methods may be implemented at least partially in hardware and may be part of the microprocessor of a special-purpose or general-purpose computer system 1400, such as a mobile device, personal computer, workstation, minicomputer, or mainframe computer.
In an exemplary embodiment, as shown in fig. 14, computer system 1400 includes a processor 1405, a memory 1412 coupled to a memory controller 1415, and one or more input devices 1445 and/or output devices 1447, such as peripherals, communicatively coupled via a local I/O controller 1435. These devices 1447 and 1445 may include, for example, printers, scanners, microphones, and so forth. A conventional keyboard 1450 and mouse 1455 may be coupled to the I/O controller 1435. I/O controller 1435 may be, for example, one or more buses or other wired or wireless connections, as is known in the art. The I/O controller 1435 may have additional elements to enable communication, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers.
The I/O devices 1447, 1445 may also include devices that communicate with inputs and outputs, such as disk and tape storage, Network Interface Cards (NICs) or modulators/demodulators (for accessing other files, devices, systems, or networks), Radio Frequency (RF) or other transceivers, telephone interfaces, bridges, routers, and so forth.
The processor 1405 is a hardware device for executing hardware instructions or software, particularly those stored in the memory 1412. The processor 1405 may be a custom-made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer system 1400, a semiconductor-based microprocessor (in the form of a microchip or chip set), or another device for executing instructions. The processor 1405 may include caches such as, but not limited to, an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) to speed up virtual-to-physical address translation for both executable instructions and data. The caches may be organized in a hierarchy of multiple cache levels (L1, L2, etc.).
The memory 1412 may include one or a combination of volatile memory elements (e.g., random access memory, RAM, such as DRAM, SRAM, SDRAM, etc.) and non-volatile memory elements (e.g., ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), tape, compact disc read-only memory (CD-ROM), disks, cartridges, etc.). Additionally, the memory 1412 may include electrical, magnetic, optical, or other types of storage media. Note that the memory 1412 can have a distributed architecture, where various components are remote from each other, but can be accessed by the processor 1405.
The instructions in the memory 1412 may include one or more separate programs, each of which includes an ordered listing of executable instructions for implementing logical functions. In the example of fig. 14, the instructions in memory 1412 include a suitable Operating System (OS) 1411. The operating system 1411 may essentially control the execution of other computer programs, and provide scheduling, input-output control, file and data management, memory management, and communication control and related services.
Additional data, including, for example, instructions for processor 1405 or other retrievable information, may be stored in storage 1427, which may be a storage device such as a hard disk drive or solid state drive. The instructions stored in memory 1412 or storage 1427 may include instructions that enable processor 1405 to perform one or more aspects of the dispatch systems and methods of the present disclosure.
The computer system 1400 may also include a display controller 1425 coupled to the display 1430. In an exemplary embodiment, computer system 1400 may also include a network interface 1460 for coupling to a network 1465. Network 1465 may be an IP-based network for communicating between computer system 1400 and external servers, clients, etc. via a broadband connection. The network 1465 sends and receives data between the computer system 1400 and external systems. In an exemplary embodiment, network 1465 may be a managed IP network managed by a service provider. The network 1465 may be implemented wirelessly, e.g., using wireless protocols and technologies such as WiFi, WiMax, etc. Network 1465 may also be a packet-switched network, such as a local area network, wide area network, metropolitan area network, the internet, or other similar type of network environment. Network 1465 may be a fixed wireless network, a wireless Local Area Network (LAN), a wireless Wide Area Network (WAN), a Personal Area Network (PAN), a Virtual Private Network (VPN), an intranet, or other suitable network system, and may include equipment for receiving and transmitting signals.
The systems and methods for providing partitioned load request queues and store request queues may be implemented in whole or in part in a computer program product or in a computer system 1400, such as shown in FIG. 14.
Various embodiments of the present invention are described herein with reference to the accompanying drawings. Alternative embodiments of the invention may be devised without departing from the scope thereof. In the following description and the drawings, various connections and positional relationships (e.g., above, below, adjacent, etc.) are set forth between elements. These connections and/or positional relationships may be direct or indirect unless otherwise specified, and the invention is not intended to be limited in this respect. Thus, coupling of entities may refer to direct or indirect coupling, and positional relationships between entities may be direct or indirect positional relationships. Further, various tasks and process steps described herein may be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.
The following definitions and abbreviations are used to explain the claims and the specification. As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having," "contains," "containing," or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
Additionally, the term "exemplary" is used herein to mean "serving as an example, instance, or illustration," and any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms "at least one" and "one or more" can be understood to include any integer greater than or equal to one, i.e., one, two, three, four, etc. The term "plurality" can be understood to include any integer greater than or equal to two, i.e., two, three, four, five, etc. The term "connected" can include both an indirect "connection" and a direct "connection".
The terms "about," "substantially," "approximately," and variations thereof are intended to encompass the degree of error associated with measuring the particular quantity based on the equipment available at the time of filing this application. For example, "about" may include a range of ±8%, 5%, or 2% of a given value.
For the sake of brevity, conventional techniques related to making and using aspects of the present invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs for implementing various features described herein are well known. Accordingly, for the sake of brevity, many conventional implementation details are only mentioned briefly herein or omitted entirely, and well-known system and/or process details are not provided.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the remote-computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer-readable program instructions, such that the electronic circuitry can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the present invention is intended to be illustrative; it is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A processing unit for executing one or more instructions, the processing unit comprising:
a load-store unit (LSU) configured to execute a plurality of instructions in an out-of-order (OoO) window using a plurality of LSU pipelines by:
selecting an instruction from the OoO window, the instruction using an effective address; and
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline; and
in response to the processing unit operating in a multithreaded mode in which multiple threads are processed concurrently, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipeline by a first thread of the processing unit.
2. The processing unit of claim 1, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipeline.
3. The processing unit of claim 1, the load-store unit further configured to:
in response to the instruction being a store instruction:
in response to the processing unit operating in the single-threaded mode, create a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipeline and create the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipeline; and
in response to the processing unit operating in the multithreaded mode, create the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipeline by the first thread of the processing unit.
4. The processing unit of claim 1, wherein the load reorder queue comprises one partition for each load pipeline of the LSU.
5. The processing unit of claim 4, wherein the LSU executes multiple load instructions concurrently, one load instruction using each respective load pipeline.
6. The processing unit of claim 1, wherein a store reorder queue comprises one partition for each store pipeline of the LSU.
7. The processing unit of claim 6, wherein the LSU executes multiple store instructions concurrently, one store instruction using each respective store pipeline.
8. A computer-implemented method for out-of-order execution of one or more instructions by a processing unit, the method comprising:
receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) instruction window comprising a plurality of instructions to be executed out of order; and
issuing, by the LSU, an instruction from the OoO window by:
selecting an instruction from the OoO window, the instruction using an effective address;
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline; and
in response to the processing unit operating in a multithreaded mode, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipeline by a first thread of the processing unit.
9. The computer-implemented method of claim 8, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipeline.
10. The computer-implemented method of claim 8, further comprising:
in response to the instruction being a store instruction:
in response to the processing unit operating in the single-threaded mode, creating a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipeline and creating the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipeline; and
in response to the processing unit operating in the multithreaded mode, creating the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipeline by the first thread of the processing unit.
11. The computer-implemented method of claim 8, wherein the load reorder queue comprises one partition for each load pipeline of the LSU.
12. The computer-implemented method of claim 11, wherein the LSU executes multiple load instructions concurrently, one load instruction using each respective load pipeline.
13. The computer-implemented method of claim 8, wherein a store reorder queue comprises one partition for each store pipeline of the LSU.
14. The computer-implemented method of claim 13, wherein the LSU executes multiple store instructions concurrently, one store instruction using each respective store pipeline.
15. A computer program product comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions being executable by a processing unit to cause the processing unit to perform operations comprising:
receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) instruction window comprising a plurality of instructions to be executed out of order; and
issuing, by the LSU, an instruction from the OoO window by:
selecting an instruction from the OoO window, the instruction using an effective address;
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-threaded mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipeline and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipeline; and
in response to the processing unit operating in a multithreaded mode, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipeline by a first thread of the processing unit.
16. The computer program product of claim 15, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipeline.
17. The computer program product of claim 15, the operations further comprising, in response to the instruction being a store instruction:
in response to the processing unit operating in the single-threaded mode, creating a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipeline and creating the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipeline; and
in response to the processing unit operating in the multithreaded mode, creating the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipeline by the first thread of the processing unit.
18. The computer program product of claim 15, wherein the load reorder queue comprises one partition for each load pipeline of the LSU.
19. The computer program product of claim 18, wherein the LSU executes multiple load instructions concurrently, one load instruction using each respective load pipeline.
20. The computer program product of claim 15, wherein a store reorder queue comprises one partition for each store pipeline of the LSU, and wherein the LSU executes multiple store instructions concurrently, one store instruction using each respective store pipeline.
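The following is a minimal, illustrative C++ sketch of the entry-allocation scheme recited in claims 1-7: one load reorder queue partition per load pipeline, with each hardware thread restricted to a predetermined portion of a partition in multithreaded mode (the store reorder queue of claims 3, 6, and 7 behaves analogously). This is not the patented implementation; all names (LoadReorderQueue, LrqEntry, allocate) and sizes (two pipelines, two threads, 32 entries per partition) are assumptions made purely for illustration.

// Illustrative sketch only; not the patented implementation. The pipeline
// count, thread count, and partition depth below are invented for clarity.
#include <array>
#include <cstdint>
#include <optional>
#include <utility>

constexpr int kNumLoadPipes = 2;    // assumption: two load pipelines (claims 1, 4)
constexpr int kEntriesPerPart = 32; // assumption: illustrative partition depth
constexpr int kNumThreads = 2;      // assumption: two threads in multithreaded mode

enum class Mode { SingleThreaded, Multithreaded };

struct LrqEntry {
    uint64_t effective_address = 0; // the EA the load instruction uses
    int thread_id = 0;
    bool valid = false;
};

class LoadReorderQueue {
public:
    explicit LoadReorderQueue(Mode mode) : mode_(mode) {}

    // Allocates an entry for a load issued on load pipeline `pipe` by `thread`.
    // Returns the (partition, slot) chosen, or std::nullopt if the eligible
    // region is full, in which case the issue would be held back or rejected.
    std::optional<std::pair<int, int>> allocate(int pipe, int thread, uint64_t ea) {
        // One partition per load pipeline (claim 4): the pipeline number
        // alone selects the partition.
        auto& partition = partitions_[pipe];

        // Single-threaded mode: the whole partition is eligible.
        // Multithreaded mode: each thread owns a predetermined, dedicated
        // slice of the partition (claims 1 and 2).
        int lo = 0;
        int hi = kEntriesPerPart;
        if (mode_ == Mode::Multithreaded) {
            const int span = kEntriesPerPart / kNumThreads;
            lo = thread * span;
            hi = lo + span;
        }
        for (int slot = lo; slot < hi; ++slot) {
            if (!partition[slot].valid) {
                partition[slot] = {ea, thread, true};
                return std::make_pair(pipe, slot);
            }
        }
        return std::nullopt;
    }

private:
    Mode mode_;
    std::array<std::array<LrqEntry, kEntriesPerPart>, kNumLoadPipes> partitions_{};
};

For example, under this sketch a load issued on pipeline 1 in single-threaded mode always lands somewhere in partition 1, while a load issued on pipeline 0 by thread 1 in multithreaded mode lands only in thread 1's slice of partition 0. Because each partition is written and searched only on behalf of its own pipeline, an ordering-hazard lookup against a partition compares a single incoming address per cycle, which is consistent with each partition needing only the single CAM port that the title describes.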
CN201880061955.4A 2017-10-06 2018-10-03 Load-store unit with partitioned reorder queue using a single CAM port Active CN111133413B (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US15/726,596 2017-10-06
US15/726,627 2017-10-06
US15/726,596 US10606591B2 (en) 2017-10-06 2017-10-06 Handling effective address synonyms in a load-store unit that operates without address translation
US15/726,627 US11175924B2 (en) 2017-10-06 2017-10-06 Load-store unit with partitioned reorder queues with single cam port
US15/825,494 2017-11-29
US15/825,494 US10606592B2 (en) 2017-10-06 2017-11-29 Handling effective address synonyms in a load-store unit that operates without address translation
US15/825,453 US11175925B2 (en) 2017-10-06 2017-11-29 Load-store unit with partitioned reorder queues with single cam port
US15/825,453 2017-11-29
PCT/IB2018/057695 WO2019069256A1 (en) 2017-10-06 2018-10-03 Load-store unit with partitioned reorder queues with single cam port

Publications (2)

Publication Number Publication Date
CN111133413A true CN111133413A (en) 2020-05-08
CN111133413B CN111133413B (en) 2023-09-29

Family

ID=65994519

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201880061956.9A Active CN111133421B (en) 2017-10-06 2018-10-03 Processing effective address synonyms in a load store unit operating without address translation
CN201880061955.4A Active CN111133413B (en) 2018-10-03 Load-store unit with partitioned reorder queue using a single CAM port

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201880061956.9A Active CN111133421B (en) 2017-10-06 2018-10-03 Processing effective address synonyms in a load store unit operating without address translation

Country Status (5)

Country Link
JP (2) JP7025100B2 (en)
CN (2) CN111133421B (en)
DE (2) DE112018004006B4 (en)
GB (2) GB2579534B (en)
WO (2) WO2019069255A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023056289A (en) 2021-10-07 2023-04-19 富士通株式会社 Arithmetic processing unit, and arithmetic processing method
CN114780146B (en) * 2022-06-17 2022-08-26 深流微智能科技(深圳)有限公司 Resource address query method, device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102016800A (en) * 2008-04-28 2011-04-13 国际商业机器公司 Interfacing multiple logical partitions to a self-virtualizing input/output device
US20130212585A1 (en) * 2012-02-10 2013-08-15 Thang M. Tran Data processing system operable in single and multi-thread modes and having multiple caches and method of operation
US20160117173A1 (en) * 2014-10-24 2016-04-28 International Business Machines Corporation Processor core including pre-issue load-hit-store (lhs) hazard prediction to reduce rejection of load instructions

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6694425B1 (en) 2000-05-04 2004-02-17 International Business Machines Corporation Selective flush of shared and other pipeline stages in a multithread processor
US6931639B1 (en) * 2000-08-24 2005-08-16 International Business Machines Corporation Method for implementing a variable-partitioned queue for simultaneous multithreaded processors
US7343469B1 (en) * 2000-09-21 2008-03-11 Intel Corporation Remapping I/O device addresses into high memory using GART
US20040117587A1 (en) 2002-12-12 2004-06-17 International Business Machines Corp. Hardware managed virtual-to-physical address translation mechanism
US7730282B2 (en) * 2004-08-11 2010-06-01 International Business Machines Corporation Method and apparatus for avoiding data dependency hazards in a microprocessor pipeline architecture using a multi-bit age vector
US8145887B2 (en) * 2007-06-15 2012-03-27 International Business Machines Corporation Enhanced load lookahead prefetch in single threaded mode for a simultaneous multithreaded microprocessor
US7711929B2 (en) 2007-08-30 2010-05-04 International Business Machines Corporation Method and system for tracking instruction dependency in an out-of-order processor
US8639884B2 (en) * 2011-02-28 2014-01-28 Freescale Semiconductor, Inc. Systems and methods for configuring load/store execution units
US9182991B2 (en) * 2012-02-06 2015-11-10 International Business Machines Corporation Multi-threaded processor instruction balancing through instruction uncertainty
GB2503438A (en) * 2012-06-26 2014-01-01 Ibm Method and system for pipelining out of order instructions by combining short latency instructions to match long latency instructions
CN103198028B (en) * 2013-03-18 2015-12-23 华为技术有限公司 A kind of internal storage data moving method, Apparatus and system
US9740409B2 (en) * 2013-12-13 2017-08-22 Ineda Systems, Inc. Virtualized storage systems
US10089240B2 (en) * 2014-12-26 2018-10-02 Wisconsin Alumni Research Foundation Cache accessed using virtual addresses

Also Published As

Publication number Publication date
CN111133421B (en) 2023-09-29
JP7025100B2 (en) 2022-02-24
JP7064273B2 (en) 2022-05-10
GB2579757A (en) 2020-07-01
DE112018004006T5 (en) 2020-04-16
WO2019069256A1 (en) 2019-04-11
DE112018004004T5 (en) 2020-04-16
CN111133413B (en) 2023-09-29
GB2579534A (en) 2020-06-24
GB2579534B (en) 2020-12-16
GB202006338D0 (en) 2020-06-17
GB202006344D0 (en) 2020-06-17
CN111133421A (en) 2020-05-08
GB2579757B (en) 2020-11-18
JP2020536310A (en) 2020-12-10
WO2019069255A1 (en) 2019-04-11
JP2020536308A (en) 2020-12-10
DE112018004006B4 (en) 2021-03-25

Similar Documents

Publication Title
US10963248B2 (en) Handling effective address synonyms in a load-store unit that operates without address translation
US10977047B2 (en) Hazard detection of out-of-order execution of load and store instructions in processors without using real addresses
US10776113B2 (en) Executing load-store operations without address translation hardware per load-store unit port
US10534616B2 (en) Load-hit-load detection in an out-of-order processor
US11175925B2 (en) Load-store unit with partitioned reorder queues with single cam port
US10324856B2 (en) Address translation for sending real address to memory subsystem in effective address based load-store unit
US10606590B2 (en) Effective address based load store unit in out of order processors
US10579387B2 (en) Efficient store-forwarding with partitioned FIFO store-reorder queue in out-of-order processor
US10572256B2 (en) Handling effective address synonyms in a load-store unit that operates without address translation
US9740623B2 (en) Object liveness tracking for use in processing device cache
CN111133413B (en) Load-store unit with partitioned reorder queue using a single CAM port
US10579384B2 (en) Effective address based instruction fetch unit for out of order processors
EP3321810A1 (en) Processor with instruction cache that performs zero clock retires

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant