WO2018034876A1 - Tracking stores and loads by bypassing load store units - Google Patents

Info

Publication number: WO2018034876A1
Authority: WO (WIPO PCT)
Prior art keywords: scheduler, load, store, unit, memory
Application number: PCT/US2017/045640
Other languages: French (fr)
Inventors: Betty Ann MCDANIEL, Michael D. Achenbach, David N. Suggs, Frank C. Galloway, Kai Troester, Krishnan V. Ramani
Original assignee: Advanced Micro Devices, Inc.
Application filed by Advanced Micro Devices, Inc.
Priority to: JP2019510366A (JP7084379B2), KR1020197005603A (KR102524565B1), CN201780050033.9A (CN109564546B), EP17841859.6A (EP3500936A4)
Publication of WO2018034876A1

Classifications

    • All classifications fall under G (PHYSICS) > G06 (COMPUTING; CALCULATING OR COUNTING) > G06F (ELECTRIC DIGITAL DATA PROCESSING):
    • G06F12/0246: Memory management in non-volatile memory (e.g. resistive RAM or ferroelectric memory) in block erasable memory, e.g. flash memory
    • G06F3/0611: Improving I/O performance in relation to response time
    • G06F9/3834: Operand accessing; maintaining memory consistency
    • G06F12/0871: Allocation or management of cache space
    • G06F12/0897: Caches characterised by their organisation or structure, with two or more cache hierarchy levels
    • G06F3/0631: Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0643: Organizing or formatting or addressing of data; management of files
    • G06F3/0659: Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F3/0673: In-line storage system; single storage device
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3013: Organisation of register space according to data content, e.g. floating-point registers, address registers
    • G06F9/3826: Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3838: Dependency mechanisms, e.g. register scoreboarding
    • G06F9/384: Register renaming
    • G06F2212/1024: Latency reduction (performance improvement)
    • G06F2212/304: Providing cache or TLB in main memory subsystem
    • G06F2212/463: Caching storage objects of specific type in disk cache: file
    • G06F2212/604: Details relating to cache allocation

Definitions

  • When the load 325 is a load-operation or a pure load, the operand that would normally come from memory, such as cache 134 or L2 136, is instead provided by the MdArn.
  • The load 325 still executes an address generation, and LS unit 139 verifies the correctness of the memory renaming flow 300 while abstaining from returning data. The LS unit 139 also checks that there have been no intervening stores to the given address, which would break the renamed store-load association. If verification fails, LS unit 139 resynchronizes load 325 by re-performing it.
  • Resynchronizing load 325 includes flushing the pipeline and restarting execution from scratch, beginning with the load.
  • Figure 4 illustrates a method 400 for memory renaming in conjunction with LS unit 139 within the core processing unit 105 of Figure 1.
  • Method 400 includes storing data in MdArns alongside the traditional storage path at step 410.
  • At step 420, method 400 allocates and writes an entry in the MEMFILE 310 based on the MdArn storage.
  • A free destination PRN is allocated and the map is written at step 430.
  • The system monitors load requests at step 440.
  • The base, index, displacement and match/hit in MEMFILE 310 are checked within the dispatch logic where MEMFILE 310 resides, such as between micro-op queue 128 and map 320 (within renamer 150, as discussed), at step 450.
  • On a hit, the LS unit 139 is prevented from returning data, and the entry for the load is provided from the MdArn identified via the MEMFILE at step 460. At step 470, the LS unit 139 verifies that the store-load pair is correctly associated; if it is not, the load is flushed and re-executed. A code sketch of this flow appears after this list.
  • Figure 5 illustrates a diagram of an example device 500 in which one or more portions of one or more disclosed examples may be implemented.
  • The device 500 can be, for example, a head mounted device, a server, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer.
  • The device 500 includes a compute node or processor 502, a memory 504, a storage 506, one or more input devices 508, and one or more output devices 510.
  • The device 500 may also optionally include an input driver 512 and an output driver 514. It is understood that the device 500 may include additional components not shown in Figure 5.
  • The compute node or processor 502 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU.
  • The memory 504 may be located on the same die as the compute node or processor 502, or may be located separately from the compute node or processor 502.
  • The memory 504 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
  • The storage 506 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive.
  • The input devices 508 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
  • The output devices 510 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
  • The input driver 512 communicates with the compute node or processor 502 and the input devices 508, and permits the compute node or processor 502 to receive input from the input devices 508.
  • The output driver 514 communicates with the compute node or processor 502 and the output devices 510, and permits the processor 502 to send output to the output devices 510. It is noted that the input driver 512 and the output driver 514 are optional components, and that the device 500 will operate in the same manner if the input driver 512 and the output driver 514 are not present.
  • Also disclosed is a computer readable non-transitory medium including instructions which, when executed in a processing system, cause the processing system to execute a method for load and store allocations at address generation time.
  • Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine.
  • Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable medium).
  • The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
  • Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).
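As referenced in the method-400 steps above, the following ties steps 440 through 470 together in one minimal sketch. Every structure and helper here (memfile, map_file, prf, the verify stub) is an illustrative stand-in, not the patent's hardware:

```python
memfile = {('rbp', None, -8): 0}   # steps 410-420: store recorded, MdArn 0
map_file = {0: 21}                 # step 430: MdArn 0 mapped to PRN 21
prf = {21: 0xCAFE}                 # store data parked in the PRF

def ls_unit_load(key):
    return 0xCAFE                  # normal LS-unit path (stubbed)

def ls_unit_verify(key):
    return True                    # step 470 check: no intervening store (stubbed)

def handle_load(key):              # step 440: a load request arrives
    mdarn = memfile.get(key)       # step 450: base/index/displacement match check
    if mdarn is None:
        return ls_unit_load(key)   # miss: the LS unit serves the load
    data = prf[map_file[mdarn]]    # step 460: hit, bypass the LS unit
    if not ls_unit_verify(key):    # step 470: bad pairing -> flush and redo
        raise RuntimeError("resync: flush and re-execute the load")
    return data

assert handle_load(('rbp', None, -8)) == 0xCAFE
```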

Abstract

A system and method for tracking stores and loads to reduce load latency when forming the same memory address by bypassing a load store unit within an execution unit is disclosed. The system and method include storing data in one or more memory dependent architectural register numbers (MdArns), allocating the one or more MdArns to a MEMFILE, writing the allocated one or more MdArns to a map file, wherein the map file contains a MdArn map to enable subsequent access to an entry in the MEMFILE, upon receipt of a load request, checking a base, an index, a displacement and a match/hit via the map file to identify an entry in the MEMFILE and an associated store, and on a hit, providing the entry responsive to the load request from the one or more MdArns.

Description

TRACKING STORES AND LOADS BY BYPASSING
LOAD STORE UNITS
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Application
No. 62/377,301 filed August 19, 2016 and U.S. Patent Application No. 15/380,778 filed December 15, 2016, which are incorporated by reference as if fully set forth herein.
BACKGROUND
[0002] Present computer systems provide loads and stores for memory access using load queues and store queues. Generally, these systems operate using store-to-load forwarding. However, store-to-load forwarding fails to provide the lowest latency solution for situations where the loads and stores are directed to the same address.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] A more detailed understanding can be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
[0004] Figure 1 illustrates a core processing unit of a processor in accordance with certain implementations;
[0005] Figure 2 illustrates a load store (LS) unit for handling data access within the core processing unit of Figure 1;
[0006] Figure 3 illustrates a hardware flow of memory renaming in conjunction with LS unit within the core processing unit of Figure 1;
[0007] Figure 4 illustrates a method for memory renaming in conjunction with LS unit within the core processing unit of Figure 1; and
[0008] Figure 5 illustrates a diagram of an example device in which one or more portions of one or more disclosed examples may be implemented.
DETAILED DESCRIPTION
[0009] Memory renaming is a way of tracking stores and loads to the same address and bypassing a load store unit when a load follows an associated store. This scenario happens frequently. As an example, memory renaming is needed when a program stores data via a store queue, performs other processing, then loads the same data via a load queue; this load follows an associated store. Programs often seek to load data that has recently been stored.
[0010] A system and method for tracking stores and loads by bypassing a load store unit is disclosed. The system and method include storing data in one or more memory dependent architectural register numbers (MdArns). The one or more MdArns are allocated to an in-memory file cache (MEMFILE). The allocated one or more MdArns are written to a map file, wherein the map file contains an MdArn map to enable subsequent access to an entry in the MEMFILE. Upon receipt of a load request, a base, an index, a displacement and a match/hit are checked via the map file to identify an entry in the MEMFILE and an associated store. On a hit, the entry is provided responsive to the load request from the one or more MdArns.
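To make that flow concrete, here is a minimal behavioral sketch in Python of the store-load bypass being claimed. Everything here is an illustrative stand-in rather than the patent's hardware; in particular, real MEMFILE entries hold register numbers and only partial displacement bits, not full address tuples:

```python
from dataclasses import dataclass

@dataclass
class MemfileEntry:
    base: int    # base register number
    index: int   # index register number
    disp: int    # displacement (hardware keeps only part of it)
    mdarn: int   # MdArn assigned to the store

memfile = []       # age-ordered store records
mdarn_to_prn = {}  # map file: MdArn -> physical register number (PRN)
prf = {}           # physical register file: PRN -> data

def dispatch_store(base, index, disp, data, mdarn, prn):
    prf[prn] = data                  # store data parked in the PRF...
    mdarn_to_prn[mdarn] = prn        # ...and reachable via the MdArn
    memfile.append(MemfileEntry(base, index, disp, mdarn))
    # The store still goes to the LS unit / memory as usual.

def dispatch_load(base, index, disp):
    # Youngest matching store wins; on a hit the LS unit only verifies.
    for e in reversed(memfile):
        if (e.base, e.index, e.disp) == (base, index, disp):
            return prf[mdarn_to_prn[e.mdarn]]   # bypass the LS unit
    return None   # miss: the load is served by the LS unit normally

dispatch_store(base=3, index=0, disp=0x10, data=42, mdarn=0, prn=17)
assert dispatch_load(3, 0, 0x10) == 42
```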
[0011] Figure 1 is a high level block and flow diagram of a core processing unit 105 of a processor 100 in accordance with certain implementations. The core processing unit 105 includes, but is not limited to, a decoder unit 110 which provides micro operations (micro-ops) to a scheduler and/or execution unit 115. The decoder unit 110 includes, but is not limited to, a branch predictor 120 connected to a cache 122 and a micro-op cache 124. The cache 122 is further connected to a decoder 126. The decoder 126 and the micro-op cache 124 are connected to a micro-op queue 128.
[0012] The scheduler and/or execution unit 115 includes, but is not limited to, an integer scheduler and/or execution unit 130 and a floating point scheduler and/or execution unit 132, both of which are connected to a cache 134. The cache 134 is further connected to an L2 cache 136, load queues 138, and store queues 140. Load queues 138, store queues 140, and cache 134 are collectively referred to as load store (LS) unit 139.
[0013] The integer scheduler and/or execution unit 130 includes, but is not limited to, an integer renamer 150 which is connected to a scheduler 151, which includes arithmetic logic unit (ALU) schedulers (ALSQs) 152 and address generation unit (AGU) schedulers (AGSQs) 154. The scheduler 151, and in particular the ALSQs 152 and AGSQs 154, are further connected to ALUs 156 and AGUs 158, respectively. The integer scheduler and/or execution unit 130 also includes an integer physical register file 160.
[0014] The floating point scheduler and/or execution unit 132 includes, but is not limited to, a floating point renamer 170 which is connected to a scheduler 172. The scheduler 172 is further connected to multipliers 174 and adders 176. The floating point scheduler and/or execution unit 132 also includes a floating point physical register file 178.
[0015] A pipelined processor requires a steady stream of instructions to be fed into the pipeline. The branch predictor 120 predicts which set of instructions should be fetched and executed in the pipelined processor. These instructions are fetched and stored in cache 122, and when read from cache 122 are decoded into operations by the decoder 126. A micro-op cache 124 caches the micro-ops as the decoder 126 generates them. The micro-op queue 128 stores and queues up the micro-ops from the decoder 126 and micro-op cache 124 for dispatching the micro-ops for execution.
[0016] In conventional pipeline processing, a micro-op queue dispatches certain operations, such as load or store operations, directly to a load queue and/or a store queue that holds the payloads, such as control information decoded from the operation, and memory addresses associated with the micro-ops. For purposes of illustration, the store queue may accept a plurality of operations from the micro-op queue and write the payload into the store queue at dispatch time. At address generation time, the store queue then receives a queue index from a scheduler to specify which store entry is being processed. The scheduler reads out the dispatch payload and sends it to segmentation logic for segmentation checks, and to a load queue for a possible pick on the micro-op pipeline. That is, conventional pipeline processing is a two-pass write process with respect to the store and load queues: once at dispatch for the payload and again at address generation to generate the address in memory.
[0017] In accordance with an implementation, the micro-ops are dispatched to the integer scheduler and/or execution unit 130 and the floating point scheduler and/or execution unit 132 only, instead of directly writing to the load queues 138 and store queues 140 as per the conventional pipeline processing. In particular, the micro-ops are directed to: (1) the scheduler 151 via the integer renamer 150; and (2) the scheduler 172 via the floating point renamer 170. The scheduler 151 holds all of the dispatch payloads for the micro-ops (e.g., the dispatch payloads for the store micro-ops) in the AGSQ 154. That is, the AGSQ 154 holds the micro-ops (e.g., the load and store micro-ops), until a queue entry in the appropriate load queues 138 and/or store queues 140 is available. Once a queue entry is available and the sources for the physical register file 160 are ready, the AGSQ 154 generates the address, reads the dispatch payload and sends the dispatch payload to the load queues 138 and/or store queues 140.
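As a toy model of this single-pass scheme (the queue size and flat payload tuple are assumptions for illustration), the key point is that the store queue is written exactly once, at address-generation time, and only after an entry has freed up:

```python
from collections import deque

STQ_DEPTH = 4                          # illustrative, not the real depth
store_queue = {}                       # entry index -> (address, payload)
free_entries = deque(range(STQ_DEPTH))
agsq = deque()                         # dispatch payloads parked in the AGSQ

def compute_address(payload):          # stand-in for the AGU
    base, index, disp = payload
    return base + index + disp

def dispatch(payload):
    agsq.append(payload)               # no store-queue write at dispatch

def address_generate():
    # Single write pass: the payload and the generated address reach the
    # store queue together, and only once a queue entry is available.
    if agsq and free_entries:
        payload = agsq.popleft()
        entry = free_entries.popleft()
        store_queue[entry] = (compute_address(payload), payload)

dispatch((0x1000, 0, 8))
address_generate()
assert store_queue[0] == (0x1008, (0x1000, 0, 8))
```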
[0018] In order to maintain age-ordered operation or in-order queues, every store micro-op is associated with a particular queue entry or queue identifier. In particular, the scheduler 151 needs to know when the AGSQ 154 can perform address generation and when the scheduler 151 can send the stored data (i.e., the dispatch payload) to the store queue 140. Accordingly, a particular queue entry is communicated by the store queue 140 to the AGSQ 154 when that entry is available. While this communication chain is not specifically shown in Figure 1, it is provided as a general matter.
[0019] The load queues 138 and store queues 140 send the scheduler 151
(AGSQ 154 and ALSQ 152) a commit-deallocation signal so that the scheduler 151 (AGSQ 154 and ALSQ 152) can update its oldest store micro-op store queue index to enable address generation or to send store data for younger store micro-ops as those older store micro-ops deallocate and free up their respective store queue entries. This can be implemented, for example, by adding an output (not shown) from the load queues 138 and store queues 140 to an input at the scheduler 151 (AGSQ 154 and ALSQ 152).
[0020] By holding all dispatch information in the AGSQ 154 and delaying store queue allocation until address generation time (e.g., storing data for store micro-ops whose store queue entry is still in use by the previous store micro-op), more store micro-ops can be dispatched than the store queue 140 size. By eliminating the source of dispatch stalls, further micro-ops can be introduced in the window and allowed to start their work. Any store micro-ops will not be able to get started until the previous store in their store queue entry deallocates, but other micro-ops can proceed. This allows for loads that may be cache misses to dispatch and/or perform address generation in order to start the cache miss.
[0021] Support for handling a greater number of stores in the window than there are store queue entries necessitates a way to compare the age of micro-ops. That comparison is provided by the store queue entry number associated with each micro-op together with "wrap" bits that accompany the entry number. The wrap bits determine which "epoch" of the store queue entry the associated store micro-ops will use. A single wrap bit provides a way to track two different "wraps" or "epochs" of the store queue, which enables dispatching the full store queue (XC_STQDEPTH). When more store micro-ops are allowed to dispatch than store queue entries, there can be micro-ops in the window with the same store queue entry, but from multiple different "wraps" or "epochs" of the store queue. One additional wrap bit, for a total of two wrap bits, provides a way to track four different "wraps" or "epochs" of the store queue, which enables dispatching up to three times the store queue depth.
[0022] In an illustrative example, if the implemented architecture has a store queue depth of 44 and there are two 14-entry AGSQs (for up to 28 additional micro-op stores at address generation), then there are a total of 72 stores that are able to be dispatched in the window. Accordingly, the processor will not dispatch more than twice the store queue depth. Two wrap bits are sufficient to track and compare the age of all 72 stores in the machine, and no dispatch stall is needed. The wrap bits are computed at dispatch and are held in the AGSQ payload. If the AGSQ scheduler depth allows dispatch of stores more than three times the store queue depth, additional wrap bits could be added to enable an arbitrary number of stores to dispatch.
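One way to realize the wrap-bit age comparison is modular arithmetic on a (wrap, entry) tag. The encoding below is an assumption consistent with the example above (44-entry store queue, two wrap bits, at most 72 stores in flight), not the actual circuit:

```python
STQ_DEPTH = 44
WRAP_BITS = 2
TAG_SPACE = (1 << WRAP_BITS) * STQ_DEPTH   # 176 distinct (wrap, entry) tags

def tag(seq):
    """Age tag held in the AGSQ payload for the seq-th dispatched store."""
    return (seq // STQ_DEPTH) % (1 << WRAP_BITS), seq % STQ_DEPTH

def is_older(a, b):
    """True if the store tagged a dispatched before the store tagged b.
    Sound while in-flight stores span at most half the tag space
    (72 in the example, against 88)."""
    pos_a = a[0] * STQ_DEPTH + a[1]
    pos_b = b[0] * STQ_DEPTH + b[1]
    return (pos_b - pos_a) % TAG_SPACE < TAG_SPACE // 2

assert is_older(tag(10), tag(50))
assert not is_older(tag(50), tag(10))
assert is_older(tag(170), tag(180))   # still correct across the wrap at 176
```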
[0023] The load micro-ops are not necessarily age-ordered and can use other techniques known to those skilled in the art to control execution order of the instructions. In an implementation, the load micro-ops can operate similarly to the store micro-ops.
[0024] From an architecture perspective, the implementations described herein solve the issues outlined above. First, the number of dispatch payload write ports can be reduced in the store queue. For example, the number of dispatch payload write ports can be reduced from four (four stores per cycle at dispatch) to two (two store address generations per cycle). Second, difficult timing paths are eliminated. For example, the timing path that involved sending the queue index to the store queue, reading out the payload and then sending the payload to the segmentation logic and load queue is eliminated.
[0025] Once address generation is performed by the AGSQs 154 and the data/dispatch payloads are stored in the load queues 138 and store queues 140 as needed, the core processing unit 105 executes the micro-ops. The load queues 138 and store queues 140 return data for the load micro-ops and perform writes for store micro-ops, respectively. For other types of operations the scheduler 151 and the scheduler 172 issue micro-ops to the integer scheduler and/or execution unit 130 and floating-point scheduler and/or execution unit 132 as their respective sources become ready.
[0026] As will be discussed in greater detail herein below, decoder 126, physical register file 160 and LS unit 139 are communicatively coupled.
[0027] Figure 2 illustrates load store (LS) unit 139 for handling data access within the processor 100. LS unit 139 includes a load queue 210 and a store queue 215, each operatively coupled to a data cache 220. The LS unit 139 is configured into pipelines, collectively 225 and 230, that are independent. In an implementation, the LS unit 139 includes three pipelines, collectively 225 and 230, enabling execution of two load memory operations 225A, 225B and one store memory operation 230 per cycle.
[0028] Load queue 210 of LS unit 139 includes a plurality of entries. In an implementation, load queue 210 includes 44 entries. Load queue 210 receives load operations at dispatch and loads leave load queue 210 when the load has completed and delivered data to the integer scheduler and/or execution unit 130 or the floating point scheduler and/or execution unit 132.
[0029] Store queue 215 includes a plurality of entries. In an implementation, store queue 215 includes 44 entries. Although this example is equal to the number of entries in the example load queue 210 above, an equal number of entries is not needed in load queue 210 and store queue 215. Store queue 215 holds stores from dispatch until the store data is written to data cache 220.
[0030] Data cache 220 caches data until storage in L2 235 is performed.
Data cache 220 is a hardware or software component that stores data so future requests for that data can be served faster. Data stored in data cache 220 can be the result of an earlier computation, the duplicate of data stored elsewhere, or store data from store queue 215. L2 235 may be a slower and/or larger version of data cache 220.
[0031] LS unit 139 dynamically reorders operations, supporting both load operations using load queue 210 bypassing older loads and store operations using store queue 215 bypassing older non-conflicting stores. LS unit 139 ensures that the processor adheres to the architectural load/store ordering rules as defined by the system architecture of processor 100 via load queue 210 and store queue 215.
[0032] LS unit 139 supports store-to-load forwarding (STLF) when there is an older store that contains all of the load's bytes, and the store's data has been produced and is available in the store queue 215. The load from STLF does not require any particular alignment relative to the store as long as it is fully contained within the store.
[0033] In the computing system including processor 100, certain address bits are assigned to determine STLF eligibility. Importantly, the computer system avoids having multiple stores with the same address bits, but destined for different addresses, in process simultaneously, since this is the case where a load may need STLF. Generally, loads that follow stores to similar address bits use the same registers, and accesses are grouped closely together. This grouping avoids intervening modifications or writes to the register used by the store and load when possible, and allows LS unit 139 to track "in-flight" loads/stores. For example, the LS unit 139 may track "in-flight" cache misses.
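The containment rule for STLF eligibility reduces to a pair of byte-range comparisons, sketched here with full addresses (the hardware, as noted, works on a subset of address bits):

```python
def stlf_eligible(store_addr, store_size, load_addr, load_size):
    """An older store can forward iff it produces every byte of the load."""
    return (store_addr <= load_addr and
            load_addr + load_size <= store_addr + store_size)

assert stlf_eligible(0x100, 8, 0x104, 4)       # 4-byte load inside an 8-byte store
assert not stlf_eligible(0x100, 4, 0x102, 4)   # load spills past the store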
[0034] LS unit 139 and the associated pipelines 225A, 225B, 230 are optimized for simple address generation modes. Base+displacement, base+index, and displacement-only addressing modes (regardless of displacement size) are considered simple addressing modes and achieve 4-cycle load-to-use integer load latency and 7-cycle load-to-use floating point (FP) load latency. Addressing modes where both an index and displacement are present, such as the commonly used 3-source addressing modes with base+index+displacement, and any addressing mode utilizing a scaled index, such as x2, x4, or x8 scales, are considered complex addressing modes and require an additional cycle of latency to compute the address. Complex addressing modes achieve a 5-cycle (integer) / 8-cycle (FP) load-to-use latency. Generally, these systems perform best when latency-sensitive code avoids complex addressing modes such as scaled-index or index+displacement modes.
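Restated as a toy classifier (a paraphrase of the stated latency rule, not production logic):

```python
def load_to_use_cycles(base=None, index=None, disp=None, scale=1, fp=False):
    """Simple modes: 4 (integer) / 7 (FP) cycles. Complex modes add one."""
    complex_mode = (index is not None and disp is not None) or scale != 1
    return (7 if fp else 4) + (1 if complex_mode else 0)

assert load_to_use_cycles(base='rax', disp=0x10) == 4            # base+disp
assert load_to_use_cycles(base='rax', index='rbx') == 4          # base+index
assert load_to_use_cycles(disp=0x601040) == 4                    # disp only
assert load_to_use_cycles(base='rax', index='rbx', disp=8) == 5  # base+index+disp
assert load_to_use_cycles(base='rax', index='rbx', scale=4) == 5 # scaled index
assert load_to_use_cycles(base='rax', index='rbx', scale=4, fp=True) == 8
```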
[0035] Figure 3 illustrates a hardware flow 300 of memory renaming in conjunction with LS unit 139 within the core processing unit 105 of Figure 1. Figure 3 shows the hardware flow 300 of tracking stores and loads by bypassing the LS unit 139. Specifically, memory renaming is the method for tracking stores and loads to the same address while bypassing the LS unit 139 when a load follows an associated store. Memory renaming is used to optimize the forwarding of data from store to load. The use of memory renaming generally operates without involving the resources of LS unit 139. In essence, memory renaming enables data to be "remembered" in integer scheduler and/or execution unit 130 and floating point scheduler and/or execution unit 132.
[0036] In general, in order to enable the "remembering", micro-architectural registers that are memory dependent architectural register numbers (MdArns) are utilized. The MdArns serve as the location for "remembering" data that has been stored to be used on a subsequent load. The MdArns are utilized even though the data is also stored in traditional memory stores. The traditional memory stores occur through the LS unit 139. MdArns are architectural register numbers that are a part of and accessible to integer renamer 150 and/or floating point renamer 170 shown in Figure 1. This allows integer renamer 150 and/or floating point renamer 170 to load data from an MdArn ("remembering") without the need to request the data from the LS unit.
[0037] In an implementation, the information regarding the MdArns is stored in a map 320. Map 320 is a file that includes the MdArn map, which provides the map to what has been stored in specific MdArns. The MdArns are not architecturally visible and are only used internally for memory dependent renaming. Specifically, each entry in map 320 contains a physical register number (PRN) which is an index of the physical register file (PRF) 160, 178 where the given store data is written, in addition to being sent to the LS unit 139. Map 320 enables store data to be forwarded locally to loads and load dependents through renaming using the associated store's MdArn. There are N number of MdArns.
[0038] Hardware flow 300 illustrates the dispatching of N instructions 305. The N instructions 305 are stored as described above with respect to Figures 1 and 2. In addition to the storing process detailed in those figures, stores 315 also use MdArns, including a plurality of individual MdArns 337.1, 337.2 ... 337.n. While Figure 3 illustrates dispatching N MdArns in map 320, the number of intergroup dependencies is constrained by the number of operations that are dispatched simultaneously, such as 6 operations in a 6-wide architecture, for example. Address information for any stores 315 in the current dispatch group is written 308 into the MEMFILE 310 within the decode unit 110; each such store is assigned an MdArn, and renamer 150, 170 maps it to a free PRN, storing the mapping in map 320 just as is done with mapped ARNs. If there are multiple stores to the same address within a dispatch group, only the oldest store is written into the MEMFILE 310 and renamed to an MdArn. MEMFILE 310 is an in-memory file cache.
[0039] Older stores are defined by program order. Within a common dispatch group, operations are in program order. Intergroup dependencies are checked to ensure the correct source. The oldest operation is not dependent on any of the younger operations. For example, the second-oldest operation can depend on the oldest operation, while the youngest operation can depend on any of its older operations.
[0040] Stores 315 are allocated and written 308 to MEMFILE 310 and identified in map 320. As stores 315 are directed to MEMFILE 310 and identified in map 320, they are also compared against dispatched loads 325 for address matches, as shown in 337 (337.1, 337.2 ... 337.n). Additionally, dispatched loads 325 are checked for address matches against stores previously written into the MEMFILE 310, depicted in 347 (347.1, 347.2 ... 347.n). Loads 325 whose addresses match a store in compare logic 337 and 347 are associated with the given store, undergo intergroup dependency checking (350, 360, 370), and are then mapped to the PRN denoted by the store's MdArn.
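The compare-and-map flow of 337/347 can be sketched as a behavioral model; the entry fields (base, index, disp), the dispatch-group interface, and the sequential loops are assumptions for readability, where real hardware would perform these compares in parallel:

    from collections import namedtuple

    Entry = namedtuple("Entry", "base index disp mdarn")

    class Memfile:
        # Behavioral stand-in for MEMFILE 310: an age-ordered rotating
        # FIFO of recently dispatched stores, keyed by address pieces.
        def __init__(self, n_mdarns=6):
            self.entries = []            # oldest entry first
            self.n_mdarns = n_mdarns
            self.next_mdarn = 0

        def write_group(self, stores):
            # Allocate entries for one dispatch group of stores, given
            # in program order; if several stores in the group share an
            # address, only the oldest gets an entry and an MdArn.
            seen = set()
            for base, index, disp in stores:
                if (base, index, disp) in seen:
                    continue
                seen.add((base, index, disp))
                if len(self.entries) == self.n_mdarns:   # FIFO eviction
                    self.entries.pop(0)
                self.entries.append(Entry(base, index, disp,
                                          self.next_mdarn))
                self.next_mdarn = (self.next_mdarn + 1) % self.n_mdarns

        def match_load(self, base, index, disp):
            # Compare a dispatched load against stores in the MEMFILE;
            # return the MdArn of the youngest matching store, else None.
            for e in reversed(self.entries):
                if (e.base, e.index, e.disp) == (base, index, disp):
                    return e.mdarn
            return None

    mf = Memfile()
    mf.write_group([("rbp", None, -8), ("rsp", None, 0)])
    assert mf.match_load("rbp", None, -8) == 0     # hit: rename to MdArn 0
    assert mf.match_load("rdi", None, 16) is None  # miss: normal LS path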
[0041] In an implementation, scheduler and/or execution unit 115 monitors each store 315, in order, in the MEMFILE 310, which is within the decoder 126. In short, in an implementation, the MEMFILE 310 is an age-ordered rotating first-in, first-out (FIFO) structure allocated with each store 315 that is dispatched. Dispatch occurs when instructions have been decoded and are sent to the renamer and scheduling queues (363, 368), such as between micro-op queue 128 and renamer 150 (in the case of the integer renamer). Each entry within MEMFILE 310 contains information about the store 315, such as the base and index registers within physical register file 160, and includes part of the displacement. Each store 315 is allocated an MdArn, of which there are N, in a rotating manner.
[0042] In scheduler and/or execution unit 115, the stores 315 operate as described above with respect to Figures 1 and 2. The store 315 splits into an address generation component and a data movement component that sends the store data to LS unit 139. For memory renaming, the store 315 also moves the store data to the MdArn. During store data movement to the LS unit 139, the physical register file 160 is written for the PRN allocated to that MdArn in map 320.
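The dual write in paragraph [0042] can be sketched as follows, with lsq, prf, and mdarn_map as assumed stand-ins for the store queue, physical register file 160, and map 320:

    lsq = []            # stand-in for the store queue in LS unit 139
    prf = [0] * 160     # stand-in for physical register file 160
    mdarn_map = {}      # stand-in for map 320: MdArn -> PRN

    def execute_store(address, data, mdarn, free_prn):
        # The store's data movement goes to the LS unit as usual ...
        lsq.append((address, data))
        # ... and is also written to the PRN allocated to the store's
        # MdArn, so a later renamed load can read it locally.
        mdarn_map[mdarn] = free_prn
        prf[free_prn] = data

    execute_store(address=0x1000, data=77, mdarn=0, free_prn=42)
    assert lsq[-1] == (0x1000, 77)     # LS unit still sees the store
    assert prf[mdarn_map[0]] == 77     # renamed loads see it locally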
[0043] Memory renaming reduces STLF latency by converting it into a register-to-register move. A subset of operations could additionally be combined with move elimination and accomplished in mapping only, reducing STLF to zero-cycle latency.
[0044] If the load 325 is a load-operation or a pure load, the operand that would normally come from memory, such as cache 134, L2 136, or other memory, is instead provided by the MdArn. The load 325 still executes an address generation, and LS unit 139 verifies the correctness of the memory renaming flow 300 while abstaining from returning data. Additionally, the LS unit 139 checks that there have been no intermediate stores to the given address that would break the renamed store-load association. If verification fails, LS unit 139 resynchronizes load 325 by re-performing it. Resynchronizing load 325 discards all of the work that has been performed, flushes the pipeline, and restarts execution from scratch beginning with the load.
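A sketch of the verification-and-resync behavior; the list-based pipeline and the helper names are assumptions for illustration:

    def association_still_valid(load_addr, store_addr, intervening_stores):
        # The LS unit confirms the load really aliases the associated
        # store and that no intermediate store to that address has
        # broken the renamed store-load association.
        return (load_addr == store_addr and
                load_addr not in intervening_stores)

    def resynchronize(pipeline):
        # Verification failed: discard completed work, flush the
        # pipeline, and re-execute from scratch beginning with the load.
        pipeline.clear()

    in_flight = ["load 325", "ops dependent on load 325"]
    if not association_still_valid(0x100, 0x100,
                                   intervening_stores=[0x100]):
        resynchronize(in_flight)
    assert in_flight == []   # the load and its dependents are replayed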
[0045] Figure 4 illustrates a method 400 for memory renaming in conjunction with LS unit 139 within the core processing unit 105 of Figure 1. Method 400 includes storing instructions in MdArns alongside the traditional storage path at step 410. At step 420, method 400 allocates and writes to MEMFILE 310 based on the MdArn storage. A free destination PRN is allocated and the map is written at step 430. The system monitors load requests at step 440. Upon a load request, the base, index, displacement, and match/hit in MEMFILE 310 are checked within the dispatch logic where MEMFILE 310 resides, such as between micro-op queue 128 and map 320 (within renamer 150, as discussed), at step 450. On a hit, the LS unit 139 is prevented from returning data, and the entry for the load is provided from the MdArn identified in the MEMFILE at step 460. At step 470, the LS unit 139 verifies that the store-load pair is correctly associated. If it is not, the load is flushed and re-executed.
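The steps of method 400 can be strung together in one end-to-end sketch; every structure below (dicts standing in for the MEMFILE, map, PRF, and memory) is an assumption chosen only to make the flow runnable:

    memfile = {}       # step 420: address -> MdArn
    mdarn_map = {}     # step 430: MdArn -> free destination PRN
    prf = {}           # physical register file stand-in
    memory = {}        # traditional storage path through LS unit 139

    def dispatch_store(addr, data, mdarn, free_prn):
        memory[addr] = data          # step 410: traditional store path
        memfile[addr] = mdarn        # step 420: MEMFILE allocation
        mdarn_map[mdarn] = free_prn  # step 430: map the free PRN
        prf[free_prn] = data         # data "remembered" in the PRF

    def dispatch_load(addr):
        mdarn = memfile.get(addr)    # steps 440-450: check for a hit
        if mdarn is not None:
            data = prf[mdarn_map[mdarn]]  # step 460: serve from MdArn
            assert data == memory[addr]   # step 470: LS verification
            return data
        return memory[addr]          # miss: normal load via LS unit

    dispatch_store(0x2000, data=123, mdarn=1, free_prn=9)
    assert dispatch_load(0x2000) == 123  # hit: LS unit returns no data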
[0046] Figure 5 illustrates a diagram of an example device 500 in which one or more portions of one or more disclosed examples may be implemented. The device 500 may include, for example, a head mounted device, a server, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer. The device 500 includes a compute node or processor 502, a memory 504, a storage 506, one or more input devices 508, and one or more output devices 510. The device 500 may also optionally include an input driver 512 and an output driver 514. It is understood that the device 500 may include additional components not shown in Figure 5.
[0047] The compute node or processor 502 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 504 may be located on the same die as the compute node or processor 502, or may be located separately from the compute node or processor 502. The memory 504 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
[0048] The storage 506 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 508 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 510 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
[0049] The input driver 512 communicates with the compute node or processor 502 and the input devices 508, and permits the compute node or processor 502 to receive input from the input devices 508. The output driver 514 communicates with the compute node or processor 502 and the output devices 510, and permits the processor 502 to send output to the output devices 510. It is noted that the input driver 512 and the output driver 514 are optional components, and that the device 500 will operate in the same manner if the input driver 512 and the output driver 514 are not present.
[0050] In general, and without limiting the embodiments described herein, a computer-readable non-transitory medium is provided that includes instructions which, when executed in a processing system, cause the processing system to execute a method for load and store allocations at address generation time.
[0051] It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
[0052] The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer-readable medium). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
[0053] The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).

Claims

What is claimed is:
1. A method for tracking stores and loads to reduce load latency when storing and loading from a same memory address by bypassing a load store (LS) unit within an execution unit, the method comprising:
storing data in one or more memory dependent architectural register numbers (MdArns);
allocating the one or more MdArns to an in-memory file cache;
writing the allocated one or more MdArns to a file, wherein the file contains an MdArn map to enable subsequent access to an entry in the in-memory file cache;
upon receipt of a load request, checking a base, an index, a displacement and a match/hit via the file to identify an entry in the in-memory file cache and an associated store; and
on a hit, providing the entry responsive to the load request from the one or more MdArns.
2. The method of claim 1 further comprising allocating a free destination in a physical register file (PRF).
3. The method of claim 1 further comprising, on a hit, preventing the LS unit from returning data responsive to the load request.
4. The method of claim 1 wherein the preventing occurs in a scheduler.
5. The method of claim 4 wherein the scheduler comprises an arithmetic logic unit (ALU) scheduler (ALSQ).
6. The method of claim 4 wherein the scheduler comprises an address generation unit (AGU) scheduler (AGSQ).
7. A system for tracking stores and loads to reduce load latency when storing and loading from a same memory address by bypassing a load store (LS) unit within an execution unit, the system comprising:
a plurality of memory dependent architectural register numbers (MdArns) for storing data;
an in-memory file cache for allocating the at least one of the plurality of MdArns; and
a file for writing an MdArn map;
wherein, upon receipt of a load request, a base, an index, a displacement and a match/hit are checked via the file to identify an entry in the in-memory file cache and an associated store; and
wherein, on a hit, the entry is provided responsive to the load request from the at least one of the plurality of MdArns.
8. The system of claim 7 further comprising a physical register file (PRF) to allocate a free destination.
9. The system of claim 7 further comprising, on a hit, preventing the LS unit from returning data responsive to the load request.
10. The system of claim 9 wherein the preventing is performed by a scheduler.
11. The system of claim 10 wherein the scheduler comprises an arithmetic logic unit (ALU) scheduler (ALSQ).
12. The system of claim 10 wherein the scheduler comprises an address generation unit (AGU) scheduler (AGSQ).
13. The system of claim 7 wherein the checking is performed by a scheduler and/or execution unit.
14. A non-transient computer readable medium containing program instructions for causing a computer to perform tracking stores and loads to reduce load latency when storing and loading from a same memory address by bypassing a load store (LS) unit within an execution unit, the method comprising:
storing data in one or more memory dependent architectural register numbers (MdArns);
allocating the one or more MdArns in an in-memory file cache;
writing the allocated one or more MdArns to a file, wherein the file contains an MdArn map to enable subsequent access to an entry in the in-memory file cache;
upon receipt of a load request, checking a base, an index, a displacement and a match/hit via the file to identify an entry in the in-memory file cache and an associated store; and
on a hit, providing the entry responsive to the load request from the one or more MdArns.
15. The computer readable medium of claim 14, the method further comprising allocating a free destination in a physical register file (PRF).
16. The computer readable medium of claim 14, the method further comprising, on a hit, preventing the LS unit from returning data responsive to the load request.
17. The computer readable medium of claim 16 wherein the preventing is performed by a scheduler.
18. The computer readable medium of claim 17 wherein the scheduler comprises an arithmetic logic unit (ALU) scheduler (ALSQ).
19. The computer readable medium of claim 17 wherein the scheduler comprises an address generation unit (AGU) scheduler (AGSQ).
20. The computer readable medium of claim 14 wherein the checking is performed by a scheduler and/or execution unit.

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662377301P 2016-08-19 2016-08-19
US62/377,301 2016-08-19
US15/380,778 2016-12-15
US15/380,778 US10331357B2 (en) 2016-08-19 2016-12-15 Tracking stores and loads by bypassing load store units

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050149702A1 (en) * 2003-12-29 2005-07-07 Intel Cororation Method and system for memory renaming
US20060107021A1 (en) * 2004-11-12 2006-05-18 International Business Machines Corporation Systems and methods for executing load instructions that avoid order violations
US9164900B1 (en) * 2012-05-23 2015-10-20 Marvell International Ltd. Methods and systems for expanding preload capabilities of a memory to encompass a register file
US20140337581A1 (en) * 2013-05-09 2014-11-13 Apple Inc. Pointer chasing prediction
US20140380023A1 (en) 2013-06-25 2014-12-25 Advaned Micro Devices, Inc. Dependence-based replay suppression
US20150309792A1 (en) * 2014-04-29 2015-10-29 Apple Inc. Reducing latency for pointer chasing loads

Also Published As

Publication number Publication date
JP7084379B2 (en) 2022-06-14
US10331357B2 (en) 2019-06-25
US20180052613A1 (en) 2018-02-22
CN109564546B (en) 2023-02-17
KR20190033084A (en) 2019-03-28
JP2019525355A (en) 2019-09-05
CN109564546A (en) 2019-04-02
EP3500936A4 (en) 2020-04-22
EP3500936A1 (en) 2019-06-26
KR102524565B1 (en) 2023-04-21

Legal Events

Code Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17841859; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2019510366; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 20197005603; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2017841859; Country of ref document: EP; Effective date: 20190319)