CACHE LINE FLUSH MICRO-ARCHITECTURAL IMPLEMENTATION METHOD AND SYSTEM
BACKGROUND
Field of the Invention
The present invention relates in general to computer architecture, and in particular to a method and system that allow a processor to flush a cache line associated with a linear memory address from all caches in the coherency domain.
Description of the Related Art
A cache memory device is a small, fast memory that is available to contain the most frequently accessed data (or "words") from a larger, slower memory.
Dynamic random access memory (DRAM) provides large amounts of storage capacity at a relatively low cost. Unfortunately, access to dynamic random access memory is slow relative to the processing speed of modern microprocessors. A cost-effective solution for providing cache memory is a static random access memory (SRAM) cache, or a cache memory physically located on the processor. Even though the storage capacity of the cache memory may be relatively small, it provides high-speed access to the data stored therein.
The operating principle behind cache memory is as follows. The first time an instruction or data location is addressed, it must be accessed from the lower speed memory. The instruction or data is then stored in cache memory. Subsequent accesses to the same instruction or data are done via the faster cache memory, thereby minimizing access time and enhancing overall system performance. However, since the storage capacity of the cache is limited, and typically is much smaller than the storage capacity of system memory, the cache is often filled and some of its contents must be changed as new instructions or data are accessed.
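The operating principle above can be sketched as a toy direct-mapped cache model. The sizes, names, and fill policy below are illustrative assumptions, not taken from the embodiments described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a small direct-mapped cache: 64 sets of
 * 32-byte lines.  A lookup is a hit when the set's stored tag
 * matches the tag bits of the address and the line is valid. */
#define LINE_BYTES 32u
#define NUM_SETS   64u

typedef struct {
    uint32_t tag;
    bool     valid;
} cache_set_t;

static cache_set_t cache[NUM_SETS];

static uint32_t set_index(uint32_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }
static uint32_t tag_bits(uint32_t addr)  { return addr / (LINE_BYTES * NUM_SETS); }

/* Returns true on a cache hit; on a miss, fills the line (modeling
 * the slower fetch from main memory) so that later accesses hit. */
bool cache_access(uint32_t addr)
{
    cache_set_t *s = &cache[set_index(addr)];
    if (s->valid && s->tag == tag_bits(addr))
        return true;                 /* cache hit        */
    s->tag = tag_bits(addr);         /* cache miss: fill */
    s->valid = true;
    return false;
}
```

The first access to an address misses and is filled from the slower memory; subsequent accesses to any byte of the same line hit.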
The cache is managed, in various ways, so that it stores the instruction or data most likely to be needed at a given time. When the cache is accessed and contains the requested data, a cache "hit" occurs. Otherwise, if the cache does not contain the requested data, a cache "miss" occurs. Thus, the cache contents are typically managed in an attempt to maximize the cache hit-to-miss ratio.
With current systems, flushing a specific memory address in a cache requires knowledge of the cache memory replacement algorithm.
A cache, in its entirety, may be flushed periodically, or when certain predefined conditions are met. Furthermore, individual cache lines may be flushed as part of a replacement algorithm. In systems that contain a cache, a cache line is the complete data portion that is exchanged between the cache and the main memory. In each case, dirty data is written to main memory. Dirty data is defined as data, not yet written to main memory, in the cache to be flushed or in the cache line to be flushed. Dirty bits, which identify blocks of a cache line containing dirty data, are then cleared. The flushed cache or flushed cache lines can then store new blocks of data.
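The flush step just described can be sketched roughly as follows; the structure and sizes are illustrative assumptions, not the patented design. Dirty blocks are written back to main memory and their dirty bits are cleared.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative-only model of flushing one cache line: every block
 * whose dirty bit is set is written back to (modeled) main memory,
 * then the dirty bits are cleared so the line can hold new data. */
#define BLOCKS_PER_LINE 4u
#define BLOCK_BYTES     8u

typedef struct {
    uint8_t  data[BLOCKS_PER_LINE][BLOCK_BYTES];
    bool     dirty[BLOCKS_PER_LINE];  /* one dirty bit per block  */
    uint32_t base_addr;               /* line's address in memory */
} cache_line_t;

/* Write back dirty blocks and clear their dirty bits.
 * Returns the number of blocks written to memory. */
unsigned flush_line(cache_line_t *line, uint8_t *main_memory)
{
    unsigned written = 0;
    for (unsigned b = 0; b < BLOCKS_PER_LINE; b++) {
        if (line->dirty[b]) {
            memcpy(main_memory + line->base_addr + b * BLOCK_BYTES,
                   line->data[b], BLOCK_BYTES);
            line->dirty[b] = false;   /* dirty bit cleared */
            written++;
        }
    }
    return written;
}
```

A second flush of the same line writes nothing, since all dirty bits were cleared.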
If a cache flush is scheduled or if predetermined conditions for a cache flush are met, the cache is flushed. That is, all dirty data in the cache is written to the main memory.
For the Intel family of P6 microprocessors (e.g., Pentium II, Celeron), for example, there exists a set of micro-operations used to flush cache lines at specified cache levels given a cache set and way; however, there is no such micro-operation to flush a cache line given its memory address.
Systems that require high data throughput continuously flush data as it becomes dirty.
The situation is particularly acute in systems that require high data flow between the processor and system memory, as is the case in high-end graphics pixel manipulation for 3-D and video applications. The problem with current systems is that high bandwidth between the cache and system memory is required to accommodate the copies from write combining memory and write back memory.
Thus, what is needed is a method and system that allow a processor to flush the cache line associated with a linear memory address from all caches in the coherency domain.
SUMMARY
The cache line flush (CLFLUSH) micro-architectural implementation process and system allow a processor to flush a cache line associated with a linear memory address from all caches in the coherency domain. The processor receives a memory address. Once the memory address is received, it is determined whether the memory address is stored within a cache memory. If the memory address is stored within the cache, the memory address is flushed from the cache.
BRIEF DESCRIPTION OF THE DRAWINGS
The inventions claimed herein will be described in detail with reference to the drawings in which reference characters identify correspondingly throughout and wherein: FIG. 1 illustrates a microprocessor architecture; and FIG. 2 flowcharts an embodiment of the cache line flush process.
DETAILED DESCRIPTION
By definition, a cache line is either completely valid or completely invalid; a cache line may never be partially valid. For example, even when the processor only wishes to read one byte, all the bytes of the applicable cache line must be stored in the cache; otherwise, a cache miss will occur. The cache lines form the actual cache memory; a cache directory is used only for cache management. Cache lines usually contain more data than it is possible to transfer in a single bus cycle. For this reason, most cache controllers implement a burst mode, in which pre-set address sequences enable data to be transferred more quickly through a bus. This is used for cache line fills, or for writing back cache lines, because such cache lines represent a continuous and aligned address area.
A technique allows the processor to flush the cache line associated with a linear memory address. Upon execution, the technique flushes the cache line associated with the operand from all caches in the coherency domain. In a multi-processor environment, for example, the specified cache line is flushed from all cache hierarchy levels in all microprocessors in the system (i.e., the coherency domain), depending on processor state. The MESI (Modified, Exclusive, Shared, Invalid) protocol, a write-invalidate protocol, gives every cache line one of four states, which are managed by two MESI bits.
The four states also identify the four possible states of a cache line. If the cache line is found in the "exclusive" or "shared" state, flushing equates to the cache line being invalidated. Another case arises when the cache line is found in the "modified" state: if a cache controller implements a write-back strategy and, on a cache hit, only writes data from the processor to its cache, the cache line content must first be transferred to the main memory, and the cache line is then invalidated.
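The state-dependent flush behavior described above can be sketched as follows. The enum and field names are hypothetical, and the model only records whether a writeback was required.

```c
#include <stdbool.h>

/* Sketch of the per-line flush action implied by the MESI state:
 * Exclusive/Shared lines are simply invalidated, while a Modified
 * line must be written back to main memory first. */
typedef enum { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID } mesi_t;

typedef struct {
    mesi_t state;
    bool   wrote_back;   /* records whether a writeback occurred */
} line_t;

void flush_by_mesi(line_t *line)
{
    switch (line->state) {
    case MESI_MODIFIED:
        line->wrote_back = true;   /* transfer contents to main memory */
        /* fall through: the line is invalidated afterwards */
    case MESI_EXCLUSIVE:
    case MESI_SHARED:
        line->state = MESI_INVALID;
        break;
    case MESI_INVALID:
        break;                     /* nothing cached: no action */
    }
}
```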
When compared to other memory macroinstructions, the cache line flush (CLFLUSH) method is not strongly ordered; regardless of the memory type associated with the CLFLUSH macroinstruction, its behavior in the memory sub-system of the processor is weakly ordered. Other macroinstructions, such as fences and other serializing instructions, can be used immediately prior to and right after CLFLUSH to strongly order memory access loads and stores with respect to it.
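As a usage sketch of this fencing pattern: on compilers targeting x86 with SSE2, the real `_mm_clflush` and `_mm_mfence` intrinsics from `<emmintrin.h>` are available. The fallback no-op macros and the 64-byte line size are assumptions made so the example stays portable and compilable everywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define FLUSH(p)  _mm_clflush(p)   /* real SSE2 intrinsics */
#define FENCE()   _mm_mfence()
#else
/* Non-x86 fallback so the sketch still compiles: no-ops. */
#define FLUSH(p)  ((void)(p))
#define FENCE()   ((void)0)
#endif

/* Flush every line spanned by buf, bracketed by fences so the
 * flushes are strongly ordered against surrounding loads and
 * stores, as the text recommends.  The 64-byte line size is an
 * assumption about the target machine. */
void flush_buffer_ordered(const void *buf, size_t len)
{
    const uintptr_t line = 64;
    uintptr_t p   = (uintptr_t)buf & ~(line - 1);
    uintptr_t end = (uintptr_t)buf + len;

    FENCE();                      /* order against earlier accesses */
    for (; p < end; p += line)
        FLUSH((const void *)p);
    FENCE();                      /* order against later accesses   */
}
```

Flushing writes modified data back to memory but does not alter it; the buffer's contents are unchanged afterwards.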
A micro-operation, named "clflush_micro_op" is used to implement the CLFLUSH macroinstruction.
Moving to FIG. 1, an example microprocessor's memory and bus subsystems are shown with the flow of loads and stores. In FIG. 1, two cache levels are assumed in the microprocessor: an on-chip ("L1") cache being the cache level closest to the processor, and a second level ("L2") cache being the cache level farthest from the processor. An instruction fetch unit 102 fetches macroinstructions for an instruction decoder unit 104.
The decoder unit 104 decodes the macroinstructions into a stream of microinstructions, which are forwarded to a reservation station 106, and a reorder buffer and register file 108. As an instruction enters the memory subsystem, it is allocated in the load buffer 112 or store buffer 114, depending on whether it is a read or a write memory macroinstruction, respectively. In the unit of the memory subsystem where such buffers reside, the instruction goes through memory ordering checks by the memory ordering unit 110. If no memory dependencies exist, the instruction is dispatched to the next unit in the memory subsystem after undergoing the physical address translation. At the L1 cache controller 120, it is determined whether there is an L1 cache hit or miss. In the case of a miss, the instruction is allocated into a set of buffers, from where it is dispatched to the bus sub-system 140 of the microprocessor. In the case of a cacheable load miss, the instruction is sent to read buffers 122, or in the case of a cacheable store miss, the instruction is sent to write buffers 130. The write buffers may be either weakly ordered write combining buffers 132 or non-write combining buffers 134. In the bus controller unit 140, the read or write micro-operation is allocated into an out-of-order queue 144. If the micro-operation is cacheable, the L2 cache 146 is checked for a hit/miss. In the case of a miss, the instruction is sent through an in-order queue 142 to the frontside bus 150 to retrieve or update the desired data from main memory.
The flow of the "clflush_micro_op" micro-operation through the processor memory subsystem is also described in FIG. 2. Initially, the instruction fetch unit 102 retrieves a cache line flush instruction, block 202. In block 204, the cache line flush instruction is decoded into the "clflush_micro_op" micro-operation by the instruction decoder unit 104. The micro-operation is then forwarded to a reservation station 106, and a reorder buffer and register file 108, block 206. The "clflush_micro_op" micro-operation is dispatched to the memory subsystem on a load port, block 208. It is allocated an entry in the load buffer 112 in the memory ordering unit 110. For the split access calculation in the memory ordering unit 110, the data size of the micro-operation is masked to one byte in order to avoid cache line splits; however, upon execution, the whole cache line will be flushed.
The behavior of the "clflush_micro_op" in the memory-ordering unit 110 is speculative. Simply put, this means that the "clflush_micro_op" can execute out of order with respect to other CLFLUSH macroinstructions, loads, and stores. Unless memory access fencing ("MFENCE") instructions are used appropriately (immediately before and after the CLFLUSH macroinstruction), execution of the "clflush_micro_op" with respect to other memory loads and stores is not guaranteed to be in order, provided there are no address dependencies. The behavior of CLFLUSH through the memory subsystem is weakly ordered. The following tables list the ordering constraints on CLFLUSH. Table 1 lists the ordering constraints of later memory access commands compared to an earlier CLFLUSH. Table 2 lists the converse of Table 1, displaying the ordering constraints of earlier memory access commands compared to a later CLFLUSH instruction. The memory access types listed are uncacheable (UC) memory, write back (WB) memory, and uncacheable speculative write combining (USWC) memory accesses.
                 Later access
                 UC memory     WB memory     USWC memory
Earlier access   Load   Store  Load   Store  Load   Store   CLFLUSH   MFENCE
CLFLUSH          N      N      Y      Y      Y      Y       Y         N

Note: N = cannot pass, Y = can pass.

Table 1: Memory ordering of instructions with respect to an older CLFLUSH
Earlier access            Later access: CLFLUSH
UC memory      Load       Y
UC memory      Store      Y
WB memory      Load       Y
WB memory      Store      Y
USWC memory    Load       Y
USWC memory    Store      Y
CLFLUSH                   Y
MFENCE                    N

Note: N = cannot pass, Y = can pass.
Table 2: Memory ordering of instructions with respect to a younger CLFLUSH

From the memory-ordering unit 110, the "clflush_micro_op" micro-operation is dispatched to the L1 cache controller unit 120, block 210. The "clflush_micro_op" micro-operation is dispatched on the load port; however, it is allocated in a write combining buffer 132, as if it were a store. From the L1 cache controller unit forward, the "clflush_micro_op" is switched from the load to the store pipe.
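The "can pass / cannot pass" entries of Table 1 can be encoded as a small lookup; the enum and function names below are hypothetical illustrations of the table's content.

```c
#include <stdbool.h>

/* One way to encode Table 1: may a later access of the given kind
 * pass (execute ahead of) an earlier CLFLUSH?  Per the table, only
 * UC accesses and MFENCE cannot pass. */
typedef enum {
    ACC_UC_LOAD, ACC_UC_STORE,
    ACC_WB_LOAD, ACC_WB_STORE,
    ACC_USWC_LOAD, ACC_USWC_STORE,
    ACC_CLFLUSH, ACC_MFENCE
} access_t;

bool later_may_pass_earlier_clflush(access_t later)
{
    switch (later) {
    case ACC_UC_LOAD:
    case ACC_UC_STORE:
    case ACC_MFENCE:
        return false;   /* "N" entries: cannot pass */
    default:
        return true;    /* "Y" entries: weakly ordered, may pass */
    }
}
```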
Decision block 212 determines whether any write combining buffers 132 are available. If none are available, flow returns to block 210. Otherwise, flow continues into block 214. Regardless of the memory type and whether it hits or misses the L1 cache, a write combining buffer 132 is allocated to service an incoming "clflush_micro_op," block 214. A control field is added to each write combining buffer 132 in the L1 cache controller unit to determine which self-snoop attributes need to be sent to the bus controller 140. This control bit, named "clflush_miss," is set exclusively for a "clflush_micro_op" that misses the L1 cache.
Upon entering the memory sub-system of the microprocessor, several bits of the address that enable cache line access of a "clflush_micro_op" are zeroed out, block 216.
In the Pentium Pro family of microprocessors, these would be the lower five bits of the address (address[4:0]). This is done in both the L1 and L2 cache controller units 120, upon executing the flush command. The zeroing out helps to determine a cache line hit or miss: since only the tag match determines a hit or miss, no byte enable comparison is needed. Note that, by definition, no partial hit is possible; a hit or miss is always a full-line hit or miss. Zeroing out address bits [4:0] also provides an alternative mechanism to the one used in the memory ordering unit 110 to mask line split accesses, in which the data size of the transaction is masked to one byte.
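The masking described above can be sketched directly, assuming the 32-byte line of the Pentium Pro example, so that address bits [4:0] are zeroed and hit/miss becomes a whole-line comparison. The function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lower five bits select a byte within a 32-byte cache line. */
#define LINE_MASK 0x1Fu

/* Zero out address[4:0], leaving only the tag and index bits. */
uint32_t line_base(uint32_t addr)
{
    return addr & ~LINE_MASK;
}

/* Hit/miss is a full-line decision: two addresses refer to the
 * same cache line exactly when their line bases are equal, so no
 * byte enable comparison is needed. */
bool same_line(uint32_t a, uint32_t b)
{
    return line_base(a) == line_base(b);
}
```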
Another control bit added to each write combining buffer 132 in the L1 cache controller unit 120 is used to differentiate between a write combining buffer 132 allocated for a "clflush_micro_op" and one allocated for a write combining store, block 218. This control bit, named "clflush_op," is exclusively set for those write combining buffers allocated to service a "clflush_micro_op". It is used to select the request type and flush attributes sent from the L1 cache controller 120 to the bus controller 140.
In the case of an L1 cache hit, as determined by decision block 222, both "flush L1" and "flush L2" attributes are sent to the bus controller 140 upon dispatch from the L1 cache controller unit 120, blocks 224 and 226. The bus controller 140 contains both the L2 cache 146 and the external bus controller units.
Alternatively, in the case of an L1 cache miss, as determined by decision block 222, the "clflush_miss" control bit is set, and only the "flush L2" attribute is sent, blocks 228 and 232. This helps improve performance by omitting the internal self-snoop to the L1 cache.
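The hit/miss-dependent attribute selection of blocks 222 through 232 can be sketched as follows; the struct and function names are hypothetical, while the control bit and attribute names come from the text.

```c
#include <stdbool.h>

/* Sketch of the flush-attribute selection: on an L1 hit, both
 * flush attributes are sent; on a miss, "clflush_miss" is set and
 * only "flush L2" is sent, skipping the L1 self-snoop. */
typedef struct {
    bool clflush_miss;   /* control bit in the WC buffer          */
    bool flush_l1;       /* self-snoop attribute to bus controller */
    bool flush_l2;
} clflush_request_t;

clflush_request_t make_clflush_request(bool l1_hit)
{
    clflush_request_t r;
    r.clflush_miss = !l1_hit;  /* set exclusively on an L1 miss   */
    r.flush_l1 = l1_hit;       /* only self-snoop L1 on a hit     */
    r.flush_l2 = true;         /* "flush L2" is sent in both cases */
    return r;
}
```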
Upon its dispatch from the memory-ordering unit 110, the "clflush_micro_op" micro-operation is blocked by the L1 cache controller unit 120 if there are no write combining buffers 132 available, block 212. In such a case, it also evicts a write combining buffer 132, as pointed to by the write-combining circular allocation pointer.
This guarantees no deadlock conditions due to the lack of free write combining buffers 132. If blocked, the "clflush_micro_op" is redispatched once the blocking condition is removed. An example of an event that would cause redispatch of the "clflush_micro_op" instruction is the completed eviction of a previously allocated write-combining buffer 132.
The "clflush_micro_op" micro-operation is retired by the memory subsystem upon being allocated into a write-combining buffer 132 in the L1 cache controller 120.
This allows pipelining: subsequent instructions can proceed with execution prior to completion of the "clflush_micro_op" micro-operation. The pipelining improves the overall performance of the system.
There are two methods to evict a write-combining buffer servicing a "clflush_micro_op" micro-operation.
A write combining buffer 132 servicing a "clflush_micro_op" will be evicted by the same eviction conditions that currently apply to write combining buffers 132 in the family of Intel P6 microprocessors. Moreover, fencing macroinstructions also evict a write-combining buffer that services a "clflush_micro_op" micro-operation.
Additionally, some embodiments evict a "clflush_micro_op" exclusively. This is done to avoid leaving a write combining buffer servicing a "clflush_micro_op" stranded (pending) for a long period of time, when the programmer does not want to enforce ordering and a fencing instruction is not used. A control bit, named "clflush_evict," is associated with each write-combining buffer 132 servicing a "clflush_micro_op". This control bit is set when a write combining buffer 132 is allocated to a "clflush_micro_op." Once the "clflush_evict" bit is set, the corresponding write combining buffer is marked for eviction and the control bit is reset, block 230.
This eviction condition applies exclusively to write combining buffers 132 servicing a "clflush_micro_op" micro-operation. It improves performance of programs using CLFLUSH by not allowing "clflush_micro_op" micro-operations to take up the write combining buffer 132 resources for extended periods of time, and consequently, freeing them up for other write combining operations.
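A minimal sketch of the "clflush_evict" lifecycle described above, with hypothetical structure names: the bit is set on allocation, the buffer is immediately marked for eviction, and the bit is then reset so the buffer never lingers.

```c
#include <stdbool.h>

/* Sketch of the WC-buffer control-bit lifecycle for an exclusive
 * "clflush_micro_op" eviction (names of the bits from the text;
 * the struct itself is illustrative). */
typedef struct {
    bool clflush_op;
    bool clflush_evict;
    bool marked_for_eviction;
} wc_buffer_t;

void allocate_for_clflush(wc_buffer_t *b)
{
    b->clflush_op = true;
    b->clflush_evict = true;       /* set on allocation...          */
    b->marked_for_eviction = true; /* buffer marked for eviction... */
    b->clflush_evict = false;      /* ...then the bit is reset      */
}
```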
"clflush_miss"   "clflush_op"   Request type   "Flush L1"   "Flush L2"   New
 control bit      control bit                   attribute    attribute    transaction
 0                0              Non-CLFLUSH    (see note)   (see note)   NO
 0                1              CLFLUSH        YES          YES          YES
 1                0              N/A            N/A          N/A          Illegal combination
 1                1              CLFLUSH        NO           YES          YES

Table 3: Memory to Bus Transactions for CLFLUSH

Note that if "clflush_miss" = "clflush_op" = '0,' the request type is any of the existing transactions in the P6 family of microprocessors (but not CLFLUSH), and the flush attributes will be set/cleared accordingly.
Table 4 below shows the conditions under which the three write combining buffer 132 control bits are set and reset. The "clflush_evict" control bit can only be set after the "clflush_op" control bit. The "clflush_op" control bit will be set on speculative write combining buffer 132 allocations, while "clflush_evict" will exclusively be set on a real write combining buffer 132 allocation for a "clflush_micro_op".
The "clflush_miss" control bit is also set on speculative write combining buffer 132 allocations, if the "clflush_micro_op" misses the L1 cache. Both the "clflush_miss" and "clflush_op" control bits are cleared upon speculative allocation of a write-combining buffer 132 to service any instruction other than a "clflush_micro_op." Functionally, this is similar to clearing such control bits upon deallocation of a write-combining buffer servicing a "clflush_micro_op." In a processor implementation where the same write buffers 130 are shared for write combining and non-write combining micro-operations, the "clflush_miss" and "clflush_op" bits are cleared upon speculative allocation of any write buffer 130, not just a write combining buffer 132.
This behavior ensures that the three control bits can never be set for a write buffer 130 not servicing a "clflush_micro_op." In a processor implementation where all L1 cache controller buffers are shared for both reads and writes, such as in the family of P6 microprocessors, the "clflush_miss" and "clflush_op" control bits only need to be cleared upon allocation of a buffer to service a store, block 234. Buffers allocated to service loads ignore the value of these three new control bits.
Control bit       Set                                         Clear
"clflush_op"      Upon allocation of a write combining        Upon allocation of a write buffer for
                  buffer to service a "clflush_micro_op"      something other than a "clflush_micro_op"
"clflush_evict"   Immediately after allocation of a write     Upon eviction of the write combining
                  combining buffer to service a               buffer (i.e., "WC mode" control bit set)
                  "clflush_micro_op" (i.e., WC buffer
                  allocated, "in use", and "clflush_op"
                  control bit set)
"clflush_miss"    Upon allocation in a write combining        Upon allocation of a write buffer for
                  buffer of a "clflush_micro_op" that         something other than a "clflush_micro_op"
                  misses the L1 cache

Note that all three new WC buffer control bits are cleared upon a "reset" sequence as well.
Table 4: Conditions to set/clear the new control bits of a write-combining buffer in the L1 cache controller

Embodiments may be implemented utilizing the bus controller 140. When a write-combining buffer 132 servicing a "clflush_micro_op" is marked for eviction, it is dispatched to the bus controller 140, block 236. The request sent is the same as if it were for a full line cacheable write combining transaction, except for the self-snoop attributes.
Snooping is used to verify whether a specific memory address is present in the applicable cache. For a "clflush_micro_op" eviction, the bus controller 140 self-snoops the L1 and L2 caches based on the "flush L1" and "flush L2" request attributes, block 250. Furthermore, the bus controller 140 issues a "bus read invalidate line" on the external bus, block 236. If the L1 cache controller unit 120 determines an L1 cache miss, for example, no "flush L1" message is sent. The "bus read invalidate line" transaction flushes hits to the same line in any other caches in the coherency domain.
On the external bus transaction, all byte enables are deasserted, masking the data phase from the core. Decision blocks 238 and 252 determine whether a hit to a modified cache line (HITM) has occurred in another cache within the coherency domain (i.e., not the L1 or L2 caches in the requesting microprocessor). If a HITM occurs, the cache that is hit performs a write back to main memory, and the data is returned to the requesting microprocessor, blocks 244 and 254. The write combining buffer 132 in the L1 cache controller unit 120 remains allocated until completion of the snoop phase and the possible transfer of data back from another cache in the coherency domain, for example, on a HITM on the external bus. Data coming back to the write-combining buffer 132 as a result of the snoop phase or inquiry cycle is ignored, blocks 246 and 248.
All flushes are then completed, and the write combining buffers 132 are deallocated in block 260.
Table 5 below shows how the external bus controller 140 treats all write-combining evictions. The request from the L1 cache to the bus controller 140 for a "clflush_micro_op" eviction, such as for the CLFLUSH macroinstruction, can be overloaded on the same request signals as those for a full line cacheable write combining eviction; however, the self-snoop attributes differ.
Table 5: External bus controller transactions for write combining evictions

Request type            External bus       Transaction   Byte enables         Flush L1    Flush L2       New
                        transaction        length                                                        transaction
Partial cacheable       Read Invalidate    32 byte       All byte enables     NO          NO             NO
write combining                                          asserted
Full line cacheable     Invalidate         32 byte       All byte enables     NO          NO             NO
write combining                                          deasserted
Partial uncacheable     Memory write       <=8 byte      Byte enables as      NO          Only non-      NO
write combining         (write type)                     sent from L1 cache               temporal
                                                         controller unit                  stores that
                                                                                          miss L1 cache
Full line uncacheable   Memory write       32 byte       All byte enables     NO          Only non-      NO
write combining         (writeback type)                 asserted                         temporal
                                                                                          stores that
                                                                                          miss L1 cache
CLFLUSH                 Bus Read           32 byte       All byte enables     Only L1     YES            YES
                        Invalidate                       deasserted           hits

Note: USWC stores are not memory aliased in the P6 family of microprocessors, and therefore, they are not self-snooped.
For testability and debug purposes, a non-user-visible mode bit can be added to enable/disable the CLFLUSH macroinstruction. If disabled, the L1 cache controller unit treats the incoming "clflush_micro_op" micro-operation as a No-Operation opcode ("NOP"), and it never allocates a write-combining buffer 132. This NOP behavior can also be implemented for uncacheable data prefetches.
The previous description of the embodiments is provided to enable any person skilled in the art to make or use the system and method. It is well understood by those in the art that the preceding embodiments may be implemented using hardware, firmware, or instructions encoded on a computer-readable medium. The various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.