US9043556B2 - Optimizing a cache back invalidation policy - Google Patents

Optimizing a cache back invalidation policy Download PDF

Info

Publication number
US9043556B2
Authority
US
United States
Prior art keywords
cache
line
lru
bits
recently used
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US13/723,345
Other versions
US20130111139A1 (en)
Inventor
Ganesh Balakrishnan
Anil Krishna
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Priority to US13/723,345
Publication of US20130111139A1
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KRISHNA, ANIL; BALAKRISHNAN, GANESH
Application granted
Publication of US9043556B2

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0808Multiuser, multiprocessor or multiprocessing cache systems with cache invalidating means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0864Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/123Replacement control using replacement algorithms with age lists, e.g. queue, most recently used [MRU] list or least recently used [LRU] list

Definitions

  • the present invention generally relates to cache memory systems and in particular to replacement policies of cache coherence systems.
  • Caches are typically organized in a set-associative fashion, with each cache divided into a certain number of “sets”. Each set holds one or more cache-lines arranged in one or more “ways”. The “ways” and “sets” of a cache are depicted in the illustration of FIG. 4 .
  • Each set has Least Recently Used (LRU) bits associated with it that describe how recently each cache-line within the set has been used. These LRU bits are used to make replacement decisions when removing a line from a set, in order to make space for a new line.
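As an illustrative sketch only (the class and field names below are hypothetical, not from the patent), the sets/ways organization and LRU-driven replacement just described might be modeled as:

```python
# Hypothetical sketch of one set of a set-associative cache: each set holds
# one cache-line per "way", and per-set recency state orders the ways from
# most to least recently used so a victim can be chosen on replacement.

class CacheSet:
    def __init__(self, num_ways):
        self.ways = [None] * num_ways      # tag stored per way
        self.order = []                    # way indices, MRU first (models the LRU bits)

    def access(self, tag):
        """Return True on a hit; on a miss, install tag over the LRU way."""
        for way, stored in enumerate(self.ways):
            if stored == tag:              # hit: promote this way to MRU
                self.order.remove(way)
                self.order.insert(0, way)
                return True
        # miss: prefer an empty way, else evict the least recently used way
        if None in self.ways:
            victim = self.ways.index(None)
        else:
            victim = self.order.pop()      # LRU way is last in the order
        self.ways[victim] = tag
        if victim in self.order:
            self.order.remove(victim)
        self.order.insert(0, victim)       # the new line is now MRU
        return False
```

A real cache would index into many such sets by address bits; this sketch covers only the per-set replacement decision.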
  • each cache-line may have other associated state bits. For example, if the cache participates in the coherence protocol (for example, the Modified Exclusive Shared Invalid (MESI) protocol), the cache maintains the coherence protocol state bits per cache-line.
  • Caches are, often, hierarchically organized into multiple levels which include Level 1 caches (L1) and Level 2 caches (L2).
  • “inclusion” is maintained between cache levels. Inclusion means that all the data stored in a higher level cache is also present in a lower level cache. For example, all data in the (higher level) L1 caches are present in the (lower level) L2 cache that the L1s share.
  • the main motivation for inclusion is that inclusion makes cache coherency easier to manage. A request from the bus may initiate an inspection/search of a lower level cache only and be assured that the search has not missed any data that the higher level caches may contain.
  • Traditional cache replacement policies result in performance-degrading back-invalidation caused by the policies' eviction of hot cache-lines from caches closer to the cores (e.g., L1) due to back-invalidation from inclusive caches that are further from the cores (e.g., L2).
  • a cache back invalidation (CBI) utility evicts cache-lines by using presence bits to avoid replacing a cache-line in a lower level cache that is also present in a higher level cache. Furthermore, the CBI utility selects the cache-line for eviction from an LRU group. The CBI utility ensures that dormant cache-lines in the higher level caches do not retain corresponding presence bits set in the lower level caches by unsetting the presence bits in the lower level cache when a line is replaced in the higher level cache. Additionally, when a processor core becomes idle, the CBI utility invalidates the corresponding higher level cache by unsetting the corresponding presence bits in the lower level cache.
  • FIG. 1 provides a block diagram representation of a data processing system within which the invention is practiced, according to one embodiment of the invention
  • FIG. 2 illustrates an example of a traditional cache replacement policy in a processor memory sub-system, according to the prior art
  • FIG. 3 illustrates the use of presence bits in a cache hierarchy, according to one embodiment of the invention
  • FIG. 4 illustrates the use of LRU and Pseudo LRU bits within a processor cache, according to one embodiment of the invention
  • FIG. 5 illustrates the modification to LRU bits in response to a hit on a more recently used cache-line of a way pair, according to one embodiment of the invention
  • FIG. 6 illustrates the modification to LRU bits in response to a cache miss, according to one embodiment of the invention
  • FIG. 7 illustrates the modification to LRU bits in response to a hit on a less recently used cache-line of a way pair, according to one embodiment of the invention
  • FIG. 8 illustrates a cache replacement policy which uses presence bits in the lower level cache to reduce back-invalidate traffic in a processor and memory-subsystem, according to one embodiment of the invention
  • FIG. 9 illustrates the use of Pseudo LRU bits and presence bits in selecting cache-lines for eviction, according to one embodiment of the invention.
  • FIG. 10 is a flow chart illustrating the process of enhancing cache replacement and back-invalidation policies, according to one embodiment of the invention.
  • the illustrative embodiments provide a method, a system and a computer program product for enhancing a cache back invalidation policy by utilizing least recently used (LRU) bits and presence bits in selecting cache-lines for eviction.
  • a cache back invalidation (CBI) utility evicts cache-lines by using presence bits to avoid replacing a cache-line in a lower level cache that is also present in a higher level cache. Furthermore, the CBI utility selects the cache-line for eviction from an LRU group.
  • the CBI utility ensures that dormant cache-lines in the higher level caches do not retain corresponding presence bits set in the lower level caches by unsetting the presence bits in the lower level cache when a line is replaced in the higher level cache. Additionally, when a processor core becomes idle, the CBI utility invalidates the corresponding higher level cache by unsetting the corresponding presence bits in the lower level cache.
  • A data processing system (DPS) may be a computer, a portable device such as a personal digital assistant (PDA), a smart phone, and/or other types of electronic devices that may generally be considered processing devices.
  • DPS 100 comprises at least one processor or central processing unit (CPU) connected to system memory 106 via system interconnect/bus 102 .
  • CPU 101 is illustrated with two (2) processing cores, Core-0 103 and Core-1 104 .
  • CPU 101 further comprises at least one higher level cache of which L1 105 is illustrated. Illustrated within CPU 101 is lower level cache L2 109 .
  • Cache controller 112 is also connected to system bus 102 . Also connected to system bus 102 is input/output (I/O) controller 115 , which provides connectivity and control for input devices, of which pointing device (or mouse) 116 and keyboard 117 are illustrated. I/O controller 115 also provides connectivity and control for output devices, of which display 118 is illustrated. Additionally, a multimedia drive 119 (e.g., compact disk read/write (CDRW) or digital video disk (DVD) drive) and USB (universal serial bus) port 121 are illustrated, coupled to I/O controller 115 .
  • Multimedia drive 119 and USB port 121 enable insertion of a removable storage device (e.g., optical disk or “thumb” drive) on which data/instructions/code may be stored and/or from which data/instructions/code may be retrieved.
  • DPS 100 also comprises storage 107 , within/from which data/instructions/code may also be stored/retrieved.
  • DPS 100 is also illustrated with a network interface device (NID) 125 , by which DPS 100 may connect to one or more access/external networks 130 , of which the Internet is provided as one example.
  • The Internet is a worldwide collection of networks and gateways that utilize the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • NID 125 may be configured to operate via wired/or wireless connection to an access point of the network.
  • Network 130 may be an external network such as the Internet or wide area network (WAN), or an internal network such as an Ethernet (local area network—LAN) or a Virtual Private Network (VPN). Connection to the external network 130 may be established with one or more servers 133 , which may also provide data/instructions/code for execution on DPS 100 , in one embodiment.
  • CBI utility 110 may be combined with Coherence Protocol 111 , Pseudo-LRU Policy 113 and Enhanced Cache Replacement Policy (CRP) 124 to provide a single executable component, collectively providing the various functions of each individual software component when the corresponding combined code is executed by the CPU 101 .
  • CBI utility 110 is illustrated and described as a standalone or separate software/firmware component, which provides specific functions, as described below.
  • Servers 133 include a software deploying server.
  • DPS 100 communicates with the software deploying server ( 133 ) via network (e.g., Internet 130 ) using network interface device 125 .
  • CBI utility 110 may be deployed from/on the network, via software deploying server 133 .
  • The software deploying server performs all of the functions associated with the execution of CBI utility 110 . Accordingly, DPS 100 is not required to utilize its own internal computing resources to execute CBI utility 110 .
  • CPU 101 executes CBI utility 110 as well as operating system (OS) 108 , which supports the user interface features of CBI utility 110 .
  • CBI utility 110 generates/provides several graphical user interfaces (GUI) to enable user interaction with, or manipulation of, the functional features of CBI utility 110 .
  • Certain functions supported and/or implemented by CBI utility generate processing logic executed by processor and/or device hardware to complete the implementation of that function.
  • the collective body of code that enables these various features is referred to herein as CBI utility 110 .
  • Among the software code/instructions/logic provided by CBI utility 110 , and which are specific to the invention, are: (a) code/logic for responding to data requests using an enhanced/optimized cache replacement policy which initiates a search for data in a higher level cache; (b) code/logic for selecting for eviction a cache-line from a lower level cache by using one or more of presence bits, presence indicators and LRU bits, according to the optimized cache replacement policy; (c) code/logic for updating LRU bits, according to the optimized cache replacement policy; and (d) code/logic for executing enhanced cache back invalidation.
  • When CPU 101 executes CBI utility 110 , DPS 100 initiates a series of functional processes that enable the above functional features as well as additional features/functionality. These features/functionality are described in greater detail below within the description of FIGS. 3-10 .
  • The components depicted in FIG. 1 may vary.
  • the illustrative components within DPS 100 are not intended to be exhaustive, but rather are representative to highlight essential components that are utilized to implement the present invention.
  • other devices/components may be used in addition to or in place of the hardware depicted.
  • the depicted example is not meant to imply architectural or other limitations with respect to the presently described embodiments and/or the general invention.
  • the data processing system depicted in FIG. 1 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.
  • FIG. 2 illustrates a cache replacement policy in a processor and memory-subsystem, according to the prior art.
  • System 200 comprises processor 202 which further comprises two cores, Core-0 203 and Core-1 204 .
  • Illustrated within Processor 202 is first L1 205 , the independent higher level cache of Core-0 203 .
  • Second L1 207 represents the independent higher level cache of Core-1 204 .
  • Also illustrated within system 200 is lower level (L2) cache 209 .
  • L2 cache 209 is shared by Core-0 203 and Core-1 204 .
  • Processor 202 connects to the memory subsystem (comprising memory controller 216 and memory 206 to which access is controlled by memory controller 216 ) via a memory bus.
  • System 200 illustrates an example of processor 202 with two cores, Core-0 203 and Core-1 204 .
  • System 200 has L2 209 that is shared by a pair of L1s, first L1 205 and second L1 207 , respectively, and may be used to highlight performance issues with traditional cache eviction policies.
  • the chance of eviction of the same/replicated data from the L1 205 is also increased because core0's data in L2 209 is not recent.
  • When Core-1 204 runs a workload that has a large working set, continuous fetching of new lines into L2 209 and Core-1's (second) L1 207 occurs.
  • core-1's data becomes more recent in L2 than core-0's data, eventually causing eviction of core-0's data from L2 209 and core-0's L1 205 due to back-invalidation.
  • Even though the data is heavily reused by Core-0 203 , Core-0's data gets evicted because Core-1 204 “hogs” L2 209 . Consequently, the workload on Core-0 203 pays a significant performance penalty because Core-0 203 experiences many cache misses even though Core-0 203 is actually very cache friendly.
  • Thus, traditional cache replacement policies result in performance-degrading back-invalidation caused by the policies' eviction of hot cache-lines from caches closer to the cores (e.g., L1 205 ) due to back-invalidation from inclusive caches that are further from the cores (e.g., L2 209 ).
  • FIG. 3 illustrates the use of presence bits in a cache hierarchy, according to one embodiment of the invention.
  • FIG. 3 comprises DPS 301 and DPS 100 .
  • caches may be hierarchically organized into multiple levels including Level 1 caches (L1) and Level 2 caches (L2).
  • inclusion is maintained between cache levels. This means that all the data stored in a higher level cache is also present in a lower level cache. For example, all data in the L1 caches is present in the L2 cache that the L1s share.
  • First presence bit- 2 326 has a value of “0”, indicating that no back-invalidate is sent to corresponding L1 307 , whereas second presence bit- 2 324 has a value of “1”, indicating that a back-invalidate is sent to corresponding L1 105 . Therefore, using the presence bits, only the necessary back-invalidates need to be sent to the L1s. Of course, to keep the presence bits up to date in the L2, upon an L1 replacement the corresponding presence bits are unset in the L2. To keep the impact of this extra traffic to the L2 low, and to parallelize the data access due to the L1 miss with the update of the “presence bits” due to the replacement, the presence bits may be maintained in a separate structure in the L2.
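The filtering described above can be sketched as follows (structures and names are assumed for illustration; unsetting bits on an L1 replacement is a separate step, covered later):

```python
# Presence-bit-directed back-invalidation: on an L2 eviction, a
# back-invalidate is sent only to those L1 caches whose presence bit for
# the line is 1, rather than broadcast to every L1.

def back_invalidate(line, presence, l1_caches):
    """presence: {line: bit-vector}; bit i set => line is in l1_caches[i]."""
    sent = []
    bits = presence.pop(line, 0)           # line leaves L2, so drop its bits
    for i, l1 in enumerate(l1_caches):
        if bits & (1 << i):                # only the necessary back-invalidates
            l1.discard(line)
            sent.append(i)
    return sent
```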
  • FIG. 4 illustrates the use of LRU and Pseudo LRU bits within a processor cache, according to one embodiment of the invention.
  • L2 cache 109 comprises 4 “ways” illustrated by way-A 402 , way-B 404 , way-C 406 and way-D 408 , respectively.
  • L2 cache 109 also comprises LRU bits 407 further comprising LRU-2 410 , LRU-1 412 and LRU-0 414 .
  • Pseudo-LRU is an approximation of True LRU used for the eviction of cache-lines.
  • True LRU maintains a true ranking of how recently a cache-line in a cache set is used.
  • True LRU uses “n × log2(n)” bits per set for an n-way cache because each of the n ways uses “log2(n)” bits to encode the corresponding ranking (which ranges in value from 0 through n−1).
  • Pseudo-LRU uses “n ⁇ 1” bits per set for an n-way cache.
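A quick worked check of the two bookkeeping-cost formulas just given:

```python
import math

# True LRU needs n * log2(n) ranking bits per set (log2(n) bits per way),
# while tree Pseudo-LRU needs only n - 1 bits per set.

def true_lru_bits(n):
    return n * int(math.log2(n))

def pseudo_lru_bits(n):
    return n - 1

for n in (4, 8, 16):
    print(n, true_lru_bits(n), pseudo_lru_bits(n))
# e.g. a 4-way set costs 8 bits for true LRU versus 3 bits for Pseudo-LRU
```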
  • L2 cache 109 is illustrated as a single cache set of a 4-way cache.
  • a cache consists of a number of sets which is indicative of the cache size.
  • The scheme indicates 3 possibilities. Of the way pairs A-B and C-D, the policy indicates which way pair is more recently used. Of ways A 402 and B 404 , the policy indicates which way is more recently used. Of ways C 406 and D 408 , the policy indicates which way is more recently used. Thus, in this example of FIG. 4 , the Pseudo-LRU policy/scheme uses only 3 bits, which are described as follows: LRU1 412 “remembers”/indicates which way-pair between pair A-B and pair C-D is LRU; LRU2 410 remembers which way between ways A 402 and B 404 is LRU; and LRU0 414 remembers which way between ways C 406 and D 408 is LRU.
  • LRU0 and LRU2 form, what may be termed, the level-1 bits of Pseudo-LRU, and LRU1 forms the level-2 bits.
  • an 8-way cache requires 7 pseudo-LRU bits, with 4 bits in level-1, 2 bits in level-2 and 1 in level-3.
  • the Pseudo-LRU concept similarly extends to higher levels of associativity, for example, to the level of associativities of a 32-way cache and a 64-way cache.
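One common realization of this tree-of-bits scheme for an arbitrary power-of-two associativity is sketched below. This is standard tree Pseudo-LRU under an assumed bit encoding, offered as an illustration of how the concept scales; it is not claimed to match the patent's exact hardware encoding:

```python
# Tree Pseudo-LRU: the n - 1 bits form a binary tree over the n ways.
# Each internal bit points toward the less recently used half, so following
# the bits from the root finds the pseudo-LRU victim, and an access flips
# the bits on its path to point away from the touched way.

class TreePLRU:
    def __init__(self, num_ways):          # num_ways must be a power of two
        self.n = num_ways
        self.bits = [0] * (num_ways - 1)   # internal nodes, heap-indexed;
                                           # 0 selects the left child

    def victim(self):
        """Follow the tree bits to the pseudo-least-recently-used way."""
        node = 0
        while node < self.n - 1:
            node = 2 * node + 1 + self.bits[node]
        return node - (self.n - 1)         # convert leaf index to way number

    def touch(self, way):
        """On a hit or fill, point each bit on the path at the sibling side."""
        node = way + self.n - 1
        while node > 0:
            parent = (node - 1) // 2
            self.bits[parent] = 0 if node == 2 * parent + 2 else 1
            node = parent
```

For a 32-way or 64-way cache the same structure simply grows to 31 or 63 bits per set.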
  • FIG. 5 , FIG. 6 and FIG. 7 show modifications to the LRU bits in response to cache hits and cache misses.
  • FIG. 5 illustrates the modification to LRU bits in response to a hit on a more recently used cache-line of a way pair, according to one embodiment of the invention.
  • arrows are used to indicate the meaning/value of the LRU bits. The arrows point towards less recently used regions of the cache set.
  • Following the path of the arrows illustrated with a distinctively bolder appearance (depicted by LRU1 504 and LRU2 508 , respectively) leads to way B 404 , indicating that way B 404 is the least recently used way according to pseudo-LRU bits 407 . When a cache hit is detected in way A 402 , there is no change to LRU bits 407 because way B 404 continues to be the least recently used way.
  • CBI utility 110 modifies/updates (one or more of) LRU bits 407 following the replacement of an evicted cache-line (after a cache miss) similar to the method in which updates are performed following a cache hit in the same cache-line.
  • the cache-line represents the most recently used cache-line (following cache-line replacement or cache hit).
  • the LRU cache-line within the LRU group continues to be the least recently used way and there is no change to LRU bits 407 .
  • FIG. 6 illustrates the modification to LRU bits in response to a cache miss, according to one embodiment of the invention.
  • L2 109 of FIG. 6 illustrates a modified set of pseudo-LRU bits following a cache miss in L2 109 of FIG. 5 .
  • Because way B 404 contains the LRU cache-line at the time of the cache miss, the cache-line in way B 404 is evicted and replaced.
  • Way B 404 now contains a line that was just brought in, and therefore becomes the most recently used way.
  • LRU bits 407 are rearranged. All bits that lead to way B 404 are flipped, as illustrated by arrows 602 and 604 , respectively.
  • The intent here is that, since way B 404 is now the most recently used way, LRU2 410 points to way A 402 .
  • the value of LRU1 412 is flipped/changed to point to the less recently used way pair.
  • FIG. 7 illustrates the modification to LRU bits in response to a hit on a less recently used cache-line of a way pair, according to one embodiment of the invention.
  • L2 109 of FIG. 7 illustrates a modified set of pseudo-LRU bits following a cache hit on way C 406 in L2 109 of FIG. 5 .
  • way B 404 continues to be the least recently used way.
  • LRU1 412 and LRU2 410 which lead to way B 404 , are left unchanged.
  • Of ways C 406 and D 408 , way C 406 is currently more recently used and hence the value of LRU0 414 is flipped/changed.
  • Now LRU0 414 points to way D 408 which is indeed the less recently used way amongst way C 406 and D 408 .
  • FIG. 8 illustrates a cache replacement policy which uses presence bits in the lower level cache to reduce back-invalidate traffic in a processor and memory-subsystem, according to one embodiment of the invention.
  • DPS 100 comprises processor 101 which further comprises two cores, Core-0 103 and Core-1 104 . Illustrated within Processor 101 is first L1 105 , the independent higher level cache of Core-0 103 . Second L1 309 represents the independent higher level cache of Core-1 104 . Also illustrated within DPS 100 is lower level (L2) cache 109 . L2 cache 109 is shared by Core-0 103 and Core-1 104 . DPS 100 also includes cache controller 112 .
  • Processor 101 is connected to the memory subsystem (comprising memory controller 804 and memory 106 to which access is controlled by memory controller 804 ).
  • Memory 106 includes CBI utility 110 .
  • the optimized cache replacement/eviction policy may be described with reference to DPS 100 .
  • the optimized replacement policy also applies to systems (having L3 caches, for example) which may include cache levels which are lower than the L2 cache level.
  • the optimized replacement policy is described in relation to the LRU policy.
  • CBI utility 110 executes the data search as outlined in the following enumerated steps.
  • CBI utility 110 searches in L1 105 . If data is available or “hits” in L1 105 , then
  • CBI utility 110 searches for data in L2 109 . If data is available in L2 109 , then
  • FIG. 9 illustrates the use of Pseudo LRU bits and presence bits in selecting cache-lines for eviction, according to one embodiment of the invention.
  • L2 Cache 109 previously defined in FIGS. 4-7 is further illustrated with presence bits 910 .
  • P0 912 is one element of presence bits 910 .
  • Also illustrated in L2 cache 109 is way pair A-B/LRU group 906 and way pair C-D/MRU group 908 .
  • CBI utility 110 utilizes presence bits 910 to indicate which higher level caches a cache-line is present in. Unlike typical caches, where only the LRU bits are used for selection of cache-lines for eviction, CBI utility 110 employs both LRU bits 407 and presence bits 910 in the process of selecting a cache-line for eviction. By contrast, in the traditional LRU policy, the cache-line pointed to by the LRU bits as being least recently used is evicted even if the corresponding presence bit indicates that the cache-line is present in a higher level (L1) cache.
  • Every cache-line is associated with Pseudo-LRU bits 407 and p presence bits (e.g., presence bits 910 ), where p is the number of processors. In L2 cache 109 , “p” has a value of 4.
  • a presence bit of 1 indicates that the cache-line is present in the corresponding L1.
  • CBI utility 110 provides an optimized cache eviction policy.
  • Pseudo-LRU partitions the “n” ways in a set into 2 groups: a more recently used (MRU) group (e.g., MRU 908 ); and a less recently used (LRU) group (e.g., LRU 906 ).
  • the LRU group comprises all the ways in the set that are pointed to by the LRU bits (e.g., A 402 , B 404 ). All other ways comprise the MRU group (e.g., C 406 , D 408 ).
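For the 4-way example, the partition above can be sketched with a single bit (the encoding here is an assumption for illustration: LRU1 = 0 means the bold arrow points at pair A-B):

```python
# The level-2 bit LRU1 selects which way pair forms the less recently used
# (LRU) group; the remaining pair forms the more recently used (MRU) group.

def partition(lru1):
    lru_group = ("A", "B") if lru1 == 0 else ("C", "D")
    mru_group = ("C", "D") if lru1 == 0 else ("A", "B")
    return lru_group, mru_group
```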
  • CBI utility 110 selects from cache-lines that are in LRU group 906 .
  • CBI utility 110 responds to a second cache miss in L2 109 which second cache miss is preceded by a first cache miss in L1 105 by utilizing presence bits (e.g., presence bits 910 ) in the eviction decision and incorporating the pseudo-LRU policy.
  • CBI utility 110 continues the execution of the optimized cache eviction policy as described in the following (continuation of) enumerated steps:
  • FIG. 10 is a flow chart illustrating the process of enhancing cache replacement and back-invalidation policies, according to one embodiment of the invention.
  • FIG. 10 is a flow chart illustrating the method by which the above process of the illustrative embodiment is completed.
  • Although the method illustrated in FIG. 10 may be described with reference to components shown in FIGS. 1 and 3-9 , it should be understood that this is merely for convenience and alternative components and/or configurations thereof can be employed when implementing the various methods. Key portions of the methods may be completed by CBI utility 110 executing on processor 101 within DPS 100 ( FIG. 1 ) and controlling specific operations of/on DPS 100 , and the methods are thus described from the perspective of either/both CBI utility 110 and DPS 100 .
  • CBI utility 110 detects a read request.
  • CBI utility 110 searches for data in L1 105 .
  • CBI utility 110 detects a cache miss in L1 105 , as shown at block 1008 .
  • CBI utility 110 searches for data in L2 109 .
  • CBI utility 110 detects a cache miss in L2 109 , as shown at block 1012 .
  • CBI utility 110 selects a cache-line for eviction from LRU group 906 based on the values of presence bits 910 , as shown at block 1014 .
  • CBI utility 110 updates LRU bits 407 appropriately.
  • CBI utility 110 initiates eviction of an L1 cache-line, as shown at block 1018 .
  • CBI utility 110 initiates appropriate back-invalidations.
  • a new cache-line is brought in to L1(s) and L2 109 , as shown at block 1022 .
  • the presence bits, LRU bits and state bits are appropriately set in L1(s) and L2 109 , as shown at block 1024 .
  • the process ends at block 1026 .
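The flow of blocks above can be sketched end-to-end as follows (all structures and names are assumed for illustration; LRU-bit updates are omitted to keep the sketch short):

```python
# FIG. 10 flow, simplified: a read that misses in L1 and L2 selects an L2
# victim from the LRU group using the presence bits, back-invalidates only
# the L1s that hold the victim, then installs the new line in both levels.

def handle_read(line, core, l1s, l2_ways, presence, lru_group):
    """l2_ways: way index -> line; lru_group: way indices eligible for eviction."""
    if line in l1s[core]:
        return "l1-hit"
    if line in l2_ways.values():                # L2 hit: fill the requesting L1
        l1s[core].add(line)
        presence[line] = presence.get(line, 0) | (1 << core)
        return "l2-hit"
    # double miss: choose an L2 victim from the LRU group, preferring a
    # line that is present in no L1
    victim_way = next((w for w in lru_group
                       if presence.get(l2_ways[w], 0) == 0), lru_group[0])
    old = l2_ways[victim_way]
    bits = presence.pop(old, 0)
    for i, l1 in enumerate(l1s):
        if bits & (1 << i):                     # necessary back-invalidates only
            l1.discard(old)
    l2_ways[victim_way] = line                  # bring the new line into L2 and L1
    l1s[core].add(line)
    presence[line] = 1 << core
    return "miss"
```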
  • the invention applies to any cache level which maintains presence bits to indicate the presence of a cache-line at a higher level of cache.
  • the invention may be similarly extended to any other replacement policy (even non-LRU based policies).
  • the invention uses the presence bits to avoid replacing a line that is still in a higher level cache.
  • While the invention refers to selection of a replacement candidate from the LRU group (less recently used lines) as one embodiment, in another embodiment a replacement candidate may be selected from a group of cache-lines within the LRU-to-MRU order, as long as the cache-line does not have any presence bits set.
  • The invention places a higher priority on a line in a lower level cache (L2) which is stored in any of the higher level caches (L1s) than on a line which is not. Further, given that a line present in any higher level of cache is more important than a line that is not, the invention ensures that stale/dormant lines in the higher level caches do not continue to keep the corresponding presence bits set in the lower level caches. First, the processor cores unset the presence bits in the lower level cache when a line is replaced in the higher level cache. Second, when a processor core goes idle (e.g., completes running a particular program), the core invalidates the corresponding higher level cache (for example, the L1 cache). Thus, the corresponding presence bits in the lower level cache (for example, L2) are unset, preventing those stale lines in the higher level cache from unnecessarily retaining space in the lower level cache.
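The two presence-bit hygiene rules above can be sketched as follows (structures assumed for illustration):

```python
# Rule 1: an L1 replacement clears that line's presence bit in the L2.
# Rule 2: an idling core invalidates its whole L1, so none of its presence
# bits linger in the L2 and retain space for stale lines.

def l1_replaced(line, core, presence):
    if line in presence:
        presence[line] &= ~(1 << core)

def core_went_idle(core, l1s, presence):
    for line in list(l1s[core]):
        l1_replaced(line, core, presence)
    l1s[core].clear()
```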
  • While the presence bits are described as being encoded as one full bit-vector, i.e., one bit per higher level cache that shares the lower level cache (which maintains the presence bits), the invention applies to any other type of presence indicators.
  • coarser vectors may be used in designs embodying the invention. With coarser vectors, a single presence bit indicates the presence of the line in, potentially, more than one higher level cache.
  • although the unsetting of presence bits in the lower level cache on a replacement in the higher level cache is more complex with coarser vectors, the invention, which focuses on the application of presence bits/indicators to eviction decisions, may be applied to designs with coarser vectors as well.
  • one or more of the methods are embodied in a computer readable medium containing computer readable code such that a series of steps are performed when the computer readable code is executed (by a processing unit) on a computing device.
  • certain processes of the methods are combined, performed simultaneously or in a different order, or perhaps omitted, without deviating from the spirit and scope of the invention.
  • the method processes are described and illustrated in a particular sequence, use of a specific sequence of processes is not meant to imply any limitations on the invention. Changes may be made with regards to the sequence of processes without departing from the spirit or scope of the present invention. Use of a particular sequence is therefore, not to be taken in a limiting sense, and the scope of the present invention extends to the appended claims and equivalents thereof.
  • the present invention may be embodied as a method, system, and/or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” “logic”, or “system.” Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in or on the medium.
  • the processes in embodiments of the present invention may be implemented using any combination of software, firmware, microcode, or hardware.
  • the programming code (whether software or firmware) will typically be stored in one or more machine readable storage mediums such as fixed (hard) drives, diskettes, magnetic disks, optical disks, magnetic tape, semiconductor memories such as RAMs, ROMs, PROMs, etc., thereby making an article of manufacture in accordance with the invention.
  • the article of manufacture containing the programming code is used by either executing the code directly from the storage device, by copying the code from the storage device into another storage device such as a hard disk, RAM, etc., or by transmitting the code for remote execution using transmission type media such as digital and analog communication links.
  • the medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium. Further, the medium may be any apparatus that may contain, store, communicate, propagate, or transport the program for use by or in connection with the execution system, apparatus, or device.
  • the methods of the invention may be practiced by combining one or more machine-readable storage devices containing the code according to the described embodiment(s) with appropriate processing hardware to execute the code contained therein.
  • An apparatus for practicing the invention could be one or more processing devices and storage systems containing or having network access (via servers) to program(s) coded in accordance with the invention.
  • the term computer, computer system, or data processing system can be broadly defined to encompass any device having a processor (or processing unit) which executes instructions/code from a memory medium.


Abstract

A method, a system and a computer program product for enhancing a cache back invalidation policy by utilizing least recently used (LRU) bits and presence bits in selecting cache-lines for eviction. A cache back invalidation (CBI) utility evicts cache-lines by using presence bits to avoid replacing a cache-line in a lower level cache that is also present in a higher level cache. Furthermore, the CBI utility selects the cache-line for eviction from an LRU group. The CBI utility ensures that dormant cache-lines in the higher level caches do not retain corresponding presence bits set in the lower level caches by unsetting the presence bits in the lower level cache when a line is replaced in the higher level cache. Additionally, when a processor core becomes idle, the CBI utility invalidates the corresponding higher level cache by unsetting the corresponding presence bits in the lower level cache.

Description

PRIORITY CLAIM
The present application is a continuation of and claims priority from U.S. patent application Ser. No. 12/358,873, filed on Jan. 23, 2009 titled “Optimizing a Cache Back Invalidation Policy,” which is incorporated by reference herein in its entirety and for all purposes.
BACKGROUND
1. Technical Field
The present invention generally relates to cache memory systems and in particular to replacement policies of cache coherence systems.
2. Description of the Related Art
Caches are typically organized in a set-associative fashion, with each cache divided into a certain number of “sets”. Each set holds one or more cache-lines arranged in one or more “ways”. The “ways” and “sets” of a cache are depicted in the illustration of FIG. 4. Each set has Least Recently Used (LRU) bits associated with the set that describe the level of recent use of each cache-line within the set. These LRU bits are used to make replacement decisions when removing a line from a set, in order to make space for a new line. In addition, each cache-line may have other associated state bits. For example, if the cache participates in a coherence protocol (for example, the Modified Exclusive Shared Invalid (MESI) protocol), the cache maintains the coherence protocol state bits per cache-line.
Caches are often hierarchically organized into multiple levels, which include Level 1 caches (L1) and Level 2 caches (L2). Typically, in a multi-processor system (or a processor designed with the intent of being usable in a multi-processor system), “inclusion” is maintained between cache levels. Inclusion means that all the data stored in a higher level cache is also present in a lower level cache. For example, all data in the (higher level) L1 caches is present in the (lower level) L2 cache that the L1s share. The main motivation for inclusion is that inclusion makes cache coherency easier to manage. A request from the bus may initiate an inspection/search of a lower level cache only and be assured that the search has not missed any data that the higher level caches may contain.
In the example of an L2 shared by a few L1 caches, inclusiveness imposes that upon replacement of a cache line in the L2, the line is consequently invalidated in the L1s that share the L2. The invalidation of the cache-line in the L1 is called back-invalidation. Since the size of an L2 is typically much larger than the combined sizes of all the L1s that share the L2, most back-invalidation messages may end up searching the L1 for a line that the L1 does not have.
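Under inclusion, an L2 eviction must back-invalidate the L1s; without per-line presence information this is effectively a broadcast, and most of the resulting L1 searches miss, as the paragraph above notes. A minimal sketch of that broadcast (illustrative names; each L1 is modeled as a set of line addresses):

```python
def back_invalidate_broadcast(evicted_addr, l1_caches):
    """Send a back-invalidate for an evicted L2 line to every L1.

    Returns the number of wasted messages, i.e. L1s that had to
    search for a line they did not hold.
    """
    wasted = 0
    for l1 in l1_caches:              # l1: set of cached addresses
        if evicted_addr in l1:
            l1.discard(evicted_addr)  # genuine invalidation
        else:
            wasted += 1               # searched an L1 lacking the line
    return wasted
```

Because the L2 is typically much larger than the combined L1s, `wasted` dominates; the presence bits introduced later exist precisely to suppress these useless messages.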
In general, traditional cache replacement policies degrade performance because back-invalidation from inclusive caches further from the cores (e.g., L2) evicts hot cache-lines from caches closer to the cores (e.g., L1).
SUMMARY
Disclosed are a method, a system and a computer program product for enhancing a cache back invalidation policy by utilizing least recently used (LRU) bits and presence bits in selecting cache-lines for eviction. A cache back invalidation (CBI) utility evicts cache-lines by using presence bits to avoid replacing a cache-line in a lower level cache that is also present in a higher level cache. Furthermore, the CBI utility selects the cache-line for eviction from an LRU group. The CBI utility ensures that dormant cache-lines in the higher level caches do not retain corresponding presence bits set in the lower level caches by unsetting the presence bits in the lower level cache when a line is replaced in the higher level cache. Additionally, when a processor core becomes idle, the CBI utility invalidates the corresponding higher level cache by unsetting the corresponding presence bits in the lower level cache.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention itself, as well as advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 provides a block diagram representation of a data processing system within which the invention is practiced, according to one embodiment of the invention;
FIG. 2 illustrates an example of a traditional cache replacement policy in a processor memory sub-system, according to the prior art;
FIG. 3 illustrates the use of presence bits in a cache hierarchy, according to one embodiment of the invention;
FIG. 4 illustrates the use of LRU and Pseudo LRU bits within a processor cache, according to one embodiment of the invention;
FIG. 5 illustrates the modification to LRU bits in response to a hit on a more recently used cache-line of a way pair, according to one embodiment of the invention;
FIG. 6 illustrates the modification to LRU bits in response to a cache miss, according to one embodiment of the invention;
FIG. 7 illustrates the modification to LRU bits in response to a hit on a less recently used cache-line of a way pair, according to one embodiment of the invention;
FIG. 8 illustrates a cache replacement policy which uses presence bits in the lower level cache to reduce back-invalidate traffic in a processor and memory-subsystem, according to one embodiment of the invention;
FIG. 9 illustrates the use of Pseudo LRU bits and presence bits in selecting cache-lines for eviction, according to one embodiment of the invention; and
FIG. 10 is a flow chart illustrating the process of enhancing cache replacement and back-invalidation policies, according to one embodiment of the invention.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
The illustrative embodiments provide a method, a system and a computer program product for enhancing a cache back invalidation policy by utilizing least recently used (LRU) bits and presence bits in selecting cache-lines for eviction. A cache back invalidation (CBI) utility evicts cache-lines by using presence bits to avoid replacing a cache-line in a lower level cache that is also present in a higher level cache. Furthermore, the CBI utility selects the cache-line for eviction from an LRU group. The CBI utility ensures that dormant cache-lines in the higher level caches do not retain corresponding presence bits set in the lower level caches by unsetting the presence bits in the lower level cache when a line is replaced in the higher level cache. Additionally, when a processor core becomes idle, the CBI utility invalidates the corresponding higher level cache by unsetting the corresponding presence bits in the lower level cache.
In the following detailed description of exemplary embodiments of the invention, specific exemplary embodiments in which the invention may be practiced are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and equivalents thereof.
Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number. The specific numerals assigned to the elements are provided solely to aid in the description and not meant to imply any limitations (structural or functional or otherwise) on the described embodiment.
It is understood that the use of specific component, device and/or parameter names (such as those of the executing utility/logic described herein) is for example only and not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology utilized to describe the components/devices/parameters herein, without limitation. Each term utilized herein is to be given its broadest interpretation given the context in which that term is utilized.
With reference now to the figures, and beginning with FIG. 1, there is depicted a block diagram representation of an example data processing system (DPS), as utilized within one embodiment. DPS 100 may be a computer, a portable device, such as a personal digital assistant (PDA), a smart phone, and/or other types of electronic devices that may generally be considered processing devices. As illustrated, DPS 100 comprises at least one processor or central processing unit (CPU) connected to system memory 106 via system interconnect/bus 102. In particular, CPU 101 is illustrated with two (2) processing cores, Core-0 103 and Core-1 104. CPU 101 further comprises at least one higher level cache, of which L1 105 is illustrated. Illustrated within CPU 101 is lower level cache L2 109. Cache controller 112 is also connected to system bus 102. Also connected to system bus 102 is input/output (I/O) controller 115, which provides connectivity and control for input devices, of which pointing device (or mouse) 116 and keyboard 117 are illustrated. I/O controller 115 also provides connectivity and control for output devices, of which display 118 is illustrated. Additionally, a multimedia drive 119 (e.g., compact disk read/write (CDRW) or digital video disk (DVD) drive) and USB (universal serial bus) port 121 are illustrated, coupled to I/O controller 115. Multimedia drive 119 and USB port 121 enable insertion of a removable storage device (e.g., optical disk or “thumb” drive) on which data/instructions/code may be stored and/or from which data/instructions/code may be retrieved. DPS 100 also comprises storage 107, within/from which data/instructions/code may also be stored/retrieved.
DPS 100 is also illustrated with a network interface device (NID) 125, by which DPS 100 may connect to one or more access/external networks 130, of which the Internet is provided as one example. In this implementation, the Internet represents/is a worldwide collection of networks and gateways that utilize the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. NID 125 may be configured to operate via wired/or wireless connection to an access point of the network. Network 130 may be an external network such as the Internet or wide area network (WAN), or an internal network such as an Ethernet (local area network—LAN) or a Virtual Private Network (VPN). Connection to the external network 130 may be established with one or more servers 133, which may also provide data/instructions/code for execution on DPS 100, in one embodiment.
In addition to the above described hardware components of DPS 100, various features of the invention are completed/supported via software (or firmware) code or logic stored within memory 106 or other storage (e.g., storage 107) and executed by CPU 101. Thus, for example, illustrated within memory 106 are a number of software/firmware/logic components, including operating system (OS) 108 (e.g., Microsoft Windows®, a trademark of Microsoft Corp, or GNU®/Linux®, registered trademarks of the Free Software Foundation and The Linux Mark Institute), applications 114, Coherence Protocol 111, Pseudo-LRU Policy 113, Enhanced Cache Replacement Policy (CRP) 124 and Cache Back-Invalidation (CBI) utility 110 (which executes on CPU 101 or cache controller 112). In actual implementation, CBI utility 110 may be combined with Coherence Protocol 111, Pseudo-LRU Policy 113 and Enhanced Cache Replacement Policy (CRP) 124 to provide a single executable component, collectively providing the various functions of each individual software component when the corresponding combined code is executed by the CPU 101. For simplicity, CBI utility 110 is illustrated and described as a stand alone or separate software/firmware component, which provides specific functions, as described below.
In one embodiment, servers 133 include a software deploying server, and DPS 100 communicates with the software deploying server (133) via network (e.g., Internet 130) using network interface device 125. Then, CBI utility 110 may be deployed from/on the network, via software deploying server 133. With this configuration, the software deploying server performs all of the functions associated with the execution of CBI utility 110. Accordingly, DPS 100 is not required to utilize internal computing resources of DPS 100 to execute CBI utility 110.
CPU 101 executes CBI utility 110 as well as OS 108, which supports the user interface features of CBI utility 110. In the described embodiment, CBI utility 110 generates/provides several graphical user interfaces (GUI) to enable user interaction with, or manipulation of, the functional features of CBI utility 110. Certain functions supported and/or implemented by CBI utility generate processing logic executed by processor and/or device hardware to complete the implementation of that function. For simplicity of the description, the collective body of code that enables these various features is referred to herein as CBI utility 110. Among the software code/instructions/logic provided by CBI utility 110, and which are specific to the invention, are: (a) code/logic for responding to data requests using an enhanced/optimized cache replacement policy which initiates a search for data in a higher level cache; (b) code/logic for selecting for eviction a cache-line from a lower level cache by using one or more of presence bits, presence indicators and LRU bits, according to the optimized cache replacement policy; (c) code/logic for updating LRU bits, according to the optimized cache replacement policy; and (d) code/logic for executing enhanced cache back invalidation. According to the illustrative embodiment, when CPU 101 executes CBI utility 110, DPS 100 initiates a series of functional processes that enable the above functional features as well as additional features/functionality. These features/functionality are described in greater detail below within the description of FIGS. 3-10.
Those of ordinary skill in the art will appreciate that the hardware components and basic configuration depicted in FIG. 1 may vary. The illustrative components within DPS 100 are not intended to be exhaustive, but rather are representative to highlight essential components that are utilized to implement the present invention. For example, other devices/components may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural or other limitations with respect to the presently described embodiments and/or the general invention. The data processing system depicted in FIG. 1 may be, for example, an IBM eServer pSeries system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.
FIG. 2 illustrates a cache replacement policy in a processor and memory-subsystem, according to the prior art. System 200 comprises processor 202 which further comprises two cores, Core-0 203 and Core-1 204. Illustrated within Processor 202 is first L1 205, the independent higher level cache of Core-0 203. Second L1 207 represents the independent higher level cache of Core-1 204. Also illustrated within system 200 is lower level (L2) cache 209. L2 cache 209 is shared by Core-0 203 and Core-1 204. Processor 202 connects to the memory subsystem (comprising memory controller 216 and memory 206 to which access is controlled by memory controller 216) via a memory bus.
In traditional cache eviction policies, selection of the cache-line for eviction is based upon identifying the LRU line. System 200 illustrates an example of processor 202 with two cores, Core-0 203 and Core-1 204. System 200 has L2 209 that is shared by a pair of L1s, first L1 205 and second L1 207, respectively, and may be used to highlight performance issues with traditional cache eviction policies.
When one of the cores, say core-0 203, runs a workload that is cache-intensive and core-0's data set fits in L1 205, the data remains “hot” (i.e., is actively fetched) in L1 205. However, the data starts “cooling” (i.e., is infrequently fetched) in L2 209 because the LRU bits in L2 209 are never updated (because the data always hits in core-0's L1 205). As a result, the chances of eviction of core-0's hot data from L2 209 are increased. In addition, the chance of eviction of the same/replicated data from L1 205 (due to back-invalidation) is also increased because core-0's data in L2 209 is not recent. When core-1 204 runs a workload that has a large working set, continuous fetching of new lines into L2 209 and core-1's (second) L1 207 occurs. As a result, core-1's data becomes more recent in L2 than core-0's data, eventually causing eviction of core-0's data from L2 209 and core-0's L1 205 due to back-invalidation. So, even though the data is heavily reused by core-0 203, core-0's data gets evicted because core-1 204 “hogs” L2 209. Consequently, the workload on core-0 203 pays a significant performance penalty because core-0 203 experiences a lot of cache misses even though core-0 203 is actually very cache friendly.
In general, traditional cache replacement policies degrade performance because back-invalidation from inclusive caches further from the cores (e.g., L2 209) evicts hot cache-lines from caches closer to the cores (e.g., L1 205).
FIG. 3 illustrates the use of presence bits in a cache hierarchy, according to one embodiment of the invention. FIG. 3 comprises DPS 301 and DPS 100. As shown in DPS 100 and DPS 301, caches may be hierarchically organized into multiple levels including Level 1 caches (L1) and Level 2 caches (L2). In a multi-processor system (or a processor designed with the intent of being usable in a multi-processor system), inclusion is maintained between cache levels. This means that all the data stored in a higher level cache is also present in a lower level cache. For example, all data in the L1 caches is present in the L2 cache that the L1s share.
The main motivation for inclusion is that it makes cache coherency easier to manage. A request from the bus only needs to check a lower level cache and be assured that the check has not missed any data that the higher level caches may contain. Consider the example of an L2 shared by a few L1 caches. Inclusiveness imposes that upon replacement of a cache-line in the L2, the line is invalidated in the L1s that share the L2. This is called back-invalidation. To reduce the back-invalidation traffic, L2s maintain “presence bits” to indicate which L1s have the cache-line. Back-invalidation is illustrated in DPS 301. In DPS 301, a single (higher level cache) L1 308 uses L2 314. In this case, only 1 presence bit 322 (maintained within cache-line 318) suffices in the L2. In DPS 100, there are two (2) L1s (L1 307, L1 105) sharing L2 109, and 2 presence bits (324, 326) are maintained per cache-line in L2 316. In DPS 100, the highlighted cache-line (312) is present in L1 105 and not in L1 307, and the presence bits indicate that presence appropriately. When the highlighted cache-line in the L2 needs to be replaced, a back-invalidate needs to be sent only to L1 105, which has the line, as indicated by the presence bits. In particular, first presence bit 326 has a value of “0”, indicating that no back-invalidate is sent to corresponding L1 307, whereas second presence bit 324 has a value of “1”, indicating that a back-invalidate is sent to corresponding L1 105. Therefore, using the presence bits, only the necessary back-invalidates need to be sent to the L1s. Of course, to keep the presence bits up to date in the L2, upon an L1 replacement the corresponding presence bits are unset in the L2. To keep the impact of this extra traffic to the L2 low, and to parallelize the data access due to the L1 miss and the update of the “presence bits” due to the replacement, presence bits may be maintained in a separate structure in the L2.
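The presence-bit filtering just described might be sketched as follows. This is an illustrative software model of the bookkeeping, not the patent's actual hardware structures:

```python
class L2Line:
    """Illustrative L2 cache-line carrying one presence bit per L1."""
    def __init__(self, tag, n_l1s):
        self.tag = tag
        self.presence = [0] * n_l1s

def fill_into_l1(line, l1_id):
    # The line was copied up into L1 number l1_id.
    line.presence[l1_id] = 1

def l1_replaced(line, l1_id):
    # The L1 replaced the line: keep the L2's presence bits up to date.
    line.presence[l1_id] = 0

def back_invalidate_targets(line):
    # On an L2 eviction, only L1s whose presence bit is set need
    # a back-invalidate message; the rest are skipped entirely.
    return [i for i, bit in enumerate(line.presence) if bit]
```

Mirroring FIG. 3, a line present only in the second L1 yields a single-element target list, so the other L1 never has to search for a line it does not hold.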
FIG. 4 illustrates the use of LRU and Pseudo LRU bits within a processor cache, according to one embodiment of the invention. L2 cache 109 comprises 4 “ways” illustrated by way-A 402, way-B 404, way-C 406 and way-D 408, respectively. L2 cache 109 also comprises LRU bits 407 further comprising LRU-2 410, LRU-1 412 and LRU-0 414.
Pseudo-LRU is an approximation of True LRU used for the eviction of cache-lines. True LRU maintains a true ranking of how recently each cache-line in a cache set is used. True LRU uses n·log2(n) bits per set for an n-way cache because each of the n ways uses log2(n) bits to encode the corresponding ranking (which ranges in value from 0 through n−1). Pseudo-LRU uses n−1 bits per set for an n-way cache. As illustrated, L2 cache 109 is a single cache set in a 4-way cache. In L2 cache 109, 3 (i.e., n−1=4−1=3) bits are used per set to maintain “recency”, i.e., a measure of recent use of a cache-line. A cache consists of a number of sets, which is indicative of the cache size.
In a Pseudo-LRU policy/scheme with 4-ways, the scheme indicates 3 possibilities. Of the way pairs A-B and C-D, the policy indicates which way pair is more recently used. Of ways A 402 and B 404, the policy indicates which way is more recently used. Of ways C 406 and D 408, the policy indicates which way is more recently used. Thus, in this example of FIG. 4, Pseudo-LRU policy/scheme uses only 3 bits which are described as follows: LRU1 412 “remembers”/indicates which way-pair between pair A-B and pair C-D is LRU; LRU2 410 remembers which way between ways A 402 and B 404 is LRU; and LRU0 414 remembers which way between ways C 406 and D 408 is LRU.
LRU0 and LRU2 form, what may be termed, level-1 bits of Pseudo-LRU and LRU1 forms the level-2 bit. Similarly, an 8-way cache requires 7 pseudo-LRU bits, with 4 bits in level-1, 2 bits in level-2 and 1 in level-3. The Pseudo-LRU concept similarly extends to higher levels of associativity, for example, to the associativities of a 32-way cache and a 64-way cache.
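The bit budgets quoted above, n·log2(n) per set for True LRU versus n−1 for Pseudo-LRU, can be checked with a short calculation (function names are illustrative):

```python
import math

def true_lru_bits(n_ways):
    # True LRU: each of the n ways stores a rank in 0..n-1,
    # which takes log2(n) bits, for n * log2(n) bits per set.
    return n_ways * int(math.log2(n_ways))

def pseudo_lru_bits(n_ways):
    # Pseudo-LRU: one bit per internal node of a binary tree
    # over the ways, i.e. n - 1 bits per set.
    return n_ways - 1

for ways in (4, 8, 32, 64):
    print(ways, true_lru_bits(ways), pseudo_lru_bits(ways))
```

For the 4-way set of FIG. 4 this gives 8 bits for True LRU against the 3 pseudo-LRU bits actually stored, and the gap widens quickly with associativity (e.g., 384 versus 63 bits for a 64-way set).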
The following three illustrations (FIG. 5, FIG. 6 and FIG. 7) show modifications to the LRU bits in response to cache hits and cache misses.
FIG. 5 illustrates the modification to LRU bits in response to a hit on a more recently used cache-line of a way pair, according to one embodiment of the invention. In L2 109 of FIG. 5, arrows are used to indicate the meaning/value of the LRU bits. The arrows point towards less recently used regions of the cache set. In particular, the arrows (i.e., following the path of the arrows illustrated with a distinctively bolder appearance and depicted by LRU1 504 and LRU2 508, respectively) lead to way B 404, indicating that way B 404 is the least recently used way according to pseudo-LRU bits 407. When a cache hit is detected in way A 402, there is no change to LRU bits 407 because way B 404 continues to be the least recently used way.
In one embodiment, CBI utility 110 modifies/updates (one or more of) LRU bits 407 following the replacement of an evicted cache-line (after a cache miss) similar to the method in which updates are performed following a cache hit in the same cache-line. In either case (i.e., replacement of an evicted cache-line or a cache hit on the same cache-line, which cache-line represents, for example, the more recently used cache-line of an LRU pair), the cache-line represents the most recently used cache-line (following cache-line replacement or cache hit). Thus, the LRU cache-line within the LRU group continues to be the least recently used way and there is no change to LRU bits 407.
FIG. 6 illustrates the modification to LRU bits in response to a cache miss, according to one embodiment of the invention. L2 109 of FIG. 6 illustrates a modified set of pseudo-LRU bits following a cache miss in L2 109 of FIG. 5. Since way B 404 contains the LRU cache-line during the cache miss, the cache-line in way B 404 is evicted and replaced. As a result, way B 404 now contains a line that was just brought in, and therefore becomes the most recently used way. Hence LRU bits 407 are rearranged. All bits that lead to way B 404 are flipped, as illustrated by arrows 602 and 604, respectively. The intent here is that since way B 404 is now the most recently used way, LRU2 410 points to way A 402. The value of LRU1 412 is flipped/changed to point to the less recently used way pair.
FIG. 7 illustrates the modification to LRU bits in response to a hit on a less recently used cache-line of a way pair, according to one embodiment of the invention. L2 109 of FIG. 7 illustrates a modified set of pseudo-LRU bits following a cache hit on way C 406 in L2 109 of FIG. 5. Following the cache hit in way C 406, way B 404 continues to be the least recently used way. Thus, in L2 109 of FIG. 7, LRU1 412 and LRU2 410, which lead to way B 404, are left unchanged. However, between ways C 406 and D 408, way C 406 is currently more recently used and hence the value of LRU0 414 is flipped/changed. Now LRU0 414 points to way D 408 which is indeed the less recently used way amongst way C 406 and D 408.
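The update rules walked through in FIGS. 5-7 can be captured in a small behavioral model. The class and attribute names below are illustrative: `root` plays the role of LRU1, and the pair bits `ab` and `cd` play the roles of LRU2 and LRU0; each bit "points" at the less recently used side, and an access flips exactly the bits whose arrows lead to the accessed way:

```python
WAYS = "ABCD"

class PseudoLRUSet:
    """Behavioral sketch of the 3-bit pseudo-LRU set of FIGS. 5-7."""
    def __init__(self, root=0, ab=0, cd=0):
        # root: 0 -> pair A-B holds the LRU line, 1 -> pair C-D
        # ab:   0 -> A is LRU within A-B, 1 -> B
        # cd:   0 -> C is LRU within C-D, 1 -> D
        self.root, self.ab, self.cd = root, ab, cd

    def victim(self):
        # Follow the arrows to the least recently used way.
        if self.root == 0:
            return "A" if self.ab == 0 else "B"
        return "C" if self.cd == 0 else "D"

    def touch(self, way):
        # On a hit (or a fill after a miss), flip the bits that
        # lead to the accessed way, as illustrated in FIGS. 5-7.
        i = WAYS.index(way)
        pair = i // 2                       # 0 for A-B, 1 for C-D
        leaf = self.ab if pair == 0 else self.cd
        leaf_points_at_way = (leaf == i % 2)
        # The root bit flips only when the accessed way was the
        # overall LRU victim (FIG. 6); otherwise it is unchanged.
        if self.root == pair and leaf_points_at_way:
            self.root = 1 - self.root
        if leaf_points_at_way:
            if pair == 0:
                self.ab = 1 - self.ab
            else:
                self.cd = 1 - self.cd
```

Starting from the FIG. 5 state (arrows leading to way B), a hit on A changes nothing, a hit on C flips only the C-D bit (FIG. 7), and replacing B flips both bits on the path to B (FIG. 6).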
FIG. 8 illustrates a cache replacement policy which uses presence bits in the lower level cache to reduce back-invalidate traffic in a processor and memory-subsystem, according to one embodiment of the invention. DPS 100 comprises processor 101 which further comprises two cores, Core-0 103 and Core-1 104. Illustrated within Processor 101 is first L1 105, the independent higher level cache of Core-0 103. Second L1 309 represents the independent higher level cache of Core-1 104. Also illustrated within DPS 100 is lower level (L2) cache 109. L2 cache 109 is shared by Core-0 103 and Core-1 104. DPS 100 also includes cache controller 112. Processor 101 is connected to the memory subsystem (comprising memory controller 804 and memory 106 to which access is controlled by memory controller 804). Memory 106 includes CBI utility 110.
The optimized cache replacement/eviction policy may be described with reference to DPS 100. The optimized replacement policy also applies to systems (having L3 caches, for example) which may include cache levels which are lower than the L2 cache level. The optimized replacement policy is described in relation to the LRU policy.
When a core (e.g., Core-0 103) initiates a read request for data, CBI utility 110 executes the data search as outlined in the following enumerated steps.
(1) CBI utility 110 searches in L1 105. If data is available or “hits” in L1 105, then
    • (A) The data is read into the core's internal registers.
    • (B) The LRU bits and state bits for the cache-line are updated in L1 105.
    • (C) CBI utility 110 concludes the data search in L1 105.
(2) If a cache miss occurs in L1 105, CBI utility 110 searches for data in L2 109. If data is available in L2 109, then
    • (A) Data is copied into the corresponding L1 105 by evicting the LRU cache-line.
    • (B) Data is copied into the core's internal registers for further use.
    • (C) The LRU bits and state bits for the cache-line are updated in L1 105 and L2 109.
    • (D) The presence bits are updated in L2 109 for the cache-line that moved into L1 105.
    • (E) The presence bits are also updated for the cache-line evicted from L1 105 to indicate that the cache-line is no longer present in L1 105.
    • (F) CBI utility 110 concludes the data search in L2 109.
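Assuming simplified single-set, fully associative caches, steps (1) and (2) above can be sketched as follows; the `read` function and its data structures are illustrative, not the patent's implementation:

```python
from collections import OrderedDict

L1_WAYS = 2  # illustrative capacity of the single modeled L1 set

def read(addr, l1, l2, presence):
    """Steps (1)-(2): search L1, then L2, refreshing LRU order and presence bits.

    l1, l2:   OrderedDicts mapping address -> data, least recently used first.
    presence: address -> True if the L2 line also sits in the L1 (sketch).
    """
    if addr in l1:                         # (1) L1 hit
        l1.move_to_end(addr)               # (B) line becomes most recently used
        return l1[addr]                    # (A)/(C) data goes to the core
    if addr in l2:                         # (2) L1 miss, L2 hit
        l2.move_to_end(addr)               # (C) refresh L2 LRU state
        if len(l1) >= L1_WAYS:
            evicted, _ = l1.popitem(last=False)  # (A) evict L1's LRU line
            presence[evicted] = False            # (E) no longer present in L1
        l1[addr] = l2[addr]                # (A) copy the line into L1
        presence[addr] = True              # (D) mark the line as present in L1
        return l1[addr]                    # (B)/(F) data goes to the core
    return None                            # miss in both levels: handled in step (3)
```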
Further discussion of the optimized cache replacement policy is enhanced with the illustration of FIG. 9.
FIG. 9 illustrates the use of Pseudo LRU bits and presence bits in selecting cache-lines for eviction, according to one embodiment of the invention. L2 Cache 109 previously defined in FIGS. 4-7 is further illustrated with presence bits 910. P0 912 is one element of presence bits 910. Also illustrated in L2 cache 109 is way pair A-B/LRU group 906 and way pair C-D/MRU group 908.
CBI utility 110 utilizes presence bits 910 to indicate which higher level caches a cache-line is present in. Unlike typical caches, where only the LRU bits are used to select cache-lines for eviction, CBI utility 110 employs both LRU bits 407 and presence bits 910 in the process of selecting a cache-line for eviction. By contrast, in a traditional LRU policy, the cache-line pointed to by the LRU bits as least recently used is evicted even if the corresponding presence bit indicates that the cache-line is present in a higher level (L1) cache.
In L2 cache 109, every cache-line is associated with Pseudo-LRU bits 407 and p presence bits (e.g., presence bits 910), where p is the number of processors. In the example of FIG. 9, “p” has a value of 4 in L2 cache 109. A presence bit of 1 indicates that the cache-line is present in the corresponding L1.
By incorporating presence bits 910 into eviction decisions, CBI utility 110 provides an optimized cache eviction policy. Pseudo-LRU partitions the “n” ways in a set into 2 groups: a more recently used (MRU) group (e.g., MRU 908); and a less recently used (LRU) group (e.g., LRU 906). The LRU group comprises all the ways in the set that are pointed to by the LRU bits (e.g., A 402, B 404). All other ways comprise the MRU group (e.g., C 406, D 408). For the purpose of cache eviction, CBI utility 110 selects from cache-lines that are in LRU group 906.
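The partitioning step can be illustrated with a small helper; `split_groups` is a hypothetical name, and the pairing assumes the 4-way layout of FIG. 9, where the top pseudo-LRU bit selects which pair is the LRU group:

```python
def split_groups(ways, root_bit):
    """Partition a 4-way set into LRU and MRU pairs from the top pseudo-LRU bit.

    root_bit 0 means the first pair is the less recently used group (sketch).
    """
    lru_group = ways[:2] if root_bit == 0 else ways[2:]
    mru_group = [w for w in ways if w not in lru_group]
    return lru_group, mru_group
```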
Referring again to the optimized cache replacement policy of FIG. 8, CBI utility 110 responds to a second cache miss in L2 109 which second cache miss is preceded by a first cache miss in L1 105 by utilizing presence bits (e.g., presence bits 910) in the eviction decision and incorporating the pseudo-LRU policy. CBI utility 110 continues the execution of the optimized cache eviction policy as described in the following (continuation of) enumerated steps:
(3) In response to a cache miss in L1 105 followed by a cache miss in L2 109:
    • (A) The selection of the line to be evicted in the L2 set is made based on the following:
      • (I) CBI utility 110 determines whether the LRU cache-line has the associated presence bit set (e.g., P0 912).
      • (II) If the LRU cache-line (e.g., B 404) does not have the associated presence bit set:
        • (IIa) CBI utility 110 selects the LRU cache-line for eviction; and
        • (IIb) CBI utility 110 updates/modifies the LRU bits in response to a cache miss as described for L2 109 in FIG. 6.
      • (III) If CBI utility 110 previously determined that the LRU cache-line has the associated presence bit set, CBI utility 110 determines whether all (remaining) cache-lines in the LRU group have the corresponding presence bits set.
      • (IV) If all cache-lines in LRU group 906 have the corresponding presence bits set:
        • (IVa) CBI utility 110 selects the LRU cache-line for eviction; and
        • (IVb) CBI utility 110 updates/modifies the LRU bits in response to a cache miss as described for L2 109 in FIG. 6.
      • (V) If CBI utility 110 previously determined that the LRU cache-line has the associated presence bit set and at least one (remaining) cache-line in the LRU group does not have the corresponding presence bit set:
        • (Va) CBI utility 110 randomly selects a cache-line from LRU group 906 that does not have a corresponding presence bit set; and
        • (Vb) CBI utility 110 updates/modifies the LRU bits in response to a cache hit on a less recently used cache-line of a way pair as described for L2 109 in FIG. 7.
    • (B) CBI utility 110 inspects the presence bit for the evicted line in L2 109.
      • (I) If the evicted cache-line is present in any L1, the evicted cache-line is invalidated in the L1(s) to maintain the inclusive policy.
      • (II) If the evicted cache-line is not present in any L1, then no back-invalidates are sent to the L1(s). A cache-line is evicted from the requesting core's L1 using LRU policy of L1, to make space for the incoming line.
    • (C) Once a cache-line is evicted from both L1 105 and L2 109, the new line is brought into L1 105, L2 109 and internal core registers.
    • (D) The presence bits, LRU bits and state bits are appropriately set in L1 105 and L2 109.
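The victim-selection logic of steps 3(A)(I)-(V) can be condensed into one function. This is a minimal sketch, assuming the LRU group is supplied least-recently-used-first and that `presence` summarizes each line's presence bits as a single boolean:

```python
import random

def select_victim(lru_group, presence):
    """Steps 3(A)(I)-(V): choose the L2 eviction victim within the LRU group.

    lru_group: ways of the LRU group, least recently used first.
    presence:  way -> True if that line is also held by some L1 (sketch).
    """
    lru_way = lru_group[0]
    if not presence[lru_way]:            # (II) LRU line is in no L1: evict it
        return lru_way
    not_present = [w for w in lru_group if not presence[w]]
    if not not_present:                  # (IV) whole group is held in L1s
        return lru_way                   #      fall back to the LRU line
    return random.choice(not_present)    # (V) random line absent from all L1s
```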
FIG. 10 is a flow chart illustrating the method of enhancing cache replacement and back-invalidation policies, according to one embodiment of the invention. Although the method illustrated in FIG. 10 may be described with reference to components shown in FIGS. 1, 3-9, it should be understood that this is merely for convenience and alternative components and/or configurations thereof can be employed when implementing the various methods. Key portions of the methods may be completed by CBI utility 110 executing on processor 101 within DPS 100 (FIG. 1) and controlling specific operations of/on DPS 100, and the methods are thus described from the perspective of either/both CBI utility 110 and DPS 100.
The process of FIG. 10 begins at initiator block 1002 and proceeds to block 1004, at which CBI utility 110 detects a read request. At block 1006, CBI utility 110 searches for data in L1 105. CBI utility 110 detects a cache miss in L1 105, as shown at block 1008. At block 1010, CBI utility 110 searches for data in L2 109. CBI utility 110 detects a cache miss in L2 109, as shown at block 1012. CBI utility 110 selects a cache-line for eviction from LRU group 906 based on the values of presence bits 910, as shown at block 1014. At block 1016, CBI utility 110 updates LRU bits 407 appropriately. CBI utility 110 initiates eviction of an L1 cache-line, as shown at block 1018. At block 1020, CBI utility 110 initiates appropriate back-invalidations. A new cache-line is brought in to L1(s) and L2 109, as shown at block 1022. The presence bits, LRU bits and state bits are appropriately set in L1(s) and L2 109, as shown at block 1024. The process ends at block 1026.
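The back-invalidation decision of blocks 1018-1020 can be sketched as follows, assuming each L2 line keeps a per-core presence bit-vector (the names and data layout are illustrative):

```python
def back_invalidate(victim, presence, l1_caches):
    """Blocks 1018-1020 (sketch): back-invalidate only the L1s that actually
    hold the evicted L2 line, as recorded by its per-core presence bits."""
    notified = []
    for core_id, bit in enumerate(presence[victim]):
        if bit:
            l1_caches[core_id].discard(victim)  # invalidate the L1 copy
            notified.append(core_id)
    presence[victim] = [0] * len(presence[victim])
    return notified  # cores that received a back-invalidate
```

When no presence bit is set, the loop sends nothing, which is exactly the traffic reduction the policy targets.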
Although the invention is described with respect to a level 2 cache, the invention applies to any cache level which maintains presence bits to indicate the presence of a cache-line at a higher level of cache. Though described with respect to using pseudo-LRU replacement policy as the base replacement policy, the invention may be similarly extended to any other replacement policy (even non-LRU based policies). The invention uses the presence bits to avoid replacing a line that is still in a higher level cache. Though the invention refers to selection of a replacement candidate from the LRU group (less recently used lines) as one embodiment, in another embodiment, a replacement candidate may be selected from a group of cache-lines within the LRU to MRU order, as long as the cache-line does not have any presence bits set.
The invention places a higher priority on a line in a lower level cache (L2) which is stored in any of the higher level caches (L1s) than on a line which is not. Further, given that a line present in any higher level of cache is more important than a line that is not present in any of the higher level caches, the invention ensures that stale/dormant lines in the higher level caches do not continue to keep the corresponding presence bits set in the lower level caches. Therefore, the processor cores first unset the presence bits in the lower level cache when a line is replaced in the higher level cache. Second, when a processor core goes idle (e.g., completes running a particular program), the core invalidates the entire corresponding higher level cache (for example, the L1 cache). Thus, the corresponding presence bits in the lower level cache (for example, L2) are unset, preventing those stale lines in the higher level cache from unnecessarily retaining space in the lower level cache.
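The idle-core clean-up described above might look like this in a simplified model where each L2 line keeps one presence bit per core (a sketch, not the patent's hardware mechanism):

```python
def core_goes_idle(core_id, l1_lines, presence):
    """On core idle: flush its L1 and clear its presence bits in the L2 (sketch)."""
    for addr in list(l1_lines):
        presence[addr][core_id] = 0  # the L2 line is no longer pinned by this L1
    l1_lines.clear()                 # invalidate the whole higher level cache
```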
Though the presence bits are described as being encoded as one full bit-vector, i.e., one bit per higher level cache that shares the lower level cache (which maintains the presence bits), the invention applies to any other type of presence indicators. For example, to reduce the overhead of presence bits, coarser vectors may be used in designs embodying the invention. With coarser vectors, a single presence bit indicates the presence of the line in, potentially, more than one higher level cache. Although the unsetting of presence bits in the lower level cache on a replacement in the higher level cache is more complex with coarser vectors, the invention, which focuses on the application of presence bits/indicators to eviction decisions, may be applied to designs with coarser vectors as well.
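One possible coarse encoding is shown below; the grouping scheme is an assumption, since the patent leaves the exact coarse-vector design open:

```python
def coarsen(fine_bits, group_size):
    """Collapse a per-L1 presence vector so one bit covers `group_size` L1s
    (one possible coarse encoding; the patent does not fix a specific one)."""
    return [
        1 if any(fine_bits[i:i + group_size]) else 0
        for i in range(0, len(fine_bits), group_size)
    ]
```

With this encoding, a set coarse bit only means the line may be in one of the covered L1s, which is why unsetting bits on an L1 replacement becomes more complex.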
In the flow charts above, one or more of the methods are embodied in a computer readable medium containing computer readable code such that a series of steps are performed when the computer readable code is executed (by a processing unit) on a computing device. In some implementations, certain processes of the methods are combined, performed simultaneously or in a different order, or perhaps omitted, without deviating from the spirit and scope of the invention. Thus, while the method processes are described and illustrated in a particular sequence, use of a specific sequence of processes is not meant to imply any limitations on the invention. Changes may be made with regards to the sequence of processes without departing from the spirit or scope of the present invention. Use of a particular sequence is therefore, not to be taken in a limiting sense, and the scope of the present invention extends to the appended claims and equivalents thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a method, system, and/or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” “logic”, or “system.” Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in or on the medium.
As will be further appreciated, the processes in embodiments of the present invention may be implemented using any combination of software, firmware, microcode, or hardware. As a preparatory step to practicing the invention in software, the programming code (whether software or firmware) will typically be stored in one or more machine readable storage mediums such as fixed (hard) drives, diskettes, magnetic disks, optical disks, magnetic tape, semiconductor memories such as RAMs, ROMs, PROMs, etc., thereby making an article of manufacture in accordance with the invention. The article of manufacture containing the programming code is used by either executing the code directly from the storage device, by copying the code from the storage device into another storage device such as a hard disk, RAM, etc., or by transmitting the code for remote execution using transmission type media such as digital and analog communication links. The medium may be electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Further, the medium may be any apparatus that may contain, store, communicate, propagate, or transport the program for use by or in connection with the execution system, apparatus, or device. The methods of the invention may be practiced by combining one or more machine-readable storage devices containing the code according to the described embodiment(s) with appropriate processing hardware to execute the code contained therein. An apparatus for practicing the invention could be one or more processing devices and storage systems containing or having network access (via servers) to program(s) coded in accordance with the invention. In general, the term computer, computer system, or data processing system can be broadly defined to encompass any device having a processor (or processing unit) which executes instructions/code from a memory medium.
Thus, it is important that while an illustrative embodiment of the present invention is described in the context of a fully functional computer (server) system with installed (or executed) software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of media used to actually carry out the distribution. By way of example, a non exclusive list of types of media, includes recordable type (tangible) media such as floppy disks, thumb drives, hard disk drives, CD ROMs, DVDs, and transmission type media such as digital and analogue communication links.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular system, device or component thereof to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. do not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another.

Claims (18)

What is claimed is:
1. In a data processing system having one or more processors and multiple levels of cache, including a lower level cache and a higher level cache, a method comprising:
detecting a data request at the lower level cache;
in response to a cache miss in the lower level cache, selecting a cache-line for eviction based upon (a) presence bits and (b) less recently used (LRU) bits, wherein said selecting further comprises:
partitioning multiple “cache-ways” of a cache set into a less recently used (LRU) group and a more recently used (MRU) group using one or more of (a) pseudo-LRU bits; and (b) non-LRU based replacement policy parameters;
in response to a cache miss in the lower level cache, initiating a process to determine which cache line is consequently selected for eviction in the lower level cache, based on values of one or more of: (a) a presence bit; (b) a LRU bit or a pseudo LRU bit; and (c) parameters from non-LRU based replacement policies, wherein the initiating comprises: (a) checking the value of the presence bits; (b) in response to the value of the presence bits being set to a first value, receiving an indication that a copy of the cache-line is not present in a corresponding higher level cache; and (c) in response to the value of the presence bits being set to a second value, receiving an indication that a copy of the cache-line is present in the corresponding higher level cache;
identifying a least recently used cache-line in said lower level cache using said pseudo-LRU bits;
in response to the least recently used cache-line not having a corresponding presence bit set to the second value or a group of cache-lines constituting the LRU group not having the corresponding presence bits set to the second value, executing one or more of: (a) a selection of the least recently used cache-line for eviction; (b) replacement of the least recently used cache-line with a new cache-line; and (c) a change to the LRU bits; and
in response to the least recently used cache-line having a corresponding presence bit set to the second value and one or more elements of the LRU group not having the corresponding presence bit set to the second value, executing one or more of: (a) a random selection of a less recently used cache-line from the LRU group for eviction, wherein said less recently used cache-line that is randomly selected is not the least recently used cache-line; (b) replacement of said less recently used cache-line with a new cache-line; and (c) a modification of the LRU bits;
determining whether a copy of the cache-line selected for eviction is present in the higher level cache; and
in response to the copy of the cache-line selected for eviction being present in the higher level cache, invalidating the copy of the cache-line selected for eviction; and updating pseudo-LRU bits.
2. The method of claim 1, wherein said updating further comprises:
in response to (a) a cache hit within a LRU group, wherein said cache hit does not occur in a least recently used cache-way or (b) the eviction of a cache-line from the LRU group followed by a replacement of the evicted cache-line with a new cache-line, wherein the least recently used cache-line is not evicted, modifying a first level-1 LRU bit value to a current level-1 LRU bit value to (1) indicate that a first cache-way corresponding to the cache hit or the replacement of the evicted cache-line with the new cache-line is currently a member of a more recently used (MRU) group and is no longer a member of the LRU group and (2) identify a second cache-way which replaces the first cache-way as a member of the LRU group.
3. The method of claim 1, further comprising:
in response to a cache miss, determining whether to evict the cache-line in the least recently used way by using one or more of presence bits, LRU bits and a cache level location of the cache miss; in response to said cache-line in said least recently used cache-way being selected for eviction: replacing the evicted cache-line with a new cache-line; switching one or more corresponding higher level LRU bit values to indicate the modified and current designation of one or more cache-way pairs, as a result of the eviction of the least recently used cache-line, wherein said current designation is defined for one or more pairs as (a) more recently used (MRU) pairs or (b) less recently used (LRU) pairs; changing lower level LRU bit values to: (a) indicate that a cache-way previously containing the least recently used cache-line and currently containing the new cache-line is currently a more recently used cache way within a MRU group; (b) identify a cache-line as a currently designated LRU cache-line as a result of a replacement of the previously designated least recently used cache-line of a LRU group; and
in response to a cache hit in a MRU group, maintaining a same set of values of the LRU bits.
4. The method of claim 1, wherein said invalidating further comprises:
in response to the selection of the cache-line for eviction from the lower level cache: determining whether one or more copies of the cache-line selected for eviction is present in one or more higher level caches based on an inspection of a presence bit in the lower level cache associated with the selected cache-line; in response to the one or more copies of the selected cache-line being present in corresponding higher level caches, invalidating the one or more copies in corresponding lower level caches to maintain an inclusive policy;
in response to no copies of the selected cache-line being present in corresponding higher level caches, evicting a cache-line from a requesting core's higher level cache using LRU policy for higher level caches in order to make space for a new incoming cache-line;
in response to the eviction of cache-lines from one or more higher level caches and the lower level cache: placing new lines into (a) the one or more higher level caches, (b) the lower level cache, and (c) internal core registers; and setting (a) the respective presence bits in the one or more higher level caches to a second value and (b) the respective LRU bits and state bits in the corresponding one or more higher level caches and the lower level cache, appropriately; and
in response to a replacement of a cache-line in a higher level cache, re-setting the corresponding presence bit in the lower level cache to a first value.
5. The method of claim 1, further comprising: in response to a processor core initiating a process to attain an idle status, invalidating the higher level cache associated with said processor core in order to re-set presence bits in the lower level cache to the corresponding first values, wherein said process is selected from among processes which include a process to conclude an execution of a particular program.
6. A data processing system comprising:
one or more processors;
a cache memory system hierarchically organized into multiple levels including a highest level having one or more level 1 (L1) caches and an adjacent lower level cache having one or more level 2 (L2) caches, wherein the level 1 cache is dedicated to one processor core or shared by multiple processor cores;
a cache controller;
logic which executes on a processor and causes the processor to:
detect a data request at the lower level cache;
in response to a cache miss in the lower level cache: select a cache-line for eviction based upon (a) presence bits and (b) less recently used (LRU) bits; determine whether a copy of the cache-line selected for eviction is present in a higher level cache based on an inspection of a presence bit in the lower level cache associated with the selected cache-line; in response to the copy of the cache-line selected for eviction being present in the higher level cache, invalidate the copy of the cache-line selected for eviction; and update pseudo-LRU bits;
wherein said logic that causes the processor to invalidate the copy of the cache-line further comprises logic that causes the system to: in response to determining that one or more copies of the selected cache-line is present in corresponding higher level caches, invalidate the one or more copies in corresponding lower level caches to maintain an inclusive policy.
7. The data processing system of claim 6, wherein said logic that causes the processor to invalidate the copy of the cache-line further comprises logic that causes the system to:
in response to no copies of the selected cache-line being present in corresponding higher level caches, evict a cache-line from a requesting core's higher level cache using LRU policy for higher level caches in order to make space for a new incoming cache-line.
8. The data processing system of claim 6, wherein said logic that causes the processor to invalidate the copy of the cache-line further comprises logic that causes the system to:
in response to the eviction of cache-lines from one or more higher level caches and the lower level cache: placing new lines into (a) the one or more higher level caches, (b) the lower level cache, and (c) internal core registers; and set (a) the respective presence bits in the one or more higher level caches to the second value and (b) the respective LRU bits and state bits in the corresponding one or more higher level caches and the lower level cache, appropriately; and
in response to a replacement of a cache-line in a higher level cache, re-set the corresponding presence bit in the lower level cache to the first value.
9. The data processing system of claim 6, wherein said logic that causes the processor to select the cache line for eviction further comprises logic that causes the processor to:
partition multiple “cache-ways” of a cache set into a less recently used (LRU) group and a more recently used (MRU) group using one or more of: (a) pseudo-LRU bits; and (b) non-LRU based replacement policy parameters; and
in response to a cache miss in the lower level cache, initiate a process to determine which cache line is consequently selected for eviction in the lower level cache, based on values of one or more of: (a) a presence bit; (b) a LRU bit or a pseudo LRU bit; and (c) parameters from non-LRU based replacement policies; wherein the processor: (a) checks the value of the presence bits; (b) in response to the value of the presence bits being set to a first value, receives an indication that a copy of the cache-line is not present in a corresponding higher level cache; and (c) in response to the value of the presence bits being set to a second value, receives an indication that a copy of the cache-line is present in the corresponding higher level cache.
10. The data processing system of claim 6, wherein said logic that causes the processor to select the cache line for eviction further comprises logic that causes the processor to:
identify a least recently used cache-line in said lower level cache using said pseudo-LRU bits;
in response to the least recently used cache-line not having a corresponding presence bit set to the second value or a group of cache-lines constituting the LRU group having the corresponding presence bits set to the second value, execute one or more of: (a) a selection of the least recently used cache-line for eviction; (b) replacement of the least recently used cache-line with a new cache-line; and (c) a change to the LRU bits.
11. The data processing system of claim 10, wherein said logic that causes the processor to select the cache line for eviction further comprises logic that causes the processor to:
in response to the least recently used cache-line having a corresponding presence bit set to the second value and one or more elements of the LRU group not having the corresponding presence bit set to the second value, execute one or more of: (a) a random selection of a less recently used cache-line from the LRU group for eviction, wherein said less recently used cache-line that is randomly selected is not the least recently used cache-line; (b) replacement of said less recently used cache-line with a new cache-line; and (c) a modification of the LRU bits.
12. The data processing system of claim 6, wherein said logic that causes the processor to update pseudo-LRU bits further comprises logic that causes the processor to:
in response to (a) a cache hit within a LRU group, wherein said cache hit does not occur in a least recently used cache-way or (b) the eviction of a cache-line from the LRU group followed by a replacement of the evicted cache-line with a new cache-line, wherein the least recently used cache-line is not evicted, modify a first level-1 LRU bit value to a current level-1 LRU bit value to (i) indicate that a first cache-way corresponding to the cache hit or the replacement of the evicted cache-line with the new cache-line is currently a member of a more recently used (MRU) group and (ii) identify a second cache-way which replaces the first cache-way as a member of the LRU group.
13. The data processing system of claim 6, wherein said logic that causes the processor to update pseudo-LRU bits further comprises logic that causes the processor to:
in response to the cache-line in the least recently used cache-way being selected for eviction:
replace the evicted cache-line with a new cache-line; switch one or more corresponding higher level LRU bit values to indicate the modified and current designation of one or more cache-way pairs, as a result of the evicted cache-line, wherein said current designation is defined for one or more pairs as (a) more recently used (MRU) pairs or (b) less recently used (LRU) pairs;
change lower level LRU bit values to: (a) indicate that a cache-way previously containing the least recently used cache-line and currently containing the new cache-line is currently a more recently used cache way within a MRU group; and (b) identify a cache-line as a currently designated LRU cache-line as a result of a replacement of the previously designated least recently used cache-line of a LRU group; and
in response to a cache hit in a MRU group, maintain a same set of values of the LRU bits.
14. The data processing system of claim 6, further comprising logic that causes the processor to:
in response to a processor core initiating a process to attain an idle status, invalidate the higher level cache associated with said processor core in order to re-set presence bits in the lower level cache to the corresponding first values, wherein said process is selected from among processes which include a process to conclude an execution of a particular program.
15. A processor chip comprising:
one or more processors each having a higher level cache;
a lower level cache associated with one or more higher level caches; and
a cache controller which includes logic that:
detects a data request at the lower level cache;
in response to a cache miss in the lower level cache, selects a cache-line for eviction based upon (a) presence bits and (b) less recently used (LRU) bits;
determines whether a copy of the cache-line selected for eviction is present in the higher level cache of at least one of the one or more processors based on an inspection of a presence bit in the lower level cache associated with the selected cache-line;
in response to the copy of the cache-line selected for eviction being present in the higher level cache, invalidates the copy of the cache-line selected for eviction; and
updates pseudo-LRU bits; and
in response to a processor core initiating a process to attain an idle status, invalidates the higher level cache associated with said processor core in order to re-set presence bits in the lower level cache to the corresponding first values, wherein said process is selected from among processes which include a process to conclude an execution of a particular program.
16. The processor chip of claim 15, wherein said logic that invalidates the copy of the cache line selected for eviction further comprises logic that:
in response to the selection of the cache-line for eviction from the lower level cache: determines whether one or more copies of the cache-line selected for eviction is present in one or more higher level caches based on an inspection of the presence bit in the lower level cache associated with the selected cache-line; in response to the one or more copies of the selected cache-line being present in corresponding higher level caches, invalidates the one or more copies in corresponding lower level caches to maintain an inclusive policy; in response to no copies of the selected cache-line being present in corresponding higher level caches, evicts a cache-line from a requesting core's higher level cache using LRU policy for higher level caches in order to make space for a new incoming cache-line; in response to the eviction of cache-lines from one or more higher level caches and the lower level cache: places new lines into (a) the one or more higher level caches, (b) the lower level cache and (c) internal core registers; and sets (a) the respective presence bits in the one or more higher level caches to the second value and (b) the respective LRU bits and state bits in the corresponding one or more higher level caches and the lower level cache, appropriately; and in response to a replacement of a cache-line in a higher level cache, re-sets the corresponding presence bit in the lower level cache to the first value.
17. The processor chip of claim 15, wherein said logic that causes the controller to select further comprises logic that causes the controller to:
partition multiple “cache-ways” of a cache set into a less recently used (LRU) group and a more recently used (MRU) group using one or more of: (a) pseudo-LRU bits; and (b) non-LRU based replacement policy parameters;
in response to a cache miss in the lower level cache, initiate a process to determine which cache line is consequently selected for eviction in the lower level cache, based on values of one or more of (a) a presence bit; (b) a LRU bit or a pseudo LRU bit; and (c) parameters from non-LRU based replacement policies, wherein the logic that initiates comprises logic that:
(a) checks the value of the presence bits; (b) in response to the value of the presence bits being set to a first value, receives an indication that a copy of the cache-line is not present in a corresponding higher level cache; (c) in response to the value of the presence bits being set to a second value, receives an indication that a copy of the cache-line is present in the corresponding higher level cache; identifies a least recently used cache-line in said lower level cache using said pseudo-LRU bits; in response to the least recently used cache-line not having a corresponding presence bit set to the second value or a group of cache-lines constituting the LRU group having the corresponding presence bits set to the second value, executes one or more of: (a) a selection of the least recently used cache-line for eviction; (b) replacement of the least recently used cache-line with a new cache-line; and (c) a change to the LRU bits; and in response to the least recently used cache-line having a corresponding presence bit set to the second value and one or more elements of the LRU group not having the corresponding presence bit set to the second value, executes one or more of: (a) a random selection of a less recently used cache-line from the LRU group for eviction, wherein said less recently used cache-line that is randomly selected is not the least recently used cache-line; (b) replacement of said less recently used cache-line with a new cache-line; and (c) a modification of the LRU bits.
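(The following sketch is not part of the patent text. It is an illustrative rendering of the victim-selection logic of claim 17: fall back to plain pseudo-LRU eviction when the LRU line is not cached above, or when every line in the LRU group is; otherwise randomly pick a different less recently used line whose presence bit is clear, so that hot lines are not back-invalidated. The function and parameter names are hypothetical, and a sorted LRU stack stands in for the hardware pseudo-LRU bits.)

```python
import random

def select_victim(lru_order, presence):
    """Pick an eviction victim for one lower-level cache set.

    lru_order : way indices sorted least- to most-recently-used
                (an idealized stand-in for the pseudo-LRU bits)
    presence  : dict way -> True if a copy of that way's line also
                lives in some higher-level cache (presence bit set)
    """
    half = len(lru_order) // 2
    lru_group = lru_order[:half]   # less recently used half of the set
    lru_way = lru_order[0]         # the (pseudo-)least recently used way

    # Plain LRU eviction when the LRU line is not cached above, or when
    # every line in the LRU group is cached above (no better choice).
    if not presence[lru_way] or all(presence[w] for w in lru_group):
        return lru_way

    # Otherwise avoid back-invalidating a line the cores still hold:
    # randomly choose a less recently used way with a clear presence
    # bit (this never picks lru_way, whose bit is set in this branch).
    candidates = [w for w in lru_group if not presence[w]]
    return random.choice(candidates)
```

Because the random choice is confined to the LRU group, the policy still evicts cold lines; it merely steers around the single coldest line when that line is known to be live in a higher-level cache.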
18. The processor chip of claim 15, wherein said logic that updates pseudo-LRU bits further comprises logic that:
in response to (a) a cache hit within a LRU group, wherein said cache hit does not occur in a least recently used cache-way or (b) the eviction of a cache-line from the LRU group followed by a replacement of the evicted cache-line with a new cache-line, wherein the least recently used cache-line is not evicted, modifies a first level-1 LRU bit value to a current level-1 LRU bit value to (i) indicate that a first cache-way corresponding to the cache hit or the replacement of the evicted cache-line with the new cache-line is currently a member of a more recently used (MRU) group and (ii) identify a second cache-way which replaces the first cache-way as a member of the LRU group; in response to the cache-line in the least recently used cache-way being selected for eviction: replaces the evicted cache-line with a new cache-line; switches one or more corresponding higher level LRU bit values to indicate the modified and current designation of one or more cache-way pairs, as a result of the eviction, wherein said current designation is defined for one or more pairs as (a) more recently used (MRU) pairs or (b) less recently used (LRU) pairs; changes lower level LRU bit values to: (a) indicate that a cache-way previously containing the least recently used cache-line and currently containing the new cache-line is currently a more recently used cache-way within a MRU group; and (b) identify a cache-line as a currently designated LRU cache-line as a result of a replacement of the previously designated least recently used cache-line of a LRU group; and in response to a cache hit in a MRU group, maintains a same set of values of the LRU bits.
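(The following sketch is not part of the patent text. It illustrates the kind of tree pseudo-LRU bit updates claim 18 describes, for the minimal case of a 4-way set tracked by three bits: a root bit choosing between way pairs, and one bit per pair. The bit convention and function names are assumptions for illustration: a 0 bit means the left subtree or left way is the less recently used side.)

```python
def plru_touch(bits, way):
    """Update 3 tree pseudo-LRU bits for a 4-way set after 'way' is
    hit or refilled. bits = [root, left, right]; each bit points at
    the less recently used side, so updates point *away* from 'way'."""
    root, left, right = bits
    if way < 2:
        root = 1                       # right pair is now the LRU side
        left = 1 if way == 0 else 0    # other way of the pair is LRU
    else:
        root = 0                       # left pair is now the LRU side
        right = 1 if way == 2 else 0
    return [root, left, right]

def plru_victim(bits):
    """Walk the bits toward the pseudo-least-recently-used way."""
    root, left, right = bits
    if root == 0:
        return 0 if left == 0 else 1
    return 2 if right == 0 else 3
```

Note the asymmetry the claim exploits: touching a way flips the bits on its root-to-leaf path, while a hit in the MRU half that the bits already point away from can leave part of the state unchanged.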
US13/723,345 2009-01-23 2012-12-21 Optimizing a cache back invalidation policy Expired - Fee Related US9043556B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/723,345 US9043556B2 (en) 2009-01-23 2012-12-21 Optimizing a cache back invalidation policy

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/358,873 US8364898B2 (en) 2009-01-23 2009-01-23 Optimizing a cache back invalidation policy
US13/723,345 US9043556B2 (en) 2009-01-23 2012-12-21 Optimizing a cache back invalidation policy

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US12/358,873 Continuation US8364898B2 (en) 2009-01-23 2009-01-23 Optimizing a cache back invalidation policy

Publications (2)

Publication Number Publication Date
US20130111139A1 US20130111139A1 (en) 2013-05-02
US9043556B2 true US9043556B2 (en) 2015-05-26

Family

ID=42355079

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/358,873 Expired - Fee Related US8364898B2 (en) 2009-01-23 2009-01-23 Optimizing a cache back invalidation policy
US13/723,345 Expired - Fee Related US9043556B2 (en) 2009-01-23 2012-12-21 Optimizing a cache back invalidation policy

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US12/358,873 Expired - Fee Related US8364898B2 (en) 2009-01-23 2009-01-23 Optimizing a cache back invalidation policy

Country Status (1)

Country Link
US (2) US8364898B2 (en)


Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9952977B2 (en) * 2009-09-25 2018-04-24 Nvidia Corporation Cache operations and policies for a multi-threaded client
US10152423B2 (en) * 2011-10-31 2018-12-11 International Business Machines Corporation Selective population of secondary cache employing heat metrics
US9348591B2 (en) * 2011-12-29 2016-05-24 Intel Corporation Multi-level tracking of in-use state of cache lines
US8806139B2 (en) 2012-01-20 2014-08-12 International Business Machines Corporation Cache set replacement order based on temporal set recording
US20140136784A1 (en) * 2012-11-09 2014-05-15 International Business Machines Corporation Enhanced cache coordination in a multi-level cache
US8949535B1 (en) * 2013-02-04 2015-02-03 Amazon Technologies, Inc. Cache updating
US9594700B2 (en) * 2013-04-17 2017-03-14 Nvidia Corporation Speculative memory controller
US9176879B2 (en) * 2013-07-19 2015-11-03 Apple Inc. Least recently used mechanism for cache line eviction from a cache memory
WO2016097805A1 (en) 2014-12-14 2016-06-23 Via Alliance Semicoductor Co., Ltd. Cache memory budgeted by ways on memory access type
EP3055775B1 (en) * 2014-12-14 2019-08-21 VIA Alliance Semiconductor Co., Ltd. Cache replacement policy that considers memory access type
WO2016097812A1 (en) * 2014-12-14 2016-06-23 Via Alliance Semiconductor Co., Ltd. Cache memory budgeted by chunks based on memory access type
EP3230874B1 (en) * 2014-12-14 2021-04-28 VIA Alliance Semiconductor Co., Ltd. Fully associative cache memory budgeted by memory access type
WO2016097813A1 (en) * 2014-12-14 2016-06-23 Via Alliance Semiconductor Co., Ltd. Set associative cache memory with heterogeneous replacement policy
US10255183B2 (en) 2015-07-23 2019-04-09 Arteris, Inc. Victim buffer for cache coherent systems
US12026095B2 (en) 2014-12-30 2024-07-02 Arteris, Inc. Cache coherent system implementing victim buffers
CN105228197A (en) * 2015-09-29 2016-01-06 青岛海尔智能家电科技有限公司 The system of selection of buffer memory business device, equipment and system
US10055360B2 (en) 2015-12-19 2018-08-21 Intel Corporation Apparatus and method for shared least recently used (LRU) policy between multiple cache levels
US10102129B2 (en) * 2015-12-21 2018-10-16 Intel Corporation Minimizing snoop traffic locally and across cores on a chip multi-core fabric
WO2017190266A1 (en) * 2016-05-03 2017-11-09 华为技术有限公司 Method for managing translation lookaside buffer and multi-core processor
EP3258383A1 (en) * 2016-06-13 2017-12-20 Advanced Micro Devices, Inc. Dynamically adjustable inclusion bias for inclusive caches
US20170357596A1 (en) * 2016-06-13 2017-12-14 Advanced Micro Devices, Inc. Dynamically adjustable inclusion bias for inclusive caches
EP3258382B1 (en) * 2016-06-14 2021-08-11 Arm Ltd A storage controller
US10326855B2 (en) 2016-07-28 2019-06-18 International Business Machines Corporation Trending topic driven cache eviction management
EP3572946B1 (en) * 2017-03-08 2022-12-07 Huawei Technologies Co., Ltd. Cache replacement method, device, and system
US10417135B2 (en) * 2017-09-28 2019-09-17 Intel Corporation Near memory miss prediction to reduce memory access latency
US10528483B2 (en) * 2017-10-23 2020-01-07 Advanced Micro Devices, Inc. Hybrid lower-level cache inclusion policy for cache hierarchy having at least three caching levels
US10534721B2 (en) * 2017-10-23 2020-01-14 Advanced Micro Devices, Inc. Cache replacement policy based on non-cache buffers
US10891228B2 (en) * 2018-02-12 2021-01-12 International Business Machines Corporation Cache line states identifying memory cache
CN109189592B (en) * 2018-08-31 2022-03-25 重庆大学 Shared cache replacement algorithm and device based on hybrid memory structure
US11010306B2 (en) * 2018-10-22 2021-05-18 Arm Limited Apparatus and method for managing a cache hierarchy
US11216387B2 (en) * 2019-09-16 2022-01-04 Taiwan Semiconductor Manufacturing Company, Ltd. Hybrid cache memory and method for controlling the same


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6598123B1 (en) 2000-06-28 2003-07-22 Intel Corporation Snoop filter line replacement for reduction of back invalidates in multi-node architectures
US7386679B2 (en) 2004-04-15 2008-06-10 International Business Machines Corporation System, method and storage medium for memory management
US20070005899A1 (en) 2005-06-30 2007-01-04 Sistla Krishnakanth V Processing multicore evictions in a CMP multiprocessor

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4797814A (en) * 1986-05-01 1989-01-10 International Business Machines Corporation Variable address mode cache
US6202129B1 (en) * 1998-03-31 2001-03-13 Intel Corporation Shared cache structure for temporal and non-temporal information using indicative bits
US6240489B1 (en) * 1999-02-24 2001-05-29 International Business Machines Corporation Method for implementing a pseudo least recent used (LRU) mechanism in a four-way cache memory within a data processing system
US20030159003A1 (en) * 2001-10-23 2003-08-21 Ip-First, Llc Associative cache memory with replacement way information integrated into directory
US20030135696A1 (en) * 2002-01-17 2003-07-17 Rankin Linda J. Pseudo least-recently-used (PLRU) replacement method for a multi-node snoop filter
US6959364B2 (en) * 2002-06-28 2005-10-25 Intel Corporation Partially inclusive snoop filter
US20040003184A1 (en) * 2002-06-28 2004-01-01 Safranek Robert J. Partially inclusive snoop filter
US7096323B1 (en) * 2002-09-27 2006-08-22 Advanced Micro Devices, Inc. Computer system with processor cache that stores remote cache presence information
US6868485B1 (en) * 2002-09-27 2005-03-15 Advanced Micro Devices, Inc. Computer system with integrated directory and processor cache
US20040078526A1 (en) * 2002-10-21 2004-04-22 Silicon Graphics, Inc. System and method for hierarchical approximation of least recently used replacement algorithms
US20040143708A1 (en) * 2003-01-21 2004-07-22 Paul Caprioli Cache replacement policy to mitigate pollution in multicore processors
US20050071566A1 (en) * 2003-09-30 2005-03-31 Ali-Reza Adl-Tabatabai Mechanism to increase data compression in a cache
US7373466B1 (en) * 2004-04-07 2008-05-13 Advanced Micro Devices, Inc. Method and apparatus for filtering memory write snoop activity in a distributed shared memory computer
US20080215820A1 (en) * 2004-04-07 2008-09-04 Conway Patrick N Method and apparatus for filtering memory write snoop activity in a distributed shared memory computer
US20060064547A1 (en) * 2004-09-23 2006-03-23 Sailesh Kottapalli Method and apparatus for run-ahead victim selection to reduce undesirable replacement behavior in inclusive caches
US7669009B2 (en) * 2004-09-23 2010-02-23 Intel Corporation Method and apparatus for run-ahead victim selection to reduce undesirable replacement behavior in inclusive caches
US20090083489A1 (en) * 2005-02-10 2009-03-26 Leo James Clark L2 cache controller with slice directory and unified cache structure
US20090043966A1 (en) * 2006-07-18 2009-02-12 Xiaowei Shen Adaptive Mechanisms and Methods for Supplying Volatile Data Copies in Multiprocessor Systems
US20080320235A1 (en) * 2007-06-22 2008-12-25 Microsoft Corporation Processor cache management with software input via an intermediary
US7925834B2 (en) * 2007-12-29 2011-04-12 Intel Corporation Tracking temporal use associated with cache evictions

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230004331A1 (en) * 2014-02-24 2023-01-05 Kioxia Corporation Nand raid controller
US20180276125A1 (en) * 2017-03-27 2018-09-27 Nec Corporation Processor
US10565111B2 (en) * 2017-03-27 2020-02-18 Nec Corporation Processor
US20230143181A1 (en) * 2019-08-27 2023-05-11 Micron Technology, Inc. Write buffer control in managed memory system

Also Published As

Publication number Publication date
US20130111139A1 (en) 2013-05-02
US8364898B2 (en) 2013-01-29
US20100191916A1 (en) 2010-07-29

Similar Documents

Publication Publication Date Title
US9043556B2 (en) Optimizing a cache back invalidation policy
CN108701092B (en) Self-healing coarse grain snoop filter
US10725923B1 (en) Cache access detection and prediction
JP4226057B2 (en) Method and apparatus for pre-sacrificial selection to reduce undesirable replacement behavior in an inclusive cache
US7925840B2 (en) Data processing apparatus and method for managing snoop operations
JP4486750B2 (en) Shared cache structure for temporary and non-temporary instructions
US8176255B2 (en) Allocating space in dedicated cache ways
US20080022049A1 (en) Dynamically re-classifying data in a shared cache
US9372803B2 (en) Method and system for shutting down active core based caches
KR20190058316A (en) Systems and methods for efficient cache line handling based on predictions
US7194586B2 (en) Method and apparatus for implementing cache state as history of read/write shared data
US20100217937A1 (en) Data processing apparatus and method
KR102453192B1 (en) Cache entry replacement based on availability of entries in other caches
US20160062905A1 (en) Hierarchical cache structure and handling thereof
KR19980079433A (en) Method and system for implementing cache coherency mechanism for use in non-cache cache hierarchy
US10282295B1 (en) Reducing cache footprint in cache coherence directory
WO2018161272A1 (en) Cache replacement method, device, and system
KR20180122969A (en) A multi processor system and a method for managing data of processor included in the system
US6345339B1 (en) Pseudo precise I-cache inclusivity for vertical caches
US11526449B2 (en) Limited propagation of unnecessary memory updates
US8473686B2 (en) Computer cache system with stratified replacement
EP4409418A1 (en) Re-reference interval prediction (rrip) with pseudo-lru supplemental age information
JP5976225B2 (en) System cache with sticky removal engine
US7543112B1 (en) Efficient on-chip instruction and data caching for chip multiprocessors
US9053030B2 (en) Cache memory and control method thereof with cache hit rate

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BALAKRISHNAN, GANESH;KRISHNA, ANIL;SIGNING DATES FROM 20140630 TO 20141125;REEL/FRAME:034266/0509

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20190526