WO2001044947A1 - Method and apparatus for monitoring a cache for garbage collection - Google Patents

Method and apparatus for monitoring a cache for garbage collection

Info

Publication number
WO2001044947A1
WO2001044947A1 (application PCT/US2000/033439)
Authority
WO
WIPO (PCT)
Prior art keywords
cache
objects
flush
memory
local
Prior art date
Application number
PCT/US2000/033439
Other languages
French (fr)
Inventor
Timothy Heil
Mario Wolczko
Original Assignee
Sun Microsystems, Inc.
Priority date
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to AU22573/01A priority Critical patent/AU2257301A/en
Publication of WO2001044947A1 publication Critical patent/WO2001044947A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G06F12/0253 Garbage collection, i.e. reclamation of unreferenced memory
    • G06F12/0269 Incremental or concurrent garbage collection, e.g. in real-time systems
    • G06F12/0276 Generational garbage collection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method and apparatus for monitoring a cache for garbage collection are described. In a computer system comprising a cache and memory, a software and/or hardware flush monitor monitors cache flushes of dirty cache lines to memory, whereas cache flushes of clean lines and cache line fills are performed separately by hardware to permit cache optimizations normally precluded by software handlers. The flush monitor implements a write barrier between the cache and memory, scanning dirty cache lines for references to objects within the cache. One or more flush buffers may be used to temporarily store dirty cache lines before those dirty cache lines are flushed to memory. Multiple cache lines may then be handled by a single pass of the flush monitor. Alternatively, copies of flushed cache lines may be stored in a buffer for deferred handling by the flush monitor. Within the cache, objects are marked as non-local objects if those objects are at least partially resident in memory or have been referenced from memory. The marking of non-local objects enables garbage collection of first generation objects to be performed within the cache without accessing objects in memory. For example, local objects that are not referenced directly or indirectly from a root set of local objects, or from non-local objects within the cache, may be collected.

Description

METHOD AND APPARATUS FOR MONITORING A CACHE FOR GARBAGE COLLECTION
BACKGROUND OF THE INVENTION
1. FIELD OF THE INVENTION
This invention relates to the field of computer memory management, and, more specifically, to garbage collection processes in computer memory.
2. BACKGROUND ART
One aspect of memory management in any computer system is garbage collection. Garbage collection (GC) refers to the process of reclaiming data storage resources (e.g., cache, main memory, etc.) that are no longer in use by the system or any running applications. In an object-oriented system, for example, garbage collection is typically carried out to reclaim storage resources allocated to objects and other data structures (e.g., arrays, etc.) that are no longer referenced by an application. The reclaimed storage can then be re-allocated to store new objects or data structures.
An object is a programming unit that groups together a data structure (one or more instance variables) and the operations (methods) that can use or affect that data. An object can be instructed to perform one of its methods when it receives a "message" from another object. A message tells the receiving object what operations to perform. Objects contain references (also referred to herein as pointers) to other objects to facilitate inter-object messaging for method invocations or requests. With these references, an object web is formed which may be traversed by following the object references. Once an object is no longer part of an active web, that object is unreachable and inactive, and thus may be collected as garbage.
Garbage collection schemes generally treat all memory as a uniform storage resource, assuming from a software point of view that each object or data structure is stored in the same manner as every other object or data structure. However, when implemented within a computer system's physical memory, particularly in a virtual memory environment comprising several levels of physical memory with disparate access parameters, garbage collection suffers from several performance penalties.
Most garbage collection schemes require that some form of "reachability analysis" be performed. Reachability analysis refers to the act of determining the set of objects that may be reached (i.e., are referenced directly or indirectly) from a root set of objects. This analysis may be performed by examining an object for references to other objects and tracing those references to locate the other objects. The tracing of references continues from those other objects until no new references are found (i.e., the object web is completely traced). Unfortunately, due to poor spatial locality of objects in physical storage, tracing object references during garbage collection can result in inefficient memory performance. Those inefficiencies slow down the garbage collection process and may impact the performance of other processes within the system.
To provide a better understanding of the problems associated with implementing garbage collection in a computer system, an overview of garbage collection and a virtual memory hierarchy are provided below.
Garbage Collection
One standard scheme for performing garbage collection is referred to as a "mark and sweep" garbage collection process. Many garbage collectors employ some variation of the "mark and sweep" process as herein described. In the "mark and sweep" process, a root set of objects is initially determined which represent those objects known (or assumed) to be active. Each element of the root set is marked, and iterative reachability analysis is performed to determine those objects reachable from the root set, i.e., those other objects that are referenced either directly or indirectly by one or more elements of the root set. Those objects that are reachable are also marked. A sweep is then carried out on all objects under consideration, and those objects that have not been marked are collected (e.g., by placing those unmarked objects or their respective storage resources on a free list for new allocation).
Figure 1 is a flow diagram of a "mark and sweep" collection process. In step 100, the root set of objects is determined. The root set comprises those objects that the garbage collection process assumes are live objects, such as objects referenced from processor registers just prior to initiation of the collection process. Those objects included in the root set are marked as live objects in step 101. To begin the reachability analysis, a first marked object (e.g., from the root set) is selected for analysis in step 102. In step 103, the current object under analysis is scanned for references to other objects, and, in step 104, those objects referenced by the current object are marked as live. In step 105, if further marked objects remain unscanned, the process selects an unscanned, marked object in step 106, and returns to step 103 to continue reachability analysis on the selected object. If, in step 105, all marked objects have been scanned, the process continues at step 107. In step 107, the garbage collector sweeps all unmarked (and thus unreachable) objects. The step of sweeping may comprise, for example, adding the addresses of the swept objects to a list of free storage locations. In step 108, the marker on each of the marked objects is reset in preparation for a subsequent garbage collection cycle.
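For illustration only, the process of Figure 1 may be summarized in code. The following is a minimal C sketch assuming a simplified object representation; the `Object` layout, the `free_list`, and all helper names are illustrative and not part of the disclosed embodiments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative object layout: a mark flag plus an array of references. */
typedef struct Object {
    bool            marked;
    size_t          num_refs;
    struct Object **refs;       /* references to other objects */
    struct Object  *free_next;  /* link for the free list (step 107) */
} Object;

static Object *free_list = NULL;

/* Steps 101-106: mark an object and trace its references. */
static void mark(Object *obj) {
    if (obj == NULL || obj->marked)
        return;                      /* already marked, or no object */
    obj->marked = true;              /* steps 101 and 104 */
    for (size_t i = 0; i < obj->num_refs; i++)
        mark(obj->refs[i]);          /* steps 103-106: reachability */
}

/* Steps 107-108: sweep unmarked objects and reset the marks. */
static void sweep(Object **heap, size_t heap_size) {
    for (size_t i = 0; i < heap_size; i++) {
        if (!heap[i]->marked) {      /* unreachable: collect (step 107) */
            heap[i]->free_next = free_list;
            free_list = heap[i];
        } else {
            heap[i]->marked = false; /* step 108: reset for next cycle */
        }
    }
}

void mark_and_sweep(Object **roots, size_t num_roots,
                    Object **heap, size_t heap_size) {
    for (size_t i = 0; i < num_roots; i++)
        mark(roots[i]);              /* steps 100-102 */
    sweep(heap, heap_size);
}
```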
The mark and sweep process appears straightforward, yet the reachability analysis performed in steps 103-106 is memory intensive. Each object involved in the analysis must be accessed from physical memory. As will be described below, such memory access operations may prove complex and inefficient when carried out within the context of a virtual memory hierarchy.
Memory Hierarchy
In a computer system implementing virtual memory, the memory used by active applications comprises two or more levels of storage components. In most current systems, the levels of storage components comprise cache memory, main memory (RAM) and mass storage. The cache memory itself may also comprise one or more levels (e.g., L1, L2, etc.) located on-chip and/or off-chip with respect to the processor. The capacity of each storage component (e.g., number of megabytes of storage) is typically dependent upon hardware design factors such as size, performance parameters (e.g., average access time), and cost. There may exist, for example, as much as one or two orders of magnitude difference in storage space and access time between each level of storage components, typically with cache being the smallest and fastest and mass storage being the largest and slowest. Ideally, virtual memory systems provide application memory with the large storage capability of a mass storage device and access performance approaching that of cache memory.
Figure 2 is a block diagram illustrating an example memory configuration. In Figure 2, processor 200 is coupled to a level one (L1) cache 203, which is in turn coupled to a level two (L2) cache 204. In this example, L1 and L2 caches 203 and 204 are on-chip with processor 200. L2 cache 204 is coupled off-chip to a level three (L3) cache 205. L3 cache 205 is coupled to main memory (e.g., RAM: random access memory) 206, which is further coupled to a mass storage device 207, such as a magnetic disk drive.
Data is exchanged between mass storage 207 and main memory 206 in the form of memory pages, a number of which may reside in main memory 206 at any time. Data is exchanged between main memory 206 and L3 cache 205, and between any of the L1-L3 caches, in the form of cache lines. The size of respective cache lines may vary for different caches and cache levels. Data is exchanged between the lowest level cache (e.g., L1 cache 203) and processor 200 in the form of data words (e.g., 32 or 64-bit data words).
In a best case data access scenario, desired data is located in cache memory (e.g., L1-L3) providing the quickest data access performance. If the desired data is not within cache memory, but is resident within main memory 206, data access will be delayed by the time required to load the cache line containing the desired data from main memory 206 into the cache (herein referred to as a "cache line fill"). Further, if the data is also not in main memory 206, a further delay is incurred while the relevant page of data is loaded from mass storage 207 into main memory 206. These delays include time spent identifying the relevant page in mass storage 207 or the relevant cache line of data in main memory 206. Identification can include address translation depending on whether the respective level of the memory hierarchy is virtually or physically addressable.
When a cache line is loaded into cache memory (203-205), another cache line within the cache memory may need to be evicted to make room for the new cache line. If the evicted cache line has not been modified by its associated application, the evicted cache line may be discarded without concern. However, if the evicted cache line contains modifications (e.g., additions, alterations or deletions of data), the evicted cache line must be written back to the next highest level of the memory hierarchy. A modified cache line is referred to as "dirty." Similarly, if a page is being loaded from mass storage into main memory, another page may need to be evicted from main memory, and, if the evicted page is dirty, the evicted page must be written back to mass storage. In general usage, more frequently used data will linger in cache memory, and access to main memory will be infrequent, with access to mass storage less frequent still. Memory performance will thus approximate that of the cache memory.
The physical memory hierarchy typically has no knowledge or awareness of the data it is accessing. The structure of the data and any internal relationships are transparent to the physical implementation. Thus, with respect to the storage of objects, it is not uncommon for objects within the same object web, or even portions of the same object, to be stored in separate levels of the memory hierarchy, or within separate lines or pages within the same level of the memory hierarchy. Some referenced objects are used frequently whereas other objects may be needed only intermittently. The intermittently needed objects are likely to propagate to the higher levels of memory (main memory and mass storage), whereas the more frequently used objects will remain in cache memory. Thus, when object references must be traced, as is done in garbage collection, time-consuming accesses outside of the cache memory may be frequent, resulting in inefficient memory performance.
SUMMARY OF THE INVENTION
A method and apparatus for monitoring a cache for garbage collection are described. In a computer system comprising a cache and memory, a flush monitor monitors flushes of dirty cache lines to memory, whereas cache flushes of clean lines and cache line fills are performed separately to permit cache optimizations normally precluded by monolithic cache handlers implemented in software. The flush monitor implements a write barrier between the cache and memory, scanning dirty cache lines for references to objects within the cache. In some embodiments of the invention, one or more flush buffers may be used to temporarily store dirty cache lines before those dirty cache lines are flushed to memory, or to store copies of flushed cache lines for later scanning. Multiple cache lines may then be scanned by a single pass of the flush monitor.
Within the cache, objects are marked as non-local objects if those objects are at least partially resident in memory or have been referenced from memory. The marking of non-local objects enables garbage collection of first generation objects to be performed within the cache without accessing objects in memory. For example, local objects that are not referenced directly or indirectly from a root set of local objects, or from non-local objects within the cache, may be collected.
In an embodiment of the invention, a non-local bit is associated with an object upon that object's creation. The non-local bit has an initial state indicating that the associated object is a local object. When the flush monitor determines that a reference to an object is being written to memory by a cache flush of a dirty cache line, the flush monitor sets the associated non-local bit to indicate that the referenced object is now considered non-local. The non-local bits of objects in the cache are read during garbage collection to identify non-local objects.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a flow diagram of a "mark and sweep" garbage collection process.
Figure 2 is a block diagram of an example virtual memory hierarchy.
Figure 3 is a flow diagram of a generational garbage collection process in accordance with an embodiment of the invention.
Figure 4 is a diagram of objects in cache and memory configured as generations in accordance with an embodiment of the invention.
Figure 5 is a flow diagram of a process for handling a cache miss in accordance with an embodiment of the invention.
Figure 6A is a flow diagram of a process for handling a cache miss in a system comprising a flush buffer, in accordance with an embodiment of the invention.
Figure 6B is a flow diagram of a process for handling a cache miss wherein, in accordance with an embodiment of the invention, flushed cache lines are stored for deferred handling by a flush monitor.
Figure 7 is a block diagram of a cache configuration in accordance with an embodiment of the invention.
Figure 8 is a block diagram of a cache configuration with a flush buffer in accordance with an embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
The invention is a method and apparatus for monitoring a cache for garbage collection. In the following description, numerous specific details are set forth to provide a more thorough description of embodiments of the invention. It will be apparent, however, to one skilled in the art, that the invention may be practiced without these specific details. In other instances, well known features have not been described in detail so as not to obscure the invention.
Generational Garbage Collection Within The Cache
In embodiments of the invention, a generational approach is applied to garbage collection. Generational garbage collection improves collection efficiency by focusing most of the collection activity on those objects that are most likely to be garbage. Objects are divided into generations according to age. The heuristic that most objects die soon after they are created indicates that most dead objects (i.e., those objects that are no longer referenced by other active objects and are thus garbage) can be collected from the youngest generations. Therefore, younger generations are collected more often. Older generations are collected rarely, saving the collector work.
In a generational scheme, object webs frequently transcend generational boundaries. That is, older objects may hold references to younger objects, and vice versa. Since garbage collection is carried out on one generation at a time (or some subset of all of the generations), the object references between generations must be monitored to prevent the collection of objects that are referenced by objects in other generations. For this reason, the collector tracks inter-generational references using a mechanism referred to as a "write barrier." The write barrier allows the collector to identify those objects in the generation(s) currently being collected which should be maintained (i.e., kept alive) due to references from other generations. The write barrier is asserted at each reference store operation between a younger generation and an older generation.
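For comparison, a conventional per-store write barrier of the kind described above might be sketched as follows in C. The `generation` field and the `remembered_set_add` hook are assumptions for illustration; the embodiments described below instead assert the barrier at cache flush time.

```c
/* Illustrative object with an age field; 0 denotes the youngest generation. */
typedef struct Object {
    int generation;
} Object;

/* Hypothetical hook that records an inter-generational reference. */
void remembered_set_add(Object **slot);

/* Every reference store passes through the barrier so the collector can
 * keep alive younger objects referenced from older generations. */
void store_reference(Object *holder, Object **slot, Object *target) {
    *slot = target;                               /* the actual store */
    if (holder->generation > target->generation)  /* older -> younger */
        remembered_set_add(slot);
}
```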
In accordance with an embodiment of the invention, garbage collection may be performed over one or more of the first few layers of the memory hierarchy. For example, if both the L1 and L2 caches are on-chip with the processor, but the L3 cache is a slower off-chip cache, then the first generation boundary (and the write barrier) may be implemented between the L2 cache and the L3 cache. In this case, the L1 and L2 caches combined would constitute the first generation for garbage collection. In other embodiments, the boundary may be placed between the L3 cache and main memory, or between any other layers of the memory hierarchy. The concepts discussed herein are applicable to any organization of physical memory.
For the purposes of describing the following embodiments, those levels of cache collected by the garbage collection system (i.e., the younger generation) are referred to as "the cache." Levels of the memory hierarchy beyond those cache levels are referred to as "memory." Further, objects that reside solely within the cache are considered "local objects" with respect to the cache. Objects that exist within memory (even if those objects also reside within the cache) are considered "non-local objects", as are those objects within the cache that are referenced from memory or have been referenced from memory in the past.
In generational garbage collection, newly instantiated objects begin in the youngest generation. Over time, if the object survives any garbage collection cycles applied to its current generation, the object matriculates into the next oldest generation (i.e., by satisfying one or more specified conditions of the next generation, the object becomes a member). This "generational matriculation" process (also referred to as "tenuring") continues until the object is either collected during a garbage collection cycle or becomes a member of the oldest generation.
To implement the concept of generational matriculation, in one embodiment of the invention, objects are initially created within the cache and classified as local objects. Local objects are members of the first generation for garbage collection purposes. When an object becomes non-local, either by being written to memory or by having a reference to the object written to memory, the object has matriculated beyond the first generation.
When an object is created within the cache, an object identifier is assigned to the object. This identifier may, for example, be obtained from a list of available identifiers, possibly categorized by object size. If and when the object is first evicted from the cache, storage space is allocated in memory and the object identifier is mapped to the location of the allocated space. Objects that are collected before being evicted from the cache are not allocated space in memory. The identifiers for collected objects may be placed back onto the list of available identifiers.
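A minimal sketch of this identifier life cycle, with the size-categorized lists reduced to a single pool for brevity (all names and sizes are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJECT_IDS 4096   /* illustrative pool size */

typedef uint32_t object_id_t;

/* Stack of available identifiers; recycled identifiers are pushed back. */
static object_id_t id_pool[MAX_OBJECT_IDS];
static size_t      id_pool_top;

void id_pool_init(void) {
    for (size_t i = 0; i < MAX_OBJECT_IDS; i++)
        id_pool[i] = (object_id_t)i;
    id_pool_top = MAX_OBJECT_IDS;
}

/* Called when an object is created within the cache (assumes a
 * non-empty pool; a real system would handle exhaustion). */
object_id_t allocate_object_id(void) {
    return id_pool[--id_pool_top];
}

/* Called when an object is collected before eviction; such objects
 * are never allocated backing storage in memory. */
void release_object_id(object_id_t id) {
    id_pool[id_pool_top++] = id;
}
```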
Figure 3 is a flow diagram of a generational garbage collection process implemented in accordance with an embodiment of the invention. In step 300, a root set of objects is determined from those local objects currently in use by applications (e.g., those objects whose methods are presently invoked). For example, this root set may comprise those local objects currently referenced from registers in the processor. In step 301, those objects currently in the root set are marked as "live." In step 302, the cache is scanned for non-local objects (i.e., those objects existing in memory or referenced from memory), which are then added to the root set. Step 302 may occur before step 300 or step 301. Depending upon the implementation, the non-local objects may or may not be marked as "live."
In step 303, iterative reachability analysis is performed within the bounds of the cache, tracing object references from the root set to identify reachable local objects. Those local objects found during reachability analysis are marked as "live." In step 304, all unmarked local objects are swept from the cache, leaving only non-local objects and marked local objects. In step 305, the collector resets the marks on the local objects. By confining the reachability analysis to the cache in step 303, time-consuming accesses to memory are avoided. All access operations are performed at the speed of the cache.
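The steps of Figure 3 may be condensed into the following C sketch. The predicates and helpers (`is_non_local`, `mark_live`, and so on) are hypothetical; `mark_live` is assumed to trace references only to cache-resident objects, as step 303 requires.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct Object Object;

/* Hypothetical cache-level helpers. */
bool is_non_local(const Object *obj);
void mark_live(Object *obj);        /* marks obj and traces its references,
                                       ignoring targets outside the cache */
void sweep_unmarked_locals(void);   /* step 304 */
void reset_marks(void);             /* step 305 */

void collect_cache(Object **reg_roots, size_t num_roots,
                   Object **cache_objects, size_t num_cached) {
    /* Steps 300-301: root set from processor registers. */
    for (size_t i = 0; i < num_roots; i++)
        mark_live(reg_roots[i]);

    /* Steps 302-303: non-local objects join the root set and are traced. */
    for (size_t i = 0; i < num_cached; i++)
        if (is_non_local(cache_objects[i]))
            mark_live(cache_objects[i]);

    sweep_unmarked_locals();        /* step 304 */
    reset_marks();                  /* step 305 */
}
```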
Figure 4 illustrates a set of objects (A-L) separated into "older" and "younger" generations within memory and the cache, respectively. Objects within the younger generation may matriculate into the older generation over time, given that those objects survive garbage collection.
In Figure 4, the younger generation comprises objects A-F and J, whereas the older generation comprises objects G-I and K-L. Object A holds a reference to object C, and object C holds references to objects B and G. Object B holds a reference to object I, which in turn holds a reference to object J. Object G holds a reference to object L. Object L holds a reference to object F, which in turn holds references to objects C and D. Object E holds a reference to object D. Object H holds references to objects G, I and L. Object K holds a reference to object I. The dashed line between representations of object L in the cache and in memory indicates that at least a portion of object L has been loaded into the cache. Because of its matriculation into memory at some previous time, object L is considered to be part of the older generation.
Assuming that object A is referenced from a register within the processor, garbage collection of the cache in accordance with an embodiment of the invention would occur as follows. Object A is identified as the root set and is marked as a live object. The cache is scanned to identify non-local objects F, J and L, which are added to the root set. Objects F and J are considered non-local because those objects are referenced from memory. Object L is non-local because object L resides, at least partially, in memory.
Reachability analysis from object A identifies and marks local object C, and via object C identifies and marks object B. Performing the same analysis from object F identifies and marks object D (object C is also identified once more). Reachability analysis from objects L and J identifies no further local objects. Reachability analysis is performed by scanning the root object in the cache to identify object references. References to objects outside of the cache are ignored (e.g., from object C to object G), but references to local objects are traced to those local objects unless the referenced object was previously identified and marked. A sweep of the cache results in collection of object E, which is not referenced by local or non-local objects. Any other unreferenced objects would also be collected in the sweep process. After the sweep, the "live" marks for objects A-D are reset. Resetting may alternatively be performed at the beginning of each garbage collection cycle.
The tracing of object references is more easily performed for embodiments in which the locations of object headers are known or readily identifiable. Header identification is possible, for example, in embodiments using virtually addressed caches and an object table. An object table is similar to a page table, but is extended to allow objects to start on any word boundary, and to keep track of the object size. Object headers can then be identified directly by their virtual address.
Identification of headers is also possible in virtually or physically addressed caches without an object table. In such an embodiment, software provides a mechanism for identifying headers. Methods for identifying headers are known. For example, one such method is used in the Boehm-Demers-Weiser conservative garbage collection library for C/C++ described by Hans-Juergen Boehm and Mark Weiser in "Garbage Collection in an Uncooperative Environment," in Software Practice and Experience, 18(9): pp. 807-820, 1988.
In embodiments where the locations of object headers are not known, scanning for non-local objects may also involve checking to see if a word is an object header. If the word is a header, the word is checked to see if the associated object is non-local. If the object is non-local, all reference words in the cache following the non-local header are added to the root set. If the word is not a header, the header corresponding to the word is sought. This may be done, for example, by scanning backwards through memory, looking at lines that are in the cache. If a header is found, it is the header for that word, and the word is non-local if the header specifies that it is non-local. While scanning backwards through memory, a cache line may be needed that is missing from the cache. Since the missing cache line could contain the header, and accesses to memory are undesired, the cache line is assumed to contain a non-local header. The word is considered non-local and is added to the root set.
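A sketch of the backward header search just described; `line_in_cache`, `is_header`, and `header_marks_non_local` are hypothetical helpers, and the conservative assumption for missing cache lines appears in the loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* illustrative line size */

/* Hypothetical helpers over the cache contents. */
bool line_in_cache(uintptr_t line_addr);
bool is_header(uintptr_t word_addr);
bool header_marks_non_local(uintptr_t header_addr);

/* Decide whether the word at 'addr' belongs to a non-local object,
 * scanning backwards from 'addr' toward 'heap_base'. */
bool word_is_non_local(uintptr_t addr, uintptr_t heap_base) {
    if (is_header(addr))
        return header_marks_non_local(addr);
    for (uintptr_t p = addr - sizeof(uintptr_t); p >= heap_base;
         p -= sizeof(uintptr_t)) {
        uintptr_t line = p & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
        if (!line_in_cache(line))
            return true;   /* the missing line could hold the header, and
                              memory accesses are undesired: conservatively
                              treat the word as non-local */
        if (is_header(p))
            return header_marks_non_local(p);
    }
    return true;           /* no header found: conservative default */
}
```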
Write Barrier Implementation In Cache Miss
An embodiment of the invention uses a software flush monitor during cache flushes of dirty cache lines, while cache flushes of clean lines and cache line fills are handled separately by hardware. In other embodiments, the flush monitor may be implemented in hardware, or with a combination of hardware and software. The flush monitor is used to implement the write barrier for tracking inter-generational references between older objects in memory and younger objects in the cache. Since cache line fills are handled by hardware, cache line fills may run at the full speed of the hardware, and advanced cache configurations, such as non-blocking caches, are not prohibited. The flushing of the cache line may be performed as part of the flush monitor, or the flushing may be performed separately, in hardware or software.
In an embodiment of the invention, a flag, referred to as a non-local bit, is associated with an object (e.g., within the object header). In one embodiment, the non-local bit is handled in software, and, thus, its implementation is flexible. An object with its respective non-local bit set is referred to as a non-local object. When set, this non-local bit indicates that references to the associated object may exist outside the cache (i.e., in memory). When reset, the non-local bit indicates that there are no references to the associated object that exist outside of the cache. There may or may not be references to the associated object from within the cache, regardless of the state of the non-local bit.
When an object is initially created, there are no references to the object from outside of the cache. No references existed beforehand which could have been flushed out to memory. Therefore, the non-local bit of a newly-created object is reset to false.
When a dirty cache line is flushed from the cache, the flush monitor scans the cache line for references to objects that are in the cache. The non-local bits of any such referenced objects are set to indicate that a reference to those objects may now exist outside of the cache. Also, any non-local bits within the flushed dirty cache line are also set. The write barrier is thus satisfied. When a clean line is evicted from the cache, there is no need to scan for references. Any objects referenced by a clean cache line must have already had their associated non-local bits set.
Preferably, during a scan of a cache line, the flush monitor is able to discern the difference between true references to objects and other types of values (such as integer data) that might match a reference to an object. Where references are discernible, only true references are identified for marking of non-local objects, in accordance with non-conservative garbage collection practices. However, in embodiments where references are not clearly discernible to the flush monitor, an object may be conservatively marked as non-local if a scanned value in an evicted cache line matches a reference to that object. Under the conservative scheme, some objects may be miscategorized as non-local due to the assumption that all matching values are object references, but no objects will be erroneously collected during garbage collection.
The write barrier implementation of the flush monitor may be separated into a scanning operation and a setting operation for setting non-local bits when necessary. The scanning operation scans the cache line for references to other objects in the cache and marks any such referenced objects as non-local. The setting operation sets any non-local bits of objects that leave the cache. In accordance with one or more embodiments of the invention, the scanning and setting operations may both be implemented together or separately in hardware or software, or one operation may be performed in hardware while the other is implemented in software.
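The scanning and setting operations just described might be sketched together as follows; the helper names and the eight-word line are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8   /* illustrative line size */

typedef struct Object Object;

/* Hypothetical helpers. */
bool    word_is_object_header(const uintptr_t *word);
void    set_header_non_local(uintptr_t *header_word);  /* within the line */
Object *as_cached_object(uintptr_t word);  /* NULL unless the word is a
                                              reference to an object
                                              currently in the cache */
void    set_non_local_bit(Object *obj);

/* Write barrier applied when a dirty cache line is flushed to memory. */
void flush_monitor_scan(uintptr_t line[WORDS_PER_LINE]) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        /* Setting operation: object headers leaving the cache have
         * their non-local bits set within the flushed line itself. */
        if (word_is_object_header(&line[i]))
            set_header_non_local(&line[i]);
        /* Scanning operation: objects in the cache referenced from the
         * flushed line are marked non-local, satisfying the barrier. */
        Object *target = as_cached_object(line[i]);
        if (target != NULL)
            set_non_local_bit(target);
    }
}
```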
Figure 5 is a flow diagram illustrating the process for handling a cache miss in accordance with an embodiment of the invention. In step 500, a cache miss is detected, meaning that a requested piece of data is not within the cache, and that a cache line must be evicted in order to perform a cache line fill for the desired data. At step 501, it is determined whether the cache line to be evicted is dirty (i.e., whether it has been modified in some fashion since being placed in the cache). If the cache line is not dirty, then the current cache line may be freely evicted. In step 508, a cache line fill, under hardware control for example, is performed to obtain the desired data.
However, if the cache line is dirty in step 501, then, in step 502, the cache line flush is trapped and execution jumps to the cache flush monitor. In step 503, if objects within the cache line being flushed are local (e.g., the non-local bit is reset for objects within the given cache line), those objects are set to be non-local (e.g., the non-local bit is set for each object leaving the cache). Note that when an object header is flushed from the cache, or any portion of any auxiliary data structures used to identify the header are flushed from the cache, the non-local bit associated with that object is set. This is in addition to the setting of non-local bits as they themselves leave the cache.
In step 504, the cache line being evicted is scanned for references to other objects within the cache. Those objects thus referenced are set as non-local objects in step 505. In step 506, the cache line to be evicted is flushed to memory, and, in step 507, execution returns from the flush monitor. In subsequent step 508, the cache hardware fills the cache line with the desired data (object) from memory.
Steps 502-507 of the flush monitor may be performed in software to flexibly implement the write barrier by appropriate setting of non-local bits. However, in alternative embodiments, steps 502-507 may be implemented in hardware (e.g., where faster performance is desired) or in a combination of software and hardware. Cache line fills, regardless of whether the cache lines being evicted are dirty or clean, are performed in hardware. As shown, the flush monitor is activated whenever a cache miss necessitates eviction of a dirty cache line, e.g., in response to a cache flush trap.
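The control flow of steps 500-508 might be expressed as follows, reusing the `flush_monitor_scan` sketch above; `line_is_dirty`, `line_data`, and the write-back and fill primitives are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

/* Hypothetical cache primitives. */
bool       line_is_dirty(int way);
uintptr_t *line_data(int way);
void       write_back_to_memory(int way);               /* step 506 */
void       hardware_line_fill(int way, uintptr_t addr); /* step 508 */
void       flush_monitor_scan(uintptr_t line[WORDS_PER_LINE]);

/* Steps 500-508: service a cache miss at 'addr' by evicting 'way'. */
void handle_cache_miss(int way, uintptr_t addr) {
    if (line_is_dirty(way)) {               /* step 501 */
        /* Steps 502-505: trap into the flush monitor, which sets
         * non-local bits and scans the evicted line for references. */
        flush_monitor_scan(line_data(way));
        write_back_to_memory(way);          /* step 506 */
    }
    hardware_line_fill(way, addr);          /* step 508 */
}
```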
In an alternate embodiment, cache line flushes of dirty cache lines are accumulated in a flush buffer. Greater efficiency is then achieved by calling the flush monitor once to handle the cache flush and write barrier processes for multiple cache lines. In yet another embodiment, cache lines may be flushed immediately to memory, with a copy of the cache line stored for later processing by the flush monitor. The use of a cache flush buffer for delayed batch flushes and the use of cache line copies for monitoring of flushes after the fact are illustrated in Figures 6A and 6B, respectively.
Figure 6A is a flow diagram illustrating an embodiment of a process for handling cache misses in a system comprising a flush buffer. In step 600, the cache detects a cache miss. In step 601, it is determined whether the cache line to be evicted is dirty (i.e., whether it has been modified in some fashion since being placed in the cache). If the cache line is not dirty, then the current cache line may be freely evicted. In step 610, a cache line fill, under hardware control, is performed to obtain the desired data.
If, in step 601, the cache line is dirty, then, in step 602, the cache line to be evicted is written into the flush buffer. In step 603, if the flush buffer threshold (e.g., a statically or dynamically determined number of buffered cache lines) is not met, the cache can proceed with filling the cache line from memory in step 610. If, however, the flush buffer threshold is met in step 603, the cache line flush is trapped and execution jumps to the flush monitor in step 604. In step 605, local objects within all of the cache lines in the flush buffer (or that subset of cache lines being flushed) are set as non-local objects. In step 606, each cache line in the flush buffer is scanned for references to other objects in the cache. In step 607, those objects in the cache that are referenced from any one of the cache lines in the flush buffer are set as non-local objects. In step 608, all (or a portion) of the cache lines in the flush buffer are flushed to memory, and, in step 609, execution returns from the cache flush monitor. In subsequent hardware step 610, the cache is free to fill the desired cache line from memory.
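A sketch of the buffered variant of Figure 6A, with an illustrative four-line threshold; the buffer layout and helper names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE  8
#define FLUSH_THRESHOLD 4   /* illustrative threshold (step 603) */

typedef struct {
    uintptr_t words[WORDS_PER_LINE];
    uintptr_t addr;             /* write-back address */
} BufferedLine;

static BufferedLine flush_buffer[FLUSH_THRESHOLD];
static size_t       buffered;

/* Hypothetical helpers; flush_monitor_scan is sketched above. */
void flush_monitor_scan(uintptr_t line[WORDS_PER_LINE]);
void write_line_to_memory(uintptr_t addr, const uintptr_t *words);

/* Step 602: stage a dirty line; steps 604-608 run once at the threshold. */
void buffer_dirty_line(const BufferedLine *line) {
    flush_buffer[buffered++] = *line;
    if (buffered < FLUSH_THRESHOLD)
        return;                              /* step 603: below threshold */
    /* One pass of the flush monitor handles all buffered lines. */
    for (size_t i = 0; i < buffered; i++)
        flush_monitor_scan(flush_buffer[i].words);      /* steps 605-607 */
    for (size_t i = 0; i < buffered; i++)
        write_line_to_memory(flush_buffer[i].addr,
                             flush_buffer[i].words);    /* step 608 */
    buffered = 0;
}
```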
Figure 6B is a flow diagram illustrating an embodiment of a process for handling cache misses using immediate flushing and deferred monitoring. An advantage of this approach is that it more clearly decouples the cache flush and cache fill operations from the flush monitoring process. The cache flush and fill operations are handled immediately to provide better cache performance, whereas implementation of the write barrier by the flush monitor occurs at a more convenient time in the future. As long as a copy of the flushed data is available for processing by the flush monitor, the monitoring process may be deferred until the buffer containing the cache line copies is full, or the system is ready to perform a garbage collection sweep. Other mechanisms, such as buffer thresholds or timers, may be used to trigger the flush monitor before these conditions exist.
In step 620 of Figure 6B, the cache detects a cache miss. In step 621, it is determined whether the cache line to be evicted is dirty (i.e., whether it has been modified in some fashion since being placed in the cache). If the cache line is not dirty, then the current cache line may be freely evicted, and, in step 623, a cache line fill, under hardware control, is performed to obtain the desired data. If, in step 621, the cache line is dirty, then, in step 622, the cache line is flushed to memory and a copy of the cache line is stored in a flush buffer, after which a cache line fill is performed in step 623.
From step 623, the process continues at step 624. Steps 624-629 are substantially independent of the cache line fill in step 623, and, therefore, may alternatively precede or be performed in parallel with step 623. In step 624, the processor waits until conditions indicate that the flush monitor should operate on any outstanding copies of flushed cache lines, and then proceeds to step 625 where the cache line copies are processed in turn. As previously stated, these conditions may be based on buffer status, timing, or pendency of a garbage collection operation, for example.
In step 625, the flush monitor scans the cache line copy for references to objects in the cache. In step 626, any referenced objects are marked as non-local by setting their non-local bits. In step 627 (which may alternatively be performed before or in parallel with steps 625-626), the flush monitor undertakes the step of setting the non-local bits for the cache lines flushed to memory. This step may require accessing those flushed cache lines in memory to set the requisite non-local bits. In step 628, if necessary (e.g., if one or more of steps 625-627 is implemented as a software routine such as a trap handler), the flush monitor returns control to the parent process.
Because the flush monitor may be viewed as separate scanning and setting operations as previously described, it is also possible to carry out the setting operation on the non-local bits of the evicted cache line at the time the cache line is flushed to memory. The scanning operation may then be executed at a later time, based on the stored copy of the evicted cache line.
Cache Hardware Implementation
Figure 7 is a block diagram of a cache configuration in accordance with an embodiment of the invention. A direct-mapped cache configuration is shown as an example, though it will be obvious to one skilled in the art that other cache configurations, such as associative cache configurations, may also be implemented in embodiments of the invention.
The cache configuration of Figure 7 comprises two-port data RAM 703, key RAM 701 and comparator 707. Data RAM 703 is row addressable for cache line access via memory data port 713, and column addressable for data word access (within a specified cache line) via processor data port 712. Key RAM 701 stores keys (also referred to as "tags") associated with the data currently stored in each cache line. Key RAM 701 is addressable by the same row address applied to data RAM 703. The selected key from key RAM 701 is output via bus 711 to comparator 707. Output 714 of comparator 707 indicates whether a current cache access is a cache "hit" or a cache "miss."
Assuming a virtually addressed cache, the processor accesses data by providing the object identifier (object ID) and the offset of the desired data relative to the beginning of the object. Given a physically addressed cache, the virtual address comprising the identifier and offset is translated into a physical address for presentation to the cache. The object identifier and offset are written, for example, into data address register 700. As shown, the object identifier and offset are partitioned into a key value 708, a cache line (row) address 709, and a data word (column) address 710. For example, data key 708 may comprise a first portion of the object ID (OID-A) and a first portion of the offset (O-A). Cache line address 709 may comprise a second portion of the object ID (OID-B) and a second portion of the offset (O-B). The data word address may comprise the remaining portion of the offset (O-C).
Cache line address 709 identifies key 702 in key RAM 701 and cache line 704 in data RAM 703. Data word address 710 identifies column 705 of data RAM 703. The combination of addresses 709 and 710 identify data word 706. Key value 708 is compared with selected key 702 in comparator 707. In the case of a cache hit, data word 706 is accessed via port 712. In the case of a cache miss, the selected cache line 704 is first evicted and replaced with the cache line from memory that contains the desired data.
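For illustration, the partitioning of Figure 7 may be approximated with simple bit fields. The widths below are arbitrary assumptions, and the sketch treats the identifier and offset as one flat value, whereas the described embodiment interleaves portions of each (OID-A/O-A, OID-B/O-B, O-C) across the three fields.

```c
#include <stdint.h>

#define COL_BITS 4    /* data word address 710: 16 words per line */
#define ROW_BITS 10   /* cache line address 709: 1024 lines */

typedef struct {
    uint64_t key;     /* key value 708, compared against key RAM 701 */
    uint32_t row;     /* selects cache line 704 and key 702 */
    uint32_t col;     /* selects column 705 within the line */
} CacheAddress;

CacheAddress split_address(uint64_t id_and_offset) {
    CacheAddress a;
    a.col = (uint32_t)(id_and_offset & ((1u << COL_BITS) - 1));
    a.row = (uint32_t)((id_and_offset >> COL_BITS) & ((1u << ROW_BITS) - 1));
    a.key = id_and_offset >> (COL_BITS + ROW_BITS);
    return a;   /* a hit occurs when the stored key for a.row equals a.key */
}
```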
Figure 8 is a block diagram illustrating an implementation of the cache between memory and a processor, in accordance with an embodiment of the invention. The implementation comprises memory 800, cache 801, flush buffer 802, translation look-aside buffer (TLB) 803, and processor 804. Data addresses, such as an object ID and offset, are sent from processor 804 to cache 801 via address bus 805. Data access is provided between processor 804 and cache 801 via data bus 806. Flush buffer 802 and translation look-aside buffer 803 are optional; they are shown for purposes of illustration, but need not be present in all embodiments of the invention.
Cache line fills from memory 800 to cache 801 are performed via bus 809. The virtual address for the desired cache line is translated into a physical address, e.g., by querying the mapping between the object identifier and allocated physical storage. That physical address is then provided to memory 800 to acquire the data for the cache line fill. The virtual-physical address pair may be stored in translation look-aside buffer (TLB) 803 for fast access if there is a subsequent flush of the same cache line back to memory 800.
Flush buffer 802 provides an accumulator for cache lines to be written back to memory 800. In the event of a cache line flush, the cache line is written to flush buffer 802 via bus 810. When flush buffer 802 is full (or meets some specified threshold in terms of the number of cache lines contained), all cache lines in flush buffer 802 (or some subset thereof) may be flushed via bus 808 in a single pass of a cache flush monitor. The physical addresses required for flushing the cache lines back to memory 800 may be obtained from TLB 803, or determined through translation. In normal cache access operations, flush buffer 802 may be treated as an extension of the main cache (i.e., as a transitory "victim cache" containing soon-to-be-evicted cache lines) that is checked in parallel with the main cache or checked only on cache misses to determine whether desired object data is currently resident and accessible in flush buffer 802. In an implementation using immediate flushing with deferred flush monitoring (as described with respect to Figure 6B), flush buffer 802 may be used to store copies of previously flushed cache lines. When the flush monitor is called, scanning and setting operations are performed using the cache line copies stored in flush buffer 802.
When the flush monitor has to translate a virtual address into a physical address, it may be necessary to traverse a multi-level page table (or object table) or follow a collision chain in an inverted page table. Such a traversal may require several memory accesses, each of which could result in a cache line flush. A deadlock may occur if the table data from multiple levels of the page table map onto the same cache line or onto the cache line of the object being flushed. This deadlock can occur if the cache is direct-mapped (and the flush buffers, if implemented, are full), or if the cache is set-associative with less associativity than levels in the page table.
One scheme for preventing deadlock in implementations that include a translation look-aside buffer is to maintain inclusion between TLB 803 and cache 801 (and flush buffer 802). That is, every cache line or object within the cache would have an entry within TLB 803, negating the need for translation in cache line flushes. It is also possible to implement address translation during flush handling by performing non-caching memory accesses. If an access hits in the cache, the cache services the access. Otherwise, the access is performed directly through memory. Latency is high for direct memory access, but the cache remains unperturbed and no deadlock occurs.
For processor architectures implementing non-caching store instructions that bypass the cache (e.g., for I/O purposes), copies of data written to memory by each store operation may be provided to the flush monitor for scanning and setting of non-local bits. Thus, the integrity of the write barrier may be maintained.
Thus, a method and apparatus for monitoring a cache for garbage collection have been described in conjunction with one or more specific embodiments. The invention is defined by the claims and their full scope of equivalents.

Claims

1. In a computer system, a method comprising: detecting a cache miss if a cache line in a cache does not contain a desired object; executing a flush monitor in response to a cache line flush; performing a cache line fill to obtain said desired object from memory.
2. The method of claim 1, wherein said cache line fill is implemented in hardware.
3. The method of claim 1, wherein executing said flush monitor comprises implementing a write barrier between said cache and said memory.
4. The method of claim 3, wherein implementing said write barrier comprises: in said cache line, setting one or more local objects to be non-local objects.
5. The method of claim 4 wherein implementing said write barrier further comprises: scanning said cache line for one or more references to one or more other objects within said cache; and setting said other objects to be non-local objects.
6. The method of claim 4, wherein setting one or more local objects to be non-local objects comprises toggling a non-local bit associated with said one or more local objects.
7. The method of claim 1, wherein at least a portion of said flush monitor is implemented in software.
8. The method of claim 1, wherein at least a portion of said flush monitor is implemented in hardware.
9. The method of claim 1, further comprising trapping said cache line flush if said cache line is dirty.
10. The method of claim 1, further comprising: writing said cache line to a flush buffer if said cache line is dirty; trapping a cache line flush when a buffer threshold is met.
11. The method of claim 1, further comprising: storing a copy of a flushed cache line for deferred processing by said flush monitor.
12. A computer system comprising: a cache comprising one or more cache lines, said cache configured to perform cache line fills in hardware; a memory coupled to said cache; a flush monitor responsive to a cache flush, said flush monitor configured to implement a write barrier between said cache and said memory.
13. The computer system of claim 12, further comprising a processor coupled to said cache, wherein said flush monitor comprises software executed by said processor.
14. The computer system of claim 12, wherein at least a portion of said flush monitor is implemented in hardware.
15. The computer system of claim 12, further comprising a plurality of objects stored in said cache, said plurality of objects comprising one or more local objects and one or more non-local objects.
16. The computer system of claim 15, wherein said flush monitor is configured to implement said write barrier by setting one or more local objects in said cache line to be non-local objects.
17. The computer system of claim 16 wherein said flush monitor is further configured to implement said write barrier by: scanning said cache line for one or more references to one or more other objects within said cache; and setting said other objects to be non-local objects.
18. The computer system of claim 15, wherein said plurality of objects have one or more respective non-local bits.
19. The computer system of claim 18, wherein said flush monitor is configured to set one or more local objects to be non-local objects by toggling one or more non-local bits associated with said one or more local objects.
20. The computer system of claim 12, further comprising a buffer configured to store one or more evicted cache lines.
21. The computer system of claim 20, wherein a cache flush trap is configured to be triggered when a buffer threshold is met.
22. An apparatus comprising: means for detecting a cache miss if a cache line in a cache does not contain a desired object; means for executing a flush monitor in response to a cache line flush; means for performing a cache line fill to obtain said desired object from memory.
PCT/US2000/033439 1999-12-17 2000-12-06 Method and apparatus for monitoring a cache for garbage collection WO2001044947A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU22573/01A AU2257301A (en) 1999-12-17 2000-12-06 Method and apparatus for monitoring a cache for garbage collection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US46633599A 1999-12-17 1999-12-17
US09/466,335 1999-12-17

Publications (1)

Publication Number Publication Date
WO2001044947A1 true WO2001044947A1 (en) 2001-06-21

Family

ID=23851368

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/033439 WO2001044947A1 (en) 1999-12-17 2000-12-06 Method and apparatus for monitoring a cache for garbage collection

Country Status (2)

Country Link
AU (1) AU2257301A (en)
WO (1) WO2001044947A1 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3832912A1 (en) * 1987-10-02 1989-05-03 Sun Microsystems Inc WORKSTATION WITH VIRTUAL ADDRESSING IN MULTI-USER OPERATING SYSTEMS
JPH02114344A (en) * 1988-10-24 1990-04-26 Nec Corp Garbage collection processor
JPH04170650A (en) * 1990-11-05 1992-06-18 Oki Electric Ind Co Ltd Data processor
US5895489A (en) * 1991-10-16 1999-04-20 Intel Corporation Memory management system including an inclusion bit for maintaining cache coherency
US5717894A (en) * 1994-03-07 1998-02-10 Dell Usa, L.P. Method and apparatus for reducing write cycle wait states in a non-zero wait state cache system
US5845298A (en) * 1997-04-23 1998-12-01 Sun Microsystems, Inc. Write barrier system and method for trapping garbage collection page boundary crossing pointer stores
US5930807A (en) * 1997-04-23 1999-07-27 Sun Microsystems Apparatus and method for fast filtering read and write barrier operations in garbage collection system
US5953736A (en) * 1997-04-23 1999-09-14 Sun Microsystems, Inc. Write barrier system and method including pointer-specific instruction variant replacement mechanism
US6049810A (en) * 1997-04-23 2000-04-11 Sun Microsystems, Inc. Method and apparatus for implementing a write barrier of a garbage collected heap

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PATENT ABSTRACTS OF JAPAN vol. 014, no. 337 (P - 1079) 20 July 1990 (1990-07-20) *
PATENT ABSTRACTS OF JAPAN vol. 016, no. 481 (P - 1431) 6 October 1992 (1992-10-06) *
STEFANOVIC DARKO, ET AL.: "Age-based Garbage Collection", SIGPLAN NOTICES, vol. 34, 1 November 1999 (1999-11-01) - 5 November 1999 (1999-11-05), USA, pages 370 - 381, XP000995169 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7421539B1 (en) * 2004-05-18 2008-09-02 Sun Microsystems, Inc. Method and system for concurrent garbage collection and mutator execution
US7483930B1 (en) * 2004-07-29 2009-01-27 Sun Microsystems, Inc. Method and apparatus for maintaining an object-based write barrier to facilitate garbage-collection operations
US10162534B1 (en) * 2014-04-07 2018-12-25 Western Digital Technologies, Inc. Ordering commitment of data from a data cache to nonvolatile memory using ordering commands
US10621104B2 (en) 2015-09-25 2020-04-14 Hewlett Packard Enterprise Development Lp Variable cache for non-volatile memory

Also Published As

Publication number Publication date
AU2257301A (en) 2001-06-25

Similar Documents

Publication Publication Date Title
US6950838B2 (en) Locating references and roots for in-cache garbage collection
EP0780769B1 (en) Hybrid numa coma caching system and methods for selecting between the caching modes
US7065617B2 (en) Efficient write-watch mechanism useful for garbage collection in a computer system
US7469324B2 (en) System and method for concurrent compacting self pacing garbage collection using loaded value and access barriers
US5930807A (en) Apparatus and method for fast filtering read and write barrier operations in garbage collection system
Appel et al. Virtual memory primitives for user programs
JP4130481B2 (en) Write barrier system and method including pointer-dependent pseudo-instruction replacement mechanism
Hertz et al. Garbage collection without paging
JP4528307B2 (en) Dynamic performance monitoring based approach to memory management
US6226653B1 (en) Method and apparatus for performing generational garbage collection using remembered set counter
US5893144A (en) Hybrid NUMA COMA caching system and methods for selecting between the caching modes
US5630097A (en) Enhanced cache operation with remapping of pages for optimizing data relocation from addresses causing cache misses
Zorn Barrier methods for garbage collection
US7089396B2 (en) Method and profiling cache for management of virtual memory
US20070124560A1 (en) Method for tracking of non-resident pages
US20060173939A1 (en) Garbage collection and compaction
US8621150B2 (en) Data placement optimization using data context collected during garbage collection
US20080162611A1 (en) Methods and apparatus for marking objects for garbage collection in an object-based memory system
US6470361B1 (en) Method and apparatus for performing generational garbage collection using middle-aged objects
JPH07295886A (en) Computer system with hierarchical memory and hierachical memory and hierarchical memory management method
US7676511B2 (en) Method and apparatus for reducing object pre-tenuring overhead in a generational garbage collector
US20100005265A1 (en) Method for isolating objects in memory region
US7155467B1 (en) Adaptive type-partitioned garbage collection
JPH05210584A (en) Digital data processor having improved paging
WO2001044947A1 (en) Method and apparatus for monitoring a cache for garbage collection

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP