WO2013186694A2 - System and method for data classification and efficient virtual cache coherence without reverse translation - Google Patents

System and method for data classification and efficient virtual cache coherence without reverse translation Download PDF

Info

Publication number
WO2013186694A2
Authority
WO
WIPO (PCT)
Prior art keywords
cache
private
shared
local
classification
Application number
PCT/IB2013/054755
Other languages
French (fr)
Other versions
WO2013186694A3 (en)
Inventor
Stefanos Kaxiras
Alberto ROS BARDISA
Mahdad DAVARI
Original Assignee
Stefanos Kaxiras
Ros Bardisa Alberto
Davari Mahdad
Application filed by Stefanos Kaxiras, Alberto Ros Bardisa and Mahdad Davari
Publication of WO2013186694A2
Publication of WO2013186694A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0811: Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G06F 12/0815: Cache consistency protocols
    • G06F 12/10: Address translation
    • G06F 12/1027: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • G06F 12/1045: Address translation using a TLB associated with a data cache
    • G06F 12/1054: Address translation using a TLB associated with a data cache, the data cache being concurrently physically addressed
    • G06F 12/1063: Address translation using a TLB associated with a data cache, the data cache being concurrently virtually addressed
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

• As a refinement of the Private/Shared classification performed at the LLC/SHC, a line is further classified as Read-Only (RO) if it has not been written, and Read-Write (RW) otherwise. Each line in the LLC/SHC is tagged with an RO/RW bit. A shared cache line starts as RO but transitions to RW on the first write; because the line is shared, all the L1 caches that have a copy must be notified of the change in Read-Write status with a broadcast. The Read-Only classification works with inclusive and non-inclusive cache hierarchies in the same way as the Private/Shared classification. The RO-to-RW transition is sketched below.
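The following minimal Python sketch models the RO-to-RW transition for a Shared LLC/SHC line. It is an illustration only: the `l1s_with_copy` argument and the `mark_read_write` hook are assumed names, not interfaces taken from the patent.

```python
# Minimal sketch (assumed names) of the RO/RW refinement for Shared lines.

class SharedLineRoRw:
    def __init__(self):
        self.read_write = False             # a Shared line starts as Read-Only

    def on_write(self, addr, l1s_with_copy):
        if not self.read_write:
            self.read_write = True          # first write: RO -> RW
            for l1 in l1s_with_copy:        # broadcast: every L1 holding a copy
                l1.mark_read_write(addr)    # must learn of the status change
```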
• In another embodiment, a system and method are provided for classifying data as Private or Shared according to the generational behavior of the data in the private caches (L1s). The system and method are described for a block size equal to a cache line, but may be generalized to any block size.
• A generation for a cache line starts when the cache line is brought into an L1 cache as a result of an access by a core. In the L1 cache, the cache line may be repeatedly accessed by the requesting core before entering a "dead" time awaiting eviction. A generation of a cache line ends when the cache line is evicted from the private cache and replaced by a new cache line, by an explicit termination of the generation by a program instruction, or by a dead-block prediction (detecting or predicting when the cache line enters its dead time).
• A cache line is classified as Private if a generation of the cache line in a first private cache does not overlap in time with a second, distinct generation of the same cache line in a different private cache. If two or more generations of the same cache line overlap in time in separate L1 caches, the cache line is classified as Shared.
• Invalid shared cache lines in L1 caches must re-fetch the data from the LLC/SHC when accessed again by the core, using a special request referred to herein as a "refresh". A refresh indicates that a cache line is invalid in the L1 but not yet evicted. If the cache line does not exist in the L1 at all, a normal request is sent to the LLC/SHC.
• In this embodiment, each LLC/SHC cache line includes a Private/Shared bit and a "Private Owner" field that also functions as a count of the number of cores (threads) sharing the cache line when the cache line is Shared; the field is therefore referred to as the "PrivateOwner/SharerCount". In the notation below, the first entry in the parentheses is the Private/Shared status and the second entry is the PrivateOwner/SharerCount.
• A cache line is initialized to (Shared, 0). Upon the first access of the cache line in the LLC/SHC by an L1 cache, the P/S bit is set to Private and the PrivateOwner/SharerCount field is set to the ID of the core (thread) that accessed the line. For example, an access to a cache line by core X causes the transition (Shared, 0) → (Private, X).
• When a core other than the private owner accesses a (Private, X) line, before the LLC/SHC responds to the new core requesting the cache line, the former private owner X must be notified by a request from the LLC/SHC. If core X still has the line in its cache, it changes the classification of the line in its cache from Private to Shared and, as a result, either writes back or writes through dirty data to the LLC/SHC, or sends a positive acknowledgment (ACK) for the classification change; the LLC/SHC cache line then changes to Shared, with the PrivateOwner/SharerCount field set to 2, denoting the number of sharers: (Private, X) → (Shared, 2).
• If the former private owner has evicted the line from its cache (i.e., the cache line generation has ended), it replies with a negative acknowledgment (NACK) to the LLC/SHC request; in this case the LLC/SHC line remains Private and the PrivateOwner/SharerCount field is set to the ID of the new owner, Y: (Private, X) → (Private, Y). The resulting classification information is carried, with the LLC/SHC response, to the new core that initiated the request. An access by the current private owner leaves the classification unchanged.
• When a Shared cache line is evicted from an L1, an explicit eviction notification is sent to the LLC/SHC, and the PrivateOwner/SharerCount in the LLC/SHC line is decremented by 1.
• A shared cache line reverts to the Private state when all of the overlapping generations for the cache line in the L1 caches end. Because of the explicit eviction notifications sent upon eviction of shared L1 lines, when all of the generations of a cache line end, its LLC/SHC classification reaches (Shared, 0), which denotes the NULL state (no sharers). From this state, the next access from core X takes the LLC/SHC line to (Private, X).
• When the LLC/SHC line reaches the classification state (Shared, 1), there is only one sharer. Because the identity of that single L1 sharer is unknown, the classification (P/S bit) does not immediately transition to Private. The identity of the single sharer can be determined by a refresh request coming from core X: (Shared, 1) → (Private, X). It may also be determined through a write-through from a core, for example a write-through from core X: (Shared, 1) → (Private, X). If any other access for the cache line is made at the LLC/SHC, the sharer count is incremented: (Shared, 1) → (Shared, 2). If an eviction notification arrives at the LLC/SHC, the sharer count is decremented: (Shared, 1) → (Shared, 0), i.e., NULL.
• The shared-to-private transitions starting at (Shared, 1) are optional and can be enabled by the program or the OS depending on the sharing patterns of a workload. These transitions are summarized in the sketch below.
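A compact Python model of these transitions, under simplifying assumptions (a single line, in-order events, and the optional (Shared, 1) refresh/write-through transitions omitted); the class and method names are illustrative, not the patent's:

```python
# Sketch of the (P/S, PrivateOwner/SharerCount) transitions described above.

PRIVATE, SHARED = "Private", "Shared"

class LlcLine:
    def __init__(self):
        self.ps, self.field = SHARED, 0           # initialized to (Shared, 0)

    def access(self, core, former_owner_acks=True):
        """An L1 miss (or refresh) for this line arriving at the LLC/SHC."""
        if (self.ps, self.field) == (SHARED, 0):  # NULL state: no sharers
            self.ps, self.field = PRIVATE, core   # (Shared, 0) -> (Private, X)
        elif self.ps == PRIVATE and self.field != core:
            if former_owner_acks:                 # old owner still has the line
                self.ps, self.field = SHARED, 2   # (Private, X) -> (Shared, 2)
            else:                                 # NACK: its generation ended
                self.field = core                 # stays Private, new owner
        elif self.ps == SHARED:
            self.field += 1                       # one more overlapping sharer

    def eviction_notification(self):
        """Explicit notification that a Shared copy left some L1."""
        if self.ps == SHARED and self.field > 0:
            self.field -= 1                       # (Shared, 1) -> (Shared, 0)

line = LlcLine()
line.access(core=3)                               # (Shared, 0) -> (Private, 3)
line.access(core=5)                               # (Private, 3) -> (Shared, 2)
line.eviction_notification()                      # (Shared, 2) -> (Shared, 1)
line.eviction_notification()                      # NULL again: (Shared, 0)
assert (line.ps, line.field) == (SHARED, 0)
```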
Virtual-Addressed Private Cache / Physical-Addressed Shared Cache (VP-PS): Cache-Coherent Shared Virtual Memory for Multicores/Manycores
• A cache-coherent shared-virtual-memory (cc-SVM) multicore/manycore architecture with virtual-addressed private caches and a physical-addressed shared cache is introduced. This architecture has only one (logical) TLB, at the interface of the cores' private cache hierarchies and the LLC/SHC, as indicated by reference number 32 in Fig. 1. The use of only a single, shared TLB at the private-cache-to-shared-cache interface is made possible by a directory-less, broadcast-less, snoop-less cache coherence protocol. The coherence protocol allows coherence decisions to be taken by the cores independently, without coordination, either distributed or centralized. This organization solves the synonym problem and requires no reverse (physical-to-virtual) address translations.
• To implement a VP-PS cc-SVM multicore architecture, a computer system having a local or private cache memory with a dynamic write policy, and a method of operating a cache memory hierarchy with a dynamic write policy, are introduced, using a simple request-response protocol such as VIPS-M.
• The private cache follows a different write policy on a per-cache-line basis. The selection between write policies in the private cache is determined from the classification of data as Private or Shared. Each cache line that is brought into the private cache follows one of these write policies. A write policy for a cache line is selected, at the time the cache line is brought into the private cache, from the classification performed at the LLC/SHC; a default write policy is assigned when no selection is performed. The write policy of a cache line can also be changed dynamically, external to the cache action. Every cache line has a corresponding Clean/Dirty (D) bit to be used by the write policies.
• In one embodiment, the private cache selects among write policies depending upon the classification of data as Private or Shared, with the data classification being at the page level and performed at the TLB. The write policies implemented in the private caches include a write-back in the L1 cache, a write-through to the LLC/SHC, or a delayed write-through to the LLC/SHC, as sketched below.
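A minimal sketch of the per-line policy selection. The function and constant names, and the preference for delayed write-through on shared data, are assumptions made for this example (the text above leaves the exact choice open):

```python
# Sketch of per-cache-line write-policy selection from the P/S classification.

WRITE_BACK, WRITE_THROUGH, DELAYED_WRITE_THROUGH = "WB", "WT", "DWT"

def select_write_policy(shared, default=WRITE_BACK, delay=True):
    """Choose a policy when a line is brought into the private cache.

    `shared` is the P/S classification carried with the LLC/SHC response;
    None means no classification was performed, so the default applies."""
    if shared is None:
        return default
    if shared:
        return DELAYED_WRITE_THROUGH if delay else WRITE_THROUGH
    return WRITE_BACK                   # private data may stay dirty in the L1

assert select_write_policy(shared=False) == WRITE_BACK
assert select_write_policy(shared=True, delay=False) == WRITE_THROUGH
```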
• Each TLB entry (indicated by reference number 34 in Fig. 1) contains the following fields: VA, the virtual address; PA, the physical address; Status, which includes a Dirty/Clean bit, a Lock bit, a Valid/Invalid bit and the typical virtual memory protection bits; P/S, the Private-versus-Shared data classification; and PrivateOwner, the ID of the core that owns the page.
• If another core or thread accesses the same physical page (even using a synonym virtual address), the access is detected at the TLB (because the new core or thread differs from the private owner that was recorded initially) and the TLB entry (or every synonym entry) is marked as Shared.
• The dynamic write policy is thus determined by page-level information supplied by the Translation Lookaside Buffer (TLB). TLB entries indicate the Private/Shared (P/S) status of the page, which is transferred to the private caches with the LLC/SHC responses. The P/S bit of the page controls the write policy of all the cache lines in the page and, thus, whether a write-through (or any other selected write policy) will take place for these cache lines. While set, the Lock bit in a TLB entry prevents any changes to the entry during a transition from Private to Shared or vice versa.
• An L1 miss that proceeds to the LLC/SHC requires a virtual-to-physical address translation at the TLB. If a match is found in the TLB for the virtual address of the cache line and the TLB entry is tagged Private, the core or thread ID of the L1 cache miss is checked against the private-owner field of the TLB entry. If the core or thread ID matches the private-owner field, the TLB entry remains Private. If they do not match, the TLB entry is set to Shared. In the latter case, a single core or thread had "owned" the virtual/physical page as Private prior to the access; the corresponding cache lines in the L1 of the last private owner are converted to Shared, and those that are dirty are written through (or delayed-written-through) to the LLC/SHC. Following the update of the Private/Shared classification for the TLB entry, the physical address is sent to the LLC/SHC and the cache miss is serviced.
• On a TLB miss, the corresponding page table entry (hereinafter also referred to as "PTE") must be loaded from the page table. The loading of the entry may happen in hardware (with a "page table walker") or in software by an operating system that can modify the TLB. A TLB entry is selected for replacement, and the missing PTE is brought in from the page table along with its corresponding classification: the P/S bit and the PrivateOwner field.
• When a PTE is brought into the TLB, a search takes place using the physical address part of the PTE to check against the physical address part of the other TLB entries. If a match is found, the newly loaded PTE is a synonym of the matching TLB entries. Private/Shared classification of the page is then performed as described above for a TLB hit. If the new PTE is a synonym of two or more other TLB entries, those entries must already be Shared (since they are all synonyms); if the new PTE is already tagged Shared, no further action is needed and the TLB miss is resolved. If, however, the new PTE is Private, its status is changed to Shared, the PTE is locked, and the classification state of the cache lines in the private owner's cache is changed to Shared. If the new PTE is a synonym of just one other TLB entry and the new PTE and/or its TLB synonym are in classification state Private, the status of the PTE and/or the TLB synonym changes to Shared, as described in the previous examples. This flow is sketched below.
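The following Python sketch models page-level classification at the shared TLB, including the physical-address search that detects synonyms on a TLB fill. The class and method names are assumptions, and locking, TLB eviction, and the L1-side write-through effects are reduced to comments:

```python
# Sketch of page-level P/S classification at a shared TLB (assumed names).

class TlbEntry:
    def __init__(self, va, pa, owner):
        self.va, self.pa = va, pa       # virtual and physical page numbers
        self.shared = False             # P/S bit: a new page starts Private
        self.owner = owner              # PrivateOwner: core (thread) ID
        self.lock = False               # Lock bit (transitions not modeled)

class ClassifyingTlb:
    def __init__(self):
        self.by_va = {}                 # virtual page -> TlbEntry

    def translate(self, va, core):
        """Called on an L1 miss heading to the LLC/SHC (assumes a TLB hit)."""
        e = self.by_va[va]
        if not e.shared and e.owner != core:
            e.shared = True             # second core seen: page becomes Shared
            # (here the former owner's L1 lines are converted to Shared and
            #  the dirty ones written through, as described above)
        return e.pa

    def fill(self, entry):
        """Called on a TLB miss, after the PTE is fetched from the page table."""
        for other in self.by_va.values():
            if other.pa == entry.pa:    # same physical page: synonym detected
                entry.shared = other.shared = True
        self.by_va[entry.va] = entry
```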
• Alternatively, the operating system can classify synonym pages as Shared by default.
• Cache lines in the private cache hierarchies are allowed to exist even without a corresponding TLB entry. Because the private cache hierarchies hide the access behavior from the TLB, it is possible that, with popular replacement algorithms such as LRU (Least Recently Used) or its variations, many cache lines will exist in the private caches without corresponding TLB entries, especially cache lines that do not generate frequent LLC/SHC accesses.
• The classification of a TLB entry that has changed its Private/Shared state, or has updated its private-owner field, is stored in memory. Migration of a thread from core X to core Y is handled by flushing the cache lines of the thread from the cache of core X and patching the TLB entries classified as "Private with PrivateOwner X" to "Private with PrivateOwner Y".
• Pages accessed by more than one core are classified as Shared. In one embodiment, inter-process synonyms that involve virtual pages in more than one page table are classified as Shared by default: the operating system classifies such synonym pages as Shared at the moment of their mapping.
Automatic (H/W) Hybrid Data Classification at the Page Level by the TLB and at the Cache-Line Level by the LLC/SHC
• In this embodiment, data classification to Private or Shared is performed at a hybrid level: primarily at the cache-line level (if a classification exists at this level), or at the page level otherwise. On a miss in the LLC/SHC, the fetched LLC/SHC cache line starts with the Private/Shared and private-owner fields of the corresponding Page Table Entry (PTE) in the TLB. If the cache line in the LLC/SHC is Shared, the LLC/SHC response carries this information to the L1 cache line, which also tags its data as Shared. If the cache line in the LLC/SHC is Private, its private-owner field is compared to the core or thread that initiated the miss. If they are the same, the cache line remains Private with the same private owner; otherwise, the cache line changes to Shared.
• The resulting Private or Shared information is carried with the LLC/SHC response to the L1 cache. Because the private owner is checked per cache line in the LLC/SHC but per page in the TLB, the classification of many LLC/SHC cache lines can differ from the classification of their corresponding TLB entry. For example, two Private cache lines (each with a different private owner) can coexist in the same page, which must itself be Shared (since more than one private owner was observed accessing the page).
• Classification at cache-line granularity is not the only option: the granularity can be set to sub-page blocks, with sizes that are multiples of the cache-line size but smaller than the page size. The granularity can also reflect the page size: for example, for small pages the granularity is set to cache lines, while for large pages it is set to larger sub-page blocks.
• In this embodiment, the TLB entries are augmented with additional information for cache-line-level classification, or for classification at larger sub-page blocks. For each cache line or sub-page block of a page (e.g., 64 64-byte cache lines in a 4-KByte page, or 64 32-KByte sub-page blocks in a 2-MByte page), the classification information needed is stored along with the TLB entry for that page.
• The TLB entry contains the typical Page Table Entry (PTE) information: VA, the virtual address page; PA, the physical address page/frame; Status, the status and protection bits; plus a P/S bit, an RO/RW bit and a PrivateOwner field for page-level classification. In addition, the TLB entry holds an array of classification entries whose number equals the number of cache lines (or sub-page blocks) in the page; each array entry has its own P/S bit, RO/RW bit, and PrivateOwner/SharerCount field for the data classification embodiments discussed above.
• The classification for a page is stored in memory when the corresponding TLB entry is evicted, and reloaded from memory when a TLB entry is loaded with the corresponding PTE.
• Classification per cache line and per page is performed simultaneously. Additional control bits allow page-level classification to take precedence over cache-line classification. In particular, synonym pages are classified by default as Shared, and this classification takes precedence over any cache-line classification. Alternatively, the priority (precedence) of page-level versus cache-line-level classification can be decided dynamically, by comparing the number of transitions from Private to Shared and from Shared to Private at the cache-line level against user-defined thresholds. A sketch of the hybrid lookup follows.
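A minimal sketch of the hybrid lookup with per-line classification stored alongside the TLB entry. The 64-byte-line/4-KByte-page sizing follows the example above; the class name and boolean encoding are assumptions, and the dynamic, threshold-based precedence is not modeled:

```python
# Sketch of hybrid Private/Shared lookup: line level first, page level as the
# fallback; synonym pages force the page-level Shared classification to win.

LINES_PER_PAGE = 4096 // 64              # 64 cache lines in a 4-KByte page

class HybridTlbEntry:
    def __init__(self, page_shared, synonym=False):
        self.page_shared = page_shared or synonym
        self.synonym = synonym           # synonym pages default to Shared
        self.line_shared = [None] * LINES_PER_PAGE  # None: no line-level info

    def classify(self, line_index):
        """Effective classification for one cache line of the page."""
        line = self.line_shared[line_index]
        if self.synonym or line is None:  # precedence, or page-level fallback
            return self.page_shared
        return line
```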
• Finally, virtual-address coherence with support for virtual-address synonyms is provided for request-response protocols without request forwarding: requests go from the L1 to the LLC/SHC and responses back to the L1.
• Each request carries a transaction_tag that is a concatenation of the core ID (L1 cache ID) and an index into the MSHR array, giving direct access to the MSHR entry upon receipt of the response. This avoids an associative search of all the MSHRs to find the corresponding entry for a particular response. The key is that this can only be done with protocols where all messages from the LLC come as a consequence of a previous message from the L1 cache (i.e., there is always an MSHR entry in the L1); protocols with forwarding or invalidation messages violate this property.
• Because of this property, the correspondence of the virtual address to the LLC/SHC response can be kept just after the virtual-to-physical translation at the TLB, and the response can return to the L1 carrying the original virtual address. The tag encoding is sketched below.
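A small sketch of the transaction_tag encoding; the 4-bit MSHR index width is an illustrative assumption:

```python
# Sketch of the transaction_tag: core ID concatenated with an MSHR index, so a
# response locates its MSHR entry directly, with no associative search.

MSHR_BITS = 4                               # e.g., 16 MSHR entries per L1

def make_tag(core_id, mshr_index):
    return (core_id << MSHR_BITS) | mshr_index

def split_tag(tag):
    return tag >> MSHR_BITS, tag & ((1 << MSHR_BITS) - 1)

# A response carrying make_tag(3, 9) goes straight to MSHR entry 9 of core 3,
# where the original virtual address awaits for the L1 fill.
assert split_tag(make_tag(3, 9)) == (3, 9)
```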

Abstract

An on-chip memory hierarchy organization for a multicore processing system is disclosed. The hierarchy supports virtual-addressed private caches and a physical-addressed shared cache. It classifies cache-line data as private or shared to support a one-directional request-response protocol. The classification can be determined from the generational behavior of a cache line in the private caches: cache lines having a single generation in a private cache are Private, and cache lines having overlapping generations in two or more private caches are Shared. The Private or Shared classification is performed dynamically at run-time in hardware, using a single translation lookaside buffer at the interface between the private and shared caches. The coherence protocol uses the data classification in a dynamic write policy for both shared (data-race-free) data and private data, differentiating when data is written back to the shared cache based on the classification.

Description

SYSTEM AND METHOD FOR DATA CLASSIFICATION AND EFFICIENT VIRTUAL CACHE COHERENCE WITHOUT REVERSE TRANSLATION
Stefanos Kaxiras
Alberto Ros Bardisa
Mahdad Davari
Technical Field
[0001] The present invention relates in general to the caching of data in multiprocessor systems and, more particularly, to classifying data as private or shared and implementing virtual cache coherence in a multicore/manycore architecture in a manner that supports synonyms while eliminating the need for reverse translations to maintain coherence.
Background Art
[0002] In a multiple processor environment, two or more microprocessors (referred to as multiple-core, multi-core, and many-core) reside on the same chip and commonly share access to the same area of main memory via a cache hierarchy. Shared-memory microprocessors simplify parallel programming by providing a single address space even when memory is physically distributed across many processing nodes or cores. Most shared-memory multiprocessors use cache memories or "caches" to facilitate access to shared data and to reduce the latency of a processor's access to memory. Small but fast individual caches are associated with each processor core to speed up access to main memory. Caches, and the protocols controlling data access to caches, are of the highest importance in the multi-core parallel programming model.
[0003] A typical cache coherence protocol consists of two fundamental operations:
1.) Upon a write, the cache coherence protocol must find all the read copies of the data and invalidate them; and
2.) on a subsequent read, the cache coherence protocol must provide the latest value of the data by locating the last writer, and the last writer is downgraded to a reader.
These two operations are straightforward if all the cache copies of the data (readers and writer) are identified by their unique physical address in a directory indexed by physical address. Alternatively, in snooping cache coherence solutions, the physical address of the reads and writes is broadcast to all caches. However, the same operations are problematic with virtual addresses. Shared-memory systems typically implement coherence with snooping or directory-based protocols. Directory-based cache coherence protocols are notoriously complex, requiring a directory to constantly track readers and writers and to send invalidations or global broadcasts or snoops. Directory protocols also require additional transient states to cover every possible race that may arise. A minimal sketch of these two operations follows.
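To make the background problem concrete, here is a minimal Python sketch of a conventional, physically-indexed directory performing these two operations. It is an illustration only (not the disclosed protocol); the class and the per-cache `invalidate`/`downgrade` hooks are assumed names, and the comments mark where virtually-addressed L1s would force a reverse (physical-to-virtual) translation:

```python
# Illustrative-only sketch of a directory indexed by physical address.

class Directory:
    def __init__(self, caches):
        self.caches = caches            # core ID -> private cache object
        self.readers = {}               # physical address -> set of core IDs
        self.writer = {}                # physical address -> last writer

    def on_write(self, paddr, core):
        # Operation 1: find all read copies of the data and invalidate them.
        for r in self.readers.get(paddr, set()) - {core}:
            self.caches[r].invalidate(paddr)   # with virtual L1s this needs a
        self.readers[paddr] = set()            # physical-to-virtual translation
        self.writer[paddr] = core

    def on_read(self, paddr, core):
        # Operation 2: locate the last writer and downgrade it to a reader.
        w = self.writer.get(paddr)
        if w is not None and w != core:
            self.caches[w].downgrade(paddr)    # again physical-to-virtual
            self.readers.setdefault(paddr, set()).add(w)
            self.writer[paddr] = None
        self.readers.setdefault(paddr, set()).add(core)
```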
[0004] Coherent shared virtual memory (cSVM) is highly coveted for heterogeneous architectures as a means of simplifying programming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advantage, e.g., saving energy by eliminating address translation for hits. Unfortunately, implementing coherence for virtual caches is difficult because virtual caches must be accessed by virtual addresses, while coherence ultimately must use a unique physical address as a single point of reference. This implies the need for both forward and reverse translations for typical coherence protocols. The presence of synonyms complicates coherence with potentially multiple results per reverse translation.
[0005] A typical shared virtual memory implementation in a multicore provides virtual-to-physical address translation per core. The shared cache (also referred to herein as "LLC/SHC") is always accessed using physical addresses because coherence is implemented for physical addresses. The assumption is a private L1 cache (hereinafter referred to as "L1") per core and a shared LLC/SHC. The virtual-to-physical address translation occurs either before the L1 is accessed, or in parallel with the L1 access. In the first case, the L1 is accessed using physical addresses (physically indexed, physically tagged). In the second case, the Translation Lookaside Buffer (hereinafter "TLB") is accessed in parallel with the tag access of the L1. The L1 is accessed with a combination of the virtual part of the address and the physical (un-translated) part of the address, which typically comprises the offset bits in the virtual or physical page. In this case, the cache is typically virtually indexed but physically tagged, as the address translation completes in parallel with the tag access, and the tag comparison is performed using physical addresses. Performing address translation per core requires the TLBs to be kept coherent, and every access from every core to be translated, with a corresponding access of the TLB. This is very expensive in terms of energy.
[0006] In a deeper private cache hierarchy, e.g., a private L1/L2, placing the TLB after the L1 but before the private L2 saves some TLB accesses (since the TLB is accessed only on L1 misses) but leaves a synonym problem in the L1s. Using a typical invalidation-based cache coherence protocol (directory MESI or MOESI) requires reverse translations in one form or another (reverse maps, L2 backpointers, etc.) for coherence actions going from the L2/TLB to the L1, for example invalidations, downgrade requests (from the Modified state (M) to the Shared state (S)), or data requests (from the Owned state (O)). In addition, the reverse translation needs to expand to all the possible synonyms that may exist in the L1. Reverse translation introduces significant complexity and cost.
[0007] Alternatively, the address translation can be performed only when a miss reaches the LLC/SHC, placing the TLB between the private cache hierarchies and the LLC/SHC. In this case, the L1 is virtually indexed, virtually tagged, and operates using virtual addresses (similarly for a deeper private cache hierarchy). No address translation is needed for L1 hits, which saves significant energy. In a multicore, the TLB can be at the interface between the private cache hierarchies and the LLC/SHC, and shared by all cores. The problem with this approach is that it requires a reverse translation for all the coherence actions going from the LLC/SHC to the cores (coherence actions may also originate outside the multicore under consideration), and it leaves the synonym problem unsolved for all the private cache hierarchies. Every coherence action (such as, for example, invalidations, downgrade requests, or any request forwarded from the LLC/SHC to another core) requires a reverse translation to all the synonyms that may exist in the private caches.
[0008] For virtually-indexed, virtually-tagged (VIVT) L1 caches, coherence requests from the virtual caches (reads or writes) undergo address translation via a TLB before they reach the directory. According to the two fundamental operations of a cache coherence protocol:
1.) A write request reaching the directory can generate a number of invalidation requests. Each of these new requests requires its own reverse translation because of the possibility that in the target cache the data exists under a different virtual address than the one used by the write request. Worse, if multiple synonyms are allowed to exist in the same cache, a single invalidation request may result in multiple translations to virtual addresses.
2.) A read request that reaches the directory is forwarded to the last writer of the data that is tracked by the directory. This indirection also requires a reverse translation, as the writer may use a different synonym than the reader.
[0009] To avoid reverse translation, all coherence request traffic must be one-way from the virtual address domain to the physical address domain, and never from the physical address domain to the virtual address domain, nor from one virtual address domain to another. MESI and similar types of cache coherence protocols violate this condition with invalidations, request forwardings, and downgrades.
[00010] Consequently, it is desirable to have a new solution for efficiently implementing virtual-cache coherence within a multi-core architecture that reduces cost and complexity in a shared memory processing environment without sacrificing power or performance. In particular, it is desirable to have a system and method for implementing virtual cache coherence which supports synonyms, yet eliminates the need for reverse translations between the shared cache and private caches, and within the private cache hierarchies. Further, it is desirable to have a system and method for classifying data that enables use of a simple request-response protocol in a shared virtual memory hierarchy, and thereby eliminates the need to perform reverse translations even in the presence of synonyms.
Summary of the Invention
[00011] In accordance with general aspects of the present invention, there are provided systems and methods for efficiently implementing virtual cache coherence which utilize a simple request-response protocol to eliminate the need for reverse translations even in the presence of synonyms. The systems and methods described herein provide for replacing all per-core TLBs with a single TLB placed between the private caches and the last-level or shared cache. The simplified request-response protocol and TLB placement simplify the multicore memory organization and provide significant area, energy, and performance benefits. In the disclosed systems and methods, private cache hierarchies are accessed using virtual addresses, and virtual address synonyms are allowed to exist in the private caches without restriction and without the need for reverse physical-to-virtual address translation.
[00012] The disclosed invention uses a coherence protocol that operates without invalidations and request forwarding via the LLC/SHC, or, alternatively, without broadcasts and snoops. Coherence actions are restricted to be local to the private caches, or to go only from private caches to the LLC/SHC (excluding responses to requests, returned acknowledgments, and data classification transactions), and not from the LLC/SHC to the L1s (private caches) or from one L1 to another. Methods are disclosed herein for classifying data as either private or shared. It is then shown how the data classification can be utilized to implement a simple request-response protocol between one or more virtual private caches and a shared memory. In the method disclosed, all coherence request traffic is one-way from the virtual address domain to the physical address domain.
Brief Description of the Drawings
[00013] While the specification concludes with claims which particularly point out and distinctly claim the invention, it is believed the present invention will be better understood from the following description of certain examples taken in conjunction with the accompanying drawings:
[00014] Fig. 1 is a schematic illustration of an exemplary multi-core and cache architecture utilized in the present invention; and
[00015] Fig. 2 is a schematic illustration of an alternative multi-core/cache architecture with per-core local translation lookaside buffers.
[00016] The drawings are not intended to be limiting in any way, and it is
contemplated that various embodiments of the invention may be carried out in a variety of other ways, including those not necessarily depicted in the drawings. The accompanying drawings incorporated in and forming a part of the specification illustrate aspects of the present invention and, together with the description, serve to explain the principles of the invention; it being understood, however, that this invention is not limited to the precise arrangement shown.
Detailed Description of the Invention
[00017] The following description of certain examples should not be used to limit the scope of the present invention. Other features, aspects, and advantages of the versions disclosed herein will become apparent to those skilled in the art from the following description, which presents, by way of illustration, one of the best modes contemplated for carrying out the invention. As will be realized, the versions described herein are capable of other different and obvious aspects, all without departing from the invention. Accordingly, the drawings and descriptions should be regarded as illustrative in nature and not restrictive.
[00018] The disclosure relates to implementing cache-coherent shared Virtual Memory (hereinafter also referred to as "cc-SVM") on a multicore/manycore architecture. The embodiments described herein allow all cores to use virtual addresses to access their private caches without requiring a translation from virtual to physical addresses, while at the same time solving the virtual address synonym problem. The described embodiments eliminate the need for the reverse address translations that would otherwise be required for coherence, and provide efficient support for the data classification needed by the cache coherence protocols that make this organization practical. The cache coherent shared virtual memory is implemented using very simple request-response protocols such as, for example, the VIPS-M coherence protocol described by A. Ros and S. Kaxiras in "Complexity-Effective Multicore Coherence", 21st Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT), pages 241-252, Sept. 2012. Alternatively, cc-SVM can be implemented using simple, GPU-specific coherence, purely software-driven coherence (which puts the responsibility of maintaining coherence entirely on the program), or protocols based on a combination of software and simple hardware.
[00019] The embodiments herein will be described with respect to a generalized multi-core/many-core processing chip (also known as a Chip Multiprocessor, CMP) having two or more cores (processors) and an on-chip cache/memory hierarchy. The methods and systems relate to general-purpose multi-cores (few fat cores), many-cores (many thin cores) or GP-GPUs with coherent caches, accelerator multi-cores, and shared-address-space heterogeneous architectures with a multi-core coupled to a many-core. Private caches are virtually indexed and virtually tagged, saving all the energy that would otherwise be needed for a virtual-to-physical address translation via a TLB prior to, or in parallel with, accessing the cache. The address translation occurs upon a private cache miss, when a request proceeds to the LLC/SHC. The LLC/SHC uses physical addresses, and is physically indexed and physically tagged. Accordingly, a TLB holding the most useful virtual-to-physical address mappings is needed only at the interface between the private caches and the LLC/SHC.
[00020] The systems and methods described herein utilize a simple request-response cache coherence protocol from the private cache hierarchy (L1, L2, etc.) to the LLC/SHC. All coherence decisions are taken independently at each private cache, without any interaction with other cores. One such coherence protocol is based on a dynamic write policy in the private caches. Data in the private caches are classified as private (accessed by a single core or thread) or shared (accessed by more than one core or thread). The classification of data as private or shared may be determined dynamically, at page-level granularity, at the TLB (without software involvement) by observing all accesses that need translation. Alternatively, the classification may be performed at cache-line granularity on the LLC/SHC cache lines (again without software involvement), by a hybrid approach that involves both the TLB and the LLC/SHC, or even by the TLB alone but at cache-line granularity.
[00021] Under the data classification scheme of the disclosed embodiments, cache lines holding private data can use a write-back policy. Cache lines holding shared data can use a write-through or delayed write-through policy. Shared data can be selectively flushed from a private cache, and kept coherent for Data-Race-Free operation, when the core that owns the private cache performs a synchronization or memory ordering operation. The disclosed virtual memory organization is practical, and provides significantly better performance and energy-efficiency than alternative implementations, because the virtual-to-physical address translation is performed only upon cache misses, which further allows the system to have a single logical TLB between the private caches and the LLC/SHC. The simple request-response coherence protocol eliminates the need to perform reverse translations (from physical addresses to virtual addresses). The disclosed invention services coherence protocols with operations strictly between the private caches of the cores and the LLC/SHC. In this way, virtual address synonyms are handled without any additional support, and synonyms are allowed to exist both within a private cache and among multiple private caches.
[00022] As shown in Fig. 1, the multiple processing cores share access to the same area of main memory via a cache hierarchy. Each processor core 20 is connected to its own small but fast level 1 local or private data cache 22. Each core 20 may also optionally include a level 1 instruction cache (not shown). A global or shared data cache 24 is associated with all the cores 20. This global cache is typically the last-level cache before the main memory 26. The LLC/SHC can be a single cache (possibly multi-banked) or partitioned into multiple slices that are distributed with the cores; the latter is known as a "tiled" architecture. In addition to the private (L1) cache 22, each core may have one or more additional levels of private caches (not shown) attached below the L1 cache. These additional, intermediate cache hierarchy levels are private to the cores, and are treated in the same manner as the first-level, L1 cache. Hierarchical multi-core organizations with multiple levels of intermediate shared caches for groups of cores are treated recursively. The cores 20 and all caches in the system can be interconnected with any network-on-chip, switch, or bus architecture (including multiple buses), as indicated at 30. Cache coherence for the methods described herein is interconnect-agnostic, meaning that the coherence protocols are the same whether implemented over a bus, a crossbar, or a packet-based, point-to-point network-on-chip (NoC). This leads to seamless scaling from low-end to high-end parts, or free intermixing of buses and NoCs on the same chip, e.g., in a heterogeneous multi-core/many-core chip.
[00023] The computer system illustrated in Fig. 1 includes a single, shared TLB
provided between the local cache hierarchy and the shared, last-level cache. Fig. 2 illustrates an alternative shared memory architecture for a multiprocessor computer system having separate, per-core TLBs located between each of the local cache memories and the shared, last-level cache. The computer system in Fig. 2 also saves power by accessing the TLBs only on L1 misses and, similarly to the computer system in Fig. 1, provides coherence without reverse translation by servicing coherence protocols with operations strictly between the private caches of the cores and the LLC/SHC.
Automatic (H/W) Data Classification at the Cache-Line Level by the LLC/SHC or a Directory
[00024] In a first embodiment, a system and method are provided for classifying data as Private or Shared based upon accesses of the data in the shared last-level cache (LLC/SHC) or a directory structure that exists at the same level of the hierarchy as the LLC/SHC. Using a directory structure allows classification of a different set of cache lines than those residing in the LLC/SHC, but otherwise the description of the system and the method is the same. Henceforth the system and method are described for the LLC/SHC. The system and method will be described for a block size equal to a cache line; however, the system and method may also be generalized for other block sizes. Similarly, "ownership" of a cache line by a core as described herein can be generalized to "ownership" by threads, by L1 caches, etc. Each cache line in the private caches (L1s) uses one bit per entry to indicate the Private/Shared (P/S bit) status of its data. This information can be used by the cache coherence protocol, for example, to determine the write policy for the data. Cache lines are divided into "Private" and "Shared" by the LLC/SHC, depending upon the observed data accesses. The Private/Shared classification in the LLC/SHC is carried back to the L1s with the LLC/SHC response to an L1 miss. Each line in the LLC/SHC is tagged with a Private/Shared bit and a "Private Owner" field that contains the ID of the core that "owns" the line when it is Private. An LLC/SHC line is Private when all accesses to it come from the same core. If a different core accessing a private line is detected (by detecting that the requestor ID is different from the current owner ID), the line is changed to Shared.
[00025] On an L1 miss, a request is sent to the LLC/SHC and the following actions are performed:
1.) If the cache line in the LLC/SHC is Shared, then the LLC/SHC response carries this information to the L1 cache line, which also tags its data as Shared;
2.) If the cache line in the LLC/SHC is Private, then the private-owner field of the cache line is compared to the ID of the core that initiated the miss. If the IDs are the same, the cache line remains Private with the same private-owner. Otherwise, the cache line changes to Shared.
If the cache line changes from Private to Shared, the former private owner changes the classification of the cache line before the LLC/SHC responds to the new requestor. The classification change is achieved by sending a request to the former private owner to change its classification of the cache line from Private to Shared. As a result of changing classification, the former private owner may perform a write-back of dirty data. The resulting classification information is carried, with the LLC/SHC response, to the new requestor.
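By way of illustration only, the classification action taken at the LLC/SHC on an L1 miss may be rendered in software as the following minimal C sketch; it is not the claimed hardware implementation, and the type and helper names (llc_meta_t, ask_owner_to_reclassify) are hypothetical:

#include <stdbool.h>

typedef struct {
    bool shared;         /* P/S bit of the LLC/SHC line */
    int  private_owner;  /* core ID; meaningful only while Private */
} llc_meta_t;

/* Assumed hook: asks the former owner to re-tag its copy as Shared;
 * the owner may write back dirty data before acknowledging. */
void ask_owner_to_reclassify(int owner);

/* Returns the classification carried back with the LLC/SHC response. */
bool llc_classify_on_miss(llc_meta_t *m, int requester)
{
    if (!m->shared && m->private_owner != requester) {
        ask_owner_to_reclassify(m->private_owner);
        m->shared = true;   /* Private -> Shared transition */
    }
    return m->shared;
}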
[00026] The Private/Shared bit and private owner field exist only for the LLC/SHC lines. The status bit and private owner field are not saved externally, and are lost upon eviction of the cache line from the LLC/SHC. When a cache line is initially brought into the LLC/SHC, the initial state of the Private/Shared bit can vary, depending upon whether the System-on-Chip ("SoC") hierarchy is inclusive or non-inclusive. For a non-inclusive hierarchy, when a line is brought into the LLC/SHC (as a result of an L1 request), the Private/Shared state and private owner field of the line are unknown, and must be reconstructed by querying the L1s. A broadcast to all the L1s, or a snoop in all the L1s, establishes which (if any) private cache has the line. If more than one private cache has the line, then the line is shared and the private owner identity is irrelevant. If a single L1 has the line, the L1 holding the line replies to the broadcast with the L1 cache's ID (or the ID is put on a shared bus). This ID becomes the initial value for the private owner field. If no L1 has the line, i.e., an LLC/SHC cold miss, the ID of the requesting L1 cache is entered into the private owner field. Once the initial Private/Shared status is established, the classification is performed anew for the requesting L1. Broadcasts or snoops in the L1s concern only LLC/SHC misses, which are significantly fewer than L1 misses. For an inclusive hierarchy, when a cache line is evicted from the LLC/SHC, a broadcast or snoop is performed to all the L1 caches to flush all copies of the cache line. When a cache line is brought into the LLC/SHC, the requesting L1 cache becomes the line's private owner and the cache ID is added to the private owner field.
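For a non-inclusive hierarchy, the reconstruction step may be sketched, for illustration only and reusing llc_meta_t from the sketch above, as follows; broadcast_probe is an assumed primitive that returns the number of L1 holders and, when exactly one L1 holds the line, that holder's ID:

#include <stdint.h>

int broadcast_probe(uint64_t tag, int *single_holder); /* assumed */

void llc_refill_classify(llc_meta_t *m, uint64_t tag, int requester)
{
    int holder;
    int holders = broadcast_probe(tag, &holder);

    if (holders == 0) {               /* LLC/SHC cold miss */
        m->shared = false;
        m->private_owner = requester;
    } else if (holders == 1) {        /* the single L1 replied with its ID */
        m->shared = false;
        m->private_owner = holder;
    } else {                          /* more than one holder: Shared, */
        m->shared = true;             /* owner identity irrelevant     */
    }
    /* classification then proceeds anew for the requesting L1 */
}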
[00027] Similar to the Private/Shared classification, Shared cache lines can be classified as Read-Only (RO) if the line has not been written, and Read-Write (RW) otherwise. Each line in the LLC/SHC is tagged with an RO/RW bit. A shared cache line starts as RO but transitions to RW on the first write. Because the line is shared, all the L1 caches that have a copy must be notified of the change in Read-Write status with a broadcast. The Read-Only classification works with inclusive and non-inclusive cache hierarchies as described above for the Private/Shared classification.
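By way of illustration, the RO/RW refinement may be sketched as follows; the names (llc_rw_meta_t, broadcast_rw_change) are hypothetical:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     shared;      /* P/S bit */
    bool     read_write;  /* RO/RW bit: false = RO, true = RW */
    uint64_t tag;
} llc_rw_meta_t;

void broadcast_rw_change(uint64_t tag); /* assumed notification hook */

void llc_first_write(llc_rw_meta_t *m)
{
    if (m->shared && !m->read_write) {
        m->read_write = true;        /* RO -> RW on the first write */
        broadcast_rw_change(m->tag); /* notify all L1 copies */
    }
}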
Reverse Adaptation (H/W) Data Classification at the Cache-Line Level
[00028] In a second embodiment, a system and method are provided for classifying data as Private or Shared according to the generational behavior of the data in the private caches (L1s). The system and method will be described for a block size equal to a cache line; however, the method may also be generalized to any block size. A generation for a cache line starts when the cache line is brought into an L1 cache as a result of an access by a core. In the L1 cache, the cache line may be repeatedly accessed by the requesting core before entering a "dead" time awaiting eviction. A generation of a cache line ends when the cache line is evicted from the private cache and replaced by a new cache line, by an explicit termination of the generation by a program instruction, or by a dead-block prediction (detecting or predicting when the cache line enters its dead time). In this embodiment, a cache line is classified as Private if a generation of the cache line in a first private cache does not overlap in time with a second, distinct generation of the same cache line in a different private cache. If two or more generations of the same cache line overlap (in time) in separate L1 caches, the cache line is classified as Shared.
[00029] To track a generation of a cache line, the beginning and the end of the generation are made visible to the classification mechanism described herein. The beginning of a generation occurs as the result of an L1 miss, and a request to the LLC/SHC for the cache line. The end of a generation is not always known, because some evictions from private caches with clean data are typically silent, meaning that no update of the LLC/SHC is required. Accordingly, explicit eviction notifications are used to notify the LLC/SHC classification mechanism that a cache line is evicted from an L1, if the cache line is already Shared. If the cache line is Private, no explicit eviction notification is required. Explicit termination of a generation, by either the program or a prediction mechanism, also emits an eviction notification. According to the present method:
1.) Private cache lines in L1 caches evict silently if they are clean (contain unmodified data).
2.) Private cache lines in L1 caches evict by writing back their data to the LLC/SHC if the data is dirty (modified).
3.) Shared cache lines in L1 caches evict by sending an explicit eviction notification to the LLC/SHC, whether the cache lines are valid or invalid.
Invalid shared cache lines in L1 caches must re-fetch the data from the LLC/SHC when accessed again by the core, using a special request referred to herein as a "refresh". A refresh indicates that a cache line is invalid in the L1, but not yet evicted. If the cache line does not exist in the L1, a normal request is sent to the LLC/SHC.
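By way of illustration only, the eviction rules above, together with the refresh request, may be rendered as the following C sketch; the message primitives toward the LLC/SHC are assumed names, not a defined interface:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    bool valid, dirty, shared;
} l1_line_t;

/* Assumed message primitives toward the LLC/SHC. */
void llc_write_back(uint64_t tag);            /* dirty-data write-back */
void llc_eviction_notification(uint64_t tag); /* explicit notification */
void llc_refresh(uint64_t tag);               /* re-fetch; sharer count
                                                 stays unchanged       */

void l1_evict(l1_line_t *line)
{
    if (!line->shared) {
        if (line->dirty)
            llc_write_back(line->tag);    /* rule 2: dirty private line */
        /* rule 1: clean private lines evict silently */
    } else {
        /* rule 3: Shared lines always notify, valid or invalid */
        llc_eviction_notification(line->tag);
    }
    line->valid = false;
}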
[00030] Further, each LLC/SHC cache line includes a Private/Shared bit and the
"Private Owner" field, as described in the first embodiment. In this second embodiment, the "Private Owner" field also functions as a count of the number or plurality of cores (threads) sharing the cache line when the cache line is Shared. In this embodiment, this field is referred to as the "PrivateOwner/SharerCount." The following convention holds for the Private/Shared bit and the
"PrivateOwner/SharerCount" field, where the first entry in the parenthesis is the Private/Shared status and the second entry is the PrivateOwner/SharerCount.
(P/S, PrivateOwner/SharerCount):
• (Shared, 0): no sharers; by convention, this represents the NULL entry
• (Private, X): Private line, where X is the owner, X in {0 .. N-1} for N cores
• (Shared, n): Shared line with n sharers, where 1 < n <= N
• (Shared, 1): Shared line with one sharer, whose identity is unknown
[00031] A cache line is initialized to (Shared, 0). Upon the first access of the cache line from the LLC/SHC by an L1 cache, the P/S bit is set to Private, and the PrivateOwner/SharerCount field is set to the ID of the core (thread) that accessed the line, as shown in this example: access to a cache line by core X: (Shared, 0) → (Private, X).
[00032] For any further access to the LLC/SHC cache line as a result of an L1 cache miss, if the cache line P/S bit is set to Private, then the private-owner field of the line is compared to the ID of the core that initiated the miss. If the private-owner field matches the ID of the requesting core, the cache line remains Private with the same private owner. Otherwise, the cache line may change to Shared. Before the LLC/SHC responds to the new core requesting the cache line, the former private owner must be notified by a request from the LLC/SHC. If the former private owner has the line in its cache, the former owner changes the classification of the line in its cache from Private to Shared. As a result of changing classification, the former owner either performs a write-back of dirty data, or sends an acknowledgement for the classification change. The LLC/SHC cache line then changes to Shared, and the PrivateOwner/SharerCount field is set to 2, denoting the number of sharers. If the former private owner has evicted the line from its cache (i.e., the cache line generation has ended), the former private owner replies with a negative acknowledgement to the LLC/SHC request. In this case, the LLC/SHC line remains Private, and the PrivateOwner/SharerCount field is set to the ID of the new owner. The resulting classification information is carried, with the LLC/SHC response, to the new core that initiated the request.

[00033] For a request from core X that finds the LLC/SHC line as (Private, X), the classification remains the same. For a request from core Y that finds the LLC/SHC line as (Private, X):
• if core X does not have the cache line anymore, core X negatively acknowledges (NACKs) the notification, and the LLC/SHC line goes to (Private, Y);
• if core X still has the cache line, core X positively acknowledges (ACKs) the notification or, alternatively, writes through the cache line data to the LLC/SHC, and the LLC/SHC line goes to (Shared, 2).
For further requests:
• for any new access (read/write) to a shared cache line, the line remains shared and the PrivateOwner/SharerCount in the LLC/SHC line is incremented by 1: (Shared, n) → (Shared, n+1)
• for any eviction notification, the cache line remains shared and the PrivateOwner/SharerCount in the LLC/SHC line is decremented by 1: (Shared, n) → (Shared, n-1)
• for any refresh, the classification of the line in the LLC/SHC remains the same: (Shared, n) → (Shared, n).
A shared cache line reverts back to the Private state if all of the generations for the cache line that overlap in the L1 caches end. Because of the explicit eviction notifications sent upon eviction of shared L1 lines, when all of the generations of a cache line end, its LLC/SHC classification goes to (Shared, 0), which denotes the NULL state (no sharers). At this state, the next access from core X takes the LLC/SHC line to (Private, X).

[00034] When the LLC/SHC line reaches the classification state (Shared, 1), there is only one sharer. Because the identity of the L1 cache "sharer" is unknown, the classification (P/S bit) does not transition to Private. The identity of the single sharer can be determined with a refresh request coming from core X: (Shared, 1) → (Private, X). The identity of the single sharer may also be determined through a write-through from a core; for example, for a write-through from core X: (Shared, 1) → (Private, X). If any other access for the cache line is made at the LLC/SHC, then the sharer count is incremented: (Shared, 1) → (Shared, 2). If an eviction notification arrives at the LLC/SHC, then the sharer count is decremented: (Shared, 1) → (Shared, 0), i.e., NULL. The shared-to-private transitions starting at (Shared, 1) are optional and can be enabled by the program or the OS depending on the sharing patterns of a workload.
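By way of illustration only, the (P/S, PrivateOwner/SharerCount) transitions of paragraphs [00031] through [00034], including the optional (Shared, 1) recovery, may be summarized as the following C state-machine sketch; owner_still_holds stands for the ACK/NACK exchange with the former owner, and all names are hypothetical:

#include <stdbool.h>

typedef enum { EV_ACCESS, EV_EVICTION_NOTIFICATION, EV_REFRESH } ev_t;

typedef struct {
    bool shared;         /* P/S bit */
    int  owner_or_count; /* owner ID while Private, sharer count while Shared */
} cls_t;

bool owner_still_holds(int owner); /* assumed: ACK (true) or NACK (false) */

void llc_transition(cls_t *c, ev_t ev, int core)
{
    if (c->shared && c->owner_or_count == 0) {        /* NULL entry */
        if (ev == EV_ACCESS) {
            c->shared = false;                        /* (S,0) -> (P,X) */
            c->owner_or_count = core;
        }
    } else if (!c->shared) {                          /* (P,X) */
        if (ev == EV_ACCESS && core != c->owner_or_count) {
            if (owner_still_holds(c->owner_or_count)) {
                c->shared = true;                     /* ACK: -> (S,2) */
                c->owner_or_count = 2;
            } else {
                c->owner_or_count = core;             /* NACK: -> (P,Y) */
            }
        }
    } else {                                          /* (S,n), n >= 1 */
        if (ev == EV_ACCESS)
            c->owner_or_count++;                      /* (S,n) -> (S,n+1) */
        else if (ev == EV_EVICTION_NOTIFICATION)
            c->owner_or_count--;                      /* (S,n) -> (S,n-1) */
        else if (ev == EV_REFRESH && c->owner_or_count == 1) {
            c->shared = false;                        /* optional recovery: */
            c->owner_or_count = core;                 /* (S,1) -> (P,X); a  */
        }                                             /* write-through from */
    }                                                 /* core X acts alike  */
}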
Virtual-addressed Private Cache / Physical-addressed Shared Cache (VP-PS), cache coherent Shared Virtual Memory for Multicores/Manycores
[00035] In a further embodiment, a cache coherent shared-virtual-memory (cc-SVM) multicore/manycore architecture with virtual-addressed private caches and a physical-addressed shared cache is introduced. This architecture has only one (logical) TLB at the interface of the cores' private cache hierarchies and the LLC/SHC, as indicated by reference number 32 in Fig. 1. The use of only a single, shared TLB at the interface between the private caches and the shared cache can be accomplished by using a directory-less/broadcast-less, snoop-less cache coherence protocol. The coherence protocol allows coherence decisions to be taken by the cores independently, without coordination, either distributed or centralized. This organization solves the synonym problem and requires no reverse address translations (physical-to-virtual).
A-posteriori Classification Cache-Coherence Protocols
[00036] According to the present invention, a computer system having a local or private cache memory with a dynamic write policy, and a method of operating a cache memory hierarchy with a dynamic write policy, are introduced for the purpose of implementing a VP-PS cc-SVM multicore architecture. Using a simple request-response protocol such as, for example, VIPS-M, the private cache follows a different write policy on a per-cache-line basis. The selection between write policies in the private cache is determined from the classification of data as private or shared. Each cache line that is brought into the private cache follows one of these write policies. A write policy for a cache line is selected, at the time the cache line is brought into the private cache, from the classification performed at the LLC/SHC. A default write policy is assigned when no selection is performed. The write policy of a cache line can be changed dynamically, but external to the cache action. Every cache line has a corresponding Clean/Dirty (D) bit to be used by the write policies.
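By way of illustration only, the per-cache-line selection between write-back and (delayed) write-through may be rendered as the following C sketch of the store path; the names (l1_data_line_t, llc_write_through) are hypothetical, and the sketch is not the claimed hardware:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t tag;
    bool     dirty;   /* Clean/Dirty (D) bit */
    bool     shared;  /* P/S bit, set from the LLC/SHC classification */
    uint8_t  data[64];
} l1_data_line_t;

void llc_write_through(uint64_t tag, const uint8_t *data); /* assumed */

void l1_store(l1_data_line_t *line, int offset, const uint8_t *src, int len)
{
    memcpy(&line->data[offset], src, len);
    if (line->shared) {
        /* shared data: write-through (or delayed write-through)
         * keeps the LLC/SHC up to date for Data-Race-Free coherence */
        llc_write_through(line->tag, line->data);
        line->dirty = false;
    } else {
        /* private data: plain write-back; the LLC/SHC is updated
         * only when the dirty line is eventually evicted */
        line->dirty = true;
    }
}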
Automatic (H/W) Data Classification at the Page level by the TLB
[00037] In a first embodiment, the private cache selects from amongst write policies depending upon the classification of data as Private or Shared, with the data classification being at the page level and performed at the TLB. The write policies implemented in the private caches include a write-back to the L1 cache, a write-through to the LLC/SHC, or a delayed write-through to the LLC/SHC. When a physical page is first accessed by a core or thread, it is private to that core or thread. The corresponding TLB entry is marked as Private (P/S bit), and the core or thread that "owns" the page is recorded in the TLB entry in a field called "PrivateOwner". An exemplary TLB entry for this embodiment is as follows:
VA | PA | Status | P/S | PrivateOwner
where VA is the virtual address, PA is the physical address, Status includes a Dirty/Clean bit, a Lock bit, a Valid/Invalid bit, and the typical virtual memory protection bits, P/S is the Private versus Shared data classification, and PrivateOwner is the ID of the core that owns the page. The TLB entry is indicated by reference number 34 in Fig. 1. If another core or thread accesses the same physical page (even using a synonym virtual address), the access is detected at the TLB (because the new core or thread is different from the private owner that was recorded initially) and the TLB entry (or every synonym entry) is marked as Shared.
[00038] In this embodiment, the dynamic write policy is determined by page-level information supplied by the Translation Lookaside Buffer (TLB). TLB entries indicate the Private/Shared (P/S) status of the page, which is transferred to the private caches with the LLC/SHC responses. The P/S bit of the page controls the write policy of all the cache lines in the page and, thus, whether a write-through (or any other selected write policy) will take place for these cache lines. A Lock bit in the TLB entries prevents any changes to a TLB entry during the transition from Private to Shared or vice versa.
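By way of illustration, such a TLB entry may be laid out as the following C structure; the field widths are arbitrary assumptions:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t vpn;                /* VA: virtual page number */
    uint64_t ppn;                /* PA: physical page/frame number */
    bool     dirty, lock, valid; /* Status bits (protection bits elided) */
    bool     shared;             /* P/S bit */
    uint16_t private_owner;      /* ID of the owning core while Private */
} tlb_entry_t;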
TLB Classification and Management for Virtual-addressed Private Cache / Physical-addressed Shared Cache (VP-PS), cache coherent Shared Virtual Memory
1. TLB Hit
[00039] An L1 cache miss (in virtual address space), before going to the shared LLC/SHC, requires a virtual-to-physical address translation at the TLB. If a match is found in the TLB for the virtual address of the cache line and the TLB entry is tagged Private, then the core or thread ID of the L1 cache miss is checked against the private-owner field of the TLB entry. If the core or thread ID matches the private-owner field of the TLB entry, the TLB entry remains Private. If the core or thread ID and private-owner field do not match, then the TLB entry is set to Shared. In the latter case, there would be only a single core or thread that had "owned" the virtual/physical page as Private prior to the access. The corresponding cache lines in the L1 of the last private owner are converted to Shared, and those that are dirty are written through (or delayed written through) to the LLC/SHC. Following the updating of the Private/Shared classification for the TLB entry, the physical address is sent to the LLC/SHC, and the cache miss is serviced.
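By way of illustration only, the classification step on a TLB hit may be sketched as follows, reusing the illustrative tlb_entry_t structure above; convert_lines_to_shared stands for the assumed hook that re-tags (and writes through) the former owner's L1 lines for the page:

#include <stdint.h>

void convert_lines_to_shared(uint16_t owner, uint64_t vpn); /* assumed */

uint64_t tlb_hit_classify(tlb_entry_t *e, uint16_t requester)
{
    if (!e->shared && e->private_owner != requester) {
        e->shared = true;                       /* Private -> Shared */
        convert_lines_to_shared(e->private_owner, e->vpn);
    }
    return e->ppn; /* physical address forwarded to the LLC/SHC */
}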
2. TLB Miss
[00040] If a match is not found in the TLB for the virtual address (e.g., the virtual page number) of the cache line, the corresponding page table entry (hereinafter also referred to as "PTE") must be loaded from the page table. The loading of the entry from the page table may happen in hardware (with a "page table walker"), or in software by an operating system that can modify the TLB. A TLB entry is selected for replacement, and the missing page table entry (PTE) is brought in from the page table along with its corresponding classification: the P/S bit and the PrivateOwner field. When the PTE is brought into the TLB, a search takes place using the physical address part of the PTE to check against the physical address parts of the other TLB entries. If a match is found, the PTE that is brought into the TLB is a synonym with any matching (on physical addresses) TLB entries.
[00041] If the new PTE is not a synonym with any other TLB entries, the
Private/Shared classification of the page is performed as described above for a TLB hit. If the new PTE is a synonym with two or more other TLB entries, then the other TLB entries must already be Shared (since they are all synonyms). If the new PTE is already tagged as Shared, no further action needs to be taken, and the TLB miss is resolved. If, however, the new PTE is Private, then the status of the entry is changed to Shared, the PTE is locked, and the classification state of the cache lines in the private-owner cache is changed to Shared. If the new PTE is a synonym with just one other TLB entry, and either one or both (the new PTE and its TLB synonym) are in classification state Private, the status of the PTE and/or the TLB synonym changes to Shared, as described in the previous examples. In a different embodiment, the operating system classifies, by default, synonym pages as shared.
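By way of illustration only, the synonym check performed when a PTE is loaded into the TLB may be sketched as follows, reusing tlb_entry_t and convert_lines_to_shared from the sketches above; the search compares physical page numbers across resident entries:

void tlb_fill_synonym_check(tlb_entry_t tlb[], int entries,
                            tlb_entry_t *incoming)
{
    for (int i = 0; i < entries; i++) {
        if (tlb[i].valid && tlb[i].ppn == incoming->ppn) {
            if (!tlb[i].shared) {               /* demote the synonym */
                tlb[i].shared = true;
                convert_lines_to_shared(tlb[i].private_owner, tlb[i].vpn);
            }
            incoming->shared = true;            /* new PTE is Shared too */
        }
    }
    /* replacement and insertion of the incoming entry elided */
}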
3. TLB Replacement
[00042] On a TLB miss, a TLB entry must be selected for replacement. Any
replacement algorithm is acceptable. Cache lines in the private cache hierarchies are allowed to exist even without a corresponding TLB entry. Because the private cache hierarchies hide the access behavior from the TLB, it is possible that with popular replacement algorithms such as LRU (Least Recently Used) or its variations, many cache lines will exist in the private caches without corresponding TLB entries, especially those cache lines that do not generate frequent LLC/SHC accesses. Upon replacement, the classification of a TLB entry that has changed Private/Shared classification state, or updated its private-owner field, is stored in memory.
Migration

[00043] Migration of a thread from core X to core Y is handled by flushing the cache lines of the thread from the cache of core X and patching the TLB entries that are classified as "Private with PrivateOwner X" to "Private with PrivateOwner Y."
Private/Shared Classification for Synonyms
[00044] Pages accessed by more than one core are classified as Shared. In one embodiment, inter-process synonyms that involve virtual pages in more than one page table are classified Shared by default. The Operating System classifies synonym pages as Shared (by default) at the moment of their mapping, if they are:
1.) virtual pages that map on the same physical page with mappings that overlap in time (shared-memory semantics),
2.) virtual pages that map sequentially on the same physical page, but preserve the data in the physical page between mappings (message-passing semantics), or
3.) synonyms in the same address space.
All other pages start as Private and are subject to the normal classification technique described above.
Automatic (H/W) HYBRID Data Classification at the Page Level by the TLB and at the Cache Line Level by the LLC/SHC
[00045] In one alternative embodiment, data classification as Private or Shared is performed at a hybrid level: primarily at the cache line level (if a classification exists at this level), or at the page level otherwise. If there is a miss in the LLC/SHC, the fetched LLC/SHC cache line starts with the Private/Shared and private-owner fields of the corresponding Page Table Entry (PTE) in the TLB. If the cache line in the LLC/SHC is Shared, then the LLC/SHC response carries this information to the L1 cache line, which also tags its data as Shared. If the cache line in the LLC/SHC is Private, then its private-owner field is compared to the ID of the core or thread that initiated the miss. If they are the same, the cache line remains Private with the same private owner. Otherwise, the cache line changes to Shared. The resulting Private or Shared information is carried with the LLC/SHC response to the L1 cache. Because in the LLC/SHC the private owner is checked per cache line, while in the TLB the private owner is checked per page, it is possible that the classification of many LLC/SHC cache lines can differ from the classification of their corresponding TLB entry. For example, two private cache lines (each with a different private owner) can coexist in the same page, which must be Shared (since more than one private owner was observed accessing this page).
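By way of illustration only, the hybrid lookup may be sketched as follows, reusing the illustrative llc_meta_t and tlb_entry_t structures from the earlier sketches; a NULL line pointer stands for an LLC/SHC miss:

#include <stdbool.h>
#include <stddef.h>

bool hybrid_is_shared(llc_meta_t *line, const tlb_entry_t *page,
                      int requester)
{
    if (line == NULL)            /* LLC/SHC miss: the fetched line     */
        return page->shared;     /* inherits the page-level (TLB) bits */
    if (!line->shared && line->private_owner != requester)
        line->shared = true;     /* line-level Private -> Shared */
    return line->shared;
}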
[00046] Upon evicting an LLC/SHC cache line, its Private/Shared and PrivateOwner fields are discarded. If the cache line is private, the corresponding (single) cache line in the L1 must be updated with the classification of the corresponding page found in the TLB. This is because the private-owner information of the LLC/SHC cache line is discarded: if the line were fetched again and became shared because of the page classification, the corresponding L1 cache line would otherwise remain, erroneously, in the Private state. Thus, the L1 cache line must be updated to the page classification if it is in state Private. When the cache line is fetched again, it will obtain the Private/Shared and PrivateOwner fields from its corresponding TLB entry.
Automatic (H/W) HYBRID Data Classification at the Cache Line Level by the TLB
[00047] In yet another embodiment, classification at cache line granularity is performed in the TLB entries. Alternatively, the granularity can be set to sub-page blocks, with sizes that are multiples of the cache line size but smaller than the page size. The granularity can reflect the page size. For example, for small pages the granularity is set to cache lines, but for large pages the granularity is set to larger sub-page blocks. The TLB entries are augmented with additional information for cache-line-level classification or classification at larger sub-page blocks. For each cache line or sub-page block of a page (e.g., 64 64-byte cache lines in a 4-KByte page or, alternatively, 64 32-KByte sub-page blocks in a 2-MByte page), the classification information needed is stored along with the TLB entry for that page.
[00048] The TLB entry contains the typical Page Table Entry (PTE) information
needed for address translation (VA: Virtual Address Page, PA: Physical Address Page/Frame, Status: status and protection bits), a P/S bit, an RO/RW bit, and a PrivateOwner field for page-level classification, and a number of entries for cache-line-level classification. The number of entries is equal to the number of cache lines in a page. Each entry has its own P/S bit, RO/RW bit, and PrivateOwner/SharerCount field for the data classification embodiments discussed above. Cache line
classification for a page is stored in memory when the corresponding TLB entry is evicted and reloaded from memory when a TLB entry is loaded with the
corresponding page table entry.
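By way of illustration, a TLB entry augmented with cache-line-level classification may be laid out as follows, assuming 64 64-byte cache lines in a 4-KByte page; the names and field widths are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define LINES_PER_PAGE 64   /* 64 x 64-byte lines in a 4-KByte page */

typedef struct {
    bool     shared;          /* P/S bit */
    bool     read_write;      /* RO/RW bit */
    uint16_t owner_or_count;  /* PrivateOwner or SharerCount */
} line_class_t;

typedef struct {
    uint64_t     vpn, ppn;               /* VA / PA */
    uint16_t     status;                 /* status and protection bits */
    line_class_t page;                   /* page-level classification */
    line_class_t line[LINES_PER_PAGE];   /* per-cache-line classification */
} tlb_entry_lines_t;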
[00049] Classification per cache line and per page is performed simultaneously. In hybrid classification, additional control bits allow page level classification to take precedence over cache line classification. In one embodiment, synonym pages are classified by default as Shared, and this classification takes precedence over any cache line classification. In another embodiment, the priority (precedence) of page level and cache line level classification is decided dynamically, by comparing the number of transitions from Private to Shared and from Shared to Private at cache line level, to user-defined thresholds.
Address Translation
[00050] According to the present invention, virtual address coherence with support for virtual address synonyms is provided for request-response protocols without request forwarding. In such protocols, requests go from the L1 to the LLC/SHC and responses back to the L1. There is only a single special case of a request going from the LLC/SHC to an L1. This special case occurs when the LLC/SHC, during data classification, changes an L1 cache line from Private to Shared status. In this situation, since the request occurs for a specific private cache line, there is no ambiguity (e.g., no synonyms) in the private cache and the request is straightforward to handle.
[00051] In the Virtual L1, Physical LLC/SHC cache memory hierarchy described herein, all the L1 requests (i.e., L1 misses) are in the virtual address space. Assume that Core 0 reads a memory block whose virtual address is not found in the L1 cache. The TLB is accessed with the virtual address, and the physical address is obtained and sent to the LLC/SHC. The LLC/SHC processes the request and sends the data back to Core 0. On an L1 miss, a new tag (transaction_tag) is obtained for the transaction. The virtual address is stored in a miss-status handling register (MSHR) along with the transaction_tag. LLC/SHC responses do not need to send either the physical or the virtual address, just the transaction_tag and the data, if necessary. The virtual address is found in the MSHR. Thus, the present invention reduces traffic by not sending the address and avoids the problem of the reverse translation.
[00052] In one embodiment, the transaction_tag is a concatenation of the core ID (L1 cache ID) and an index into the MSHR array, giving direct access to the MSHR entry upon receipt of the response. This avoids an associative search of all the MSHRs to find the corresponding entry for a particular response. The key is that this can only be done with protocols where all messages from the LLC come as a consequence of a previous message from the L1 cache (i.e., there is always an MSHR entry in the L1). Protocols with forwarding or invalidation messages violate this property. Alternatively, the correspondence of the virtual address to the LLC/SHC response can be kept just after the virtual-to-physical translation from the TLB, and the response can return to the L1 carrying the original virtual address.
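By way of illustration only, the transaction_tag encoding and the direct MSHR lookup on the response path may be sketched as follows; the field widths are arbitrary assumptions and the names are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define MSHR_BITS  4                     /* 16 MSHRs per core (assumed) */
#define MSHR_COUNT (1u << MSHR_BITS)

typedef struct {
    uint64_t vaddr;  /* virtual address of the outstanding L1 miss */
    bool     busy;
} mshr_t;

/* transaction_tag = core ID concatenated with the MSHR index */
static inline uint32_t make_tag(uint32_t core, uint32_t idx)
{
    return (core << MSHR_BITS) | idx;
}

/* On a response, recover the MSHR entry (and hence the virtual address)
 * directly from the tag: no reverse (physical-to-virtual) translation
 * and no associative MSHR search is needed. */
static inline mshr_t *tag_to_mshr(mshr_t mshr[][MSHR_COUNT], uint32_t tag)
{
    return &mshr[tag >> MSHR_BITS][tag & (MSHR_COUNT - 1)];
}

Because the tag itself encodes both the core and the MSHR slot, the response message need carry only the tag and the data, consistent with the traffic reduction described above.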
[00053] Having shown and described various versions in the present disclosure, further adaptations of the methods and systems described herein may be accomplished by appropriate modifications by one of ordinary skill in the art without departing from the scope of the present invention. Several of such potential modifications have been mentioned, and others will be apparent to those skilled in the art. For instance, the examples, versions, geometries, ratios, steps, and the like discussed above are illustrative and are not required. Accordingly, the scope of the present invention should be considered in terms of the following claims and is understood not to be limited to the details of structure and operation shown and described in the specification and drawings.

Claims

What is claimed is:
1. A computer system comprising:
multiple processor cores;
a main memory;
at least one local cache memory associated with and operatively coupled to each core for storing one or more cache lines accessible only by the associated core, the local cache memories being virtually-addressed caches;
a global cache memory, the global cache memory being operatively coupled to the local cache memories and main memory and accessible by the cores, the global cache memory being capable of storing a plurality of cache lines, the global cache memory being a physically-addressed cache;
and
a translation lookaside buffer associated with each core, the translation lookaside buffer performing virtual to physical address translations on a local cache miss prior to accessing the global cache, each of the translation lookaside buffers only servicing coherence requests from the local cache to the global cache for simple request-response coherence.
2. A computer system comprising:
multiple processor cores;
a main memory;
at least one local cache memory associated with and operatively coupled to each core for storing one or more cache lines accessible only by the associated core, the local cache memories being virtually-addressed caches;
a global cache memory, the global cache memory being operatively coupled to the local cache memories and main memory and accessible by the cores, the global cache memory being capable of storing a plurality of cache lines, the global cache memory being a physically-addressed cache; and
a single, shared translation lookaside buffer associated with the multiple processor cores, the shared translation lookaside buffer performing virtual to physical address translation on a local cache miss prior to accessing the global cache, the shared translation lookaside buffer only servicing coherence requests from the local caches to the global cache for simple request-response coherence.
3. A computer system comprising:
multiple processor cores;
a main memory;
at least one local cache memory associated with and operatively coupled to each core for storing one or more cache lines accessible only by the associated core; and
a global cache memory, the global cache memory being operatively coupled to the local cache memories and main memory and accessible by the cores, the global cache memory being capable of storing a plurality of cache lines, the global cache memory classifying a cache line as shared when the cache line has two or more generations overlapping in separate local cache memories, and as private when the cache line has a single generation or non-overlapping generations in the local cache memories.
4. The computer system of claim 3, wherein a cache line generation comprises a period beginning at the access by a local cache of the cache line from the global cache memory and ending either with an eviction of the cache line from the local cache memory or an explicit termination of the generation.
5. The computer system of claims 3 or 4, wherein a single local cache generation of a cache line classified as private is tracked by the identity of the local cache holding the cache line, multiple overlapping local cache generations of a cache line classified as shared are tracked by a plurality, and the plurality of overlapping generations of a cache line is incremented with the beginning of every local cache generation and decremented with the end of a local cache generation, and wherein a cache line transitions from shared to a null classification when the plurality of overlapping local- cache generations is decremented to zero.
6. The computer system of claims 3, 4 or 5, wherein the classification of a cache line as private or shared is transferred to a local cache with a response of the global cache to a local cache request and the classification is stored in the local cache.
7. The computer system of claims 3, 4, 5 or 6, wherein private cache lines in the local cache evict silently; and wherein a request for a cache line from a first local cache to the global cache that finds the requested cache line classified as private, with a generation existing in a second local cache, sends a request to the second local cache to change classification of the cache from private to shared; and wherein the second local cache replies with an acknowledgment or a write back of dirty data, and a classification change to shared when the second local cache has not previously evicted the cache line, or a negative acknowledgement when the second local cache has previously evicted the cache-line; and wherein the cache line transitions to a shared classification in the global cache when an acknowledgment is received; and wherein the cache line remains in a private classification when the global cache receives a negative acknowledgment.
8. The computer system of claims 3, 4, 5, 6 or 7, wherein the computer system hierarchy is inclusive, a private or shared classification is stored for each cache line present in the global cache, and an initial classification state of a cache line that is brought into the global cache from memory is null.
9. The computer system of claims 3, 4, 5, 6 or 7, wherein the computer system hierarchy is non-inclusive, a classification state and an identity of the local cache for each private cache line generation is stored for each cache line present in the global cache, and an initial classification state of a cache line that is brought into the global cache from memory is discovered by broadcasting a request to all local caches to identify existing local cache generations.
10. The computer system of claims 3, 4, 5, 6 or 7, further comprising a separate directory structure for storing a classification for one or more cache lines, the directory structure being searchable upon a miss to the global cache for a requested cache line, and wherein if the requested cache line is not found in the directory structure, a new entry is allocated in the directory structure, and an initial
classification of the new entry is discovered by broadcasting a request to the local caches to identify existing local cache generations.
11. The computer system of claims 3, 4, 5, 6, 7, 8, 9 or 10, wherein the local cache memories are virtually-addressed caches and the global cache memory is a physically- addressed cache; and wherein the computer system further comprises per core local translation lookaside buffers, the per core local translation lookaside buffers performing virtual to physical address translation on a local cache miss prior to accessing the global cache; and wherein the per core local translation lookaside buffers only service coherence requests from the local caches to the global cache for simple request-response coherence.
12. The computer system of claims 3, 4, 5, 6, 7, 8, 9, or 10, wherein the local cache memories are virtually-addressed caches and the global cache memory is a physically- addressed cache; and wherein the computer system further comprises a shared translation lookaside buffer, the shared translation lookaside buffer performing virtual to physical address translation on a local cache miss prior to accessing the global cache, and wherein the shared translation lookaside buffer only services coherence requests from the local caches to the global cache for simple request-response coherence.
13. The computer system of claims 3, 4, 5, 6, 7 or 12, wherein the local cache memories are virtually-addressed caches and the global cache memory is a physically- addressed cache; and wherein the computer system further comprises a shared translation lookaside buffer, the shared translation lookaside buffer performing virtual to physical address translation on a local cache miss prior to accessing the global cache, and wherein the shared translation lookaside buffer only services coherence requests from the local caches to the global cache for simple request-response coherence; and wherein the shared translation lookaside buffer stores a page table entry and a classification state for one or more cache lines corresponding to the page table entry.
14. The computer system of claims 12 or 13, further comprising a plurality of virtual pages, each of the virtual pages being classified as private while a page table entry in the translation lookaside buffer corresponding to the virtual page is accessed by a single local cache, and being classified as shared beginning when a page table entry in the translation lookaside buffer for the virtual page is accessed by a second local cache, and wherein a classification state and identity of the local cache accessing a private virtual page is stored along with a corresponding page table entry in the translation lookaside buffer.
15. The computer system of claim 14, wherein the classification of a page transitions to null when all the cache lines of the page are classified as null; and wherein a virtual page transitions from null to private classification when a local cache accesses the page table entry.
16. The computer system of claims 11, 12, 14 or 15, wherein an operating system enforces a shared classification for all synonym virtual pages.
17. The computer system of claims 11, 12, 14, 15, or 16, wherein an operating system selectively enforces a precedence of shared classification of a virtual page over a classification of individual cache lines of the page.
18. The computer system of claim 17, wherein the precedence of the shared classification of a virtual page over the classification of the individual cache lines of the page is enabled dynamically as a function of the number and the type of classification transitions of individual cache lines of the page.
19. The computer system of claims 14, 15, 16, 17 or 18, wherein the classification is performed on sub-page blocks larger than a cache line.
20. A method of implementing virtual cache coherence in a multiprocessor computer system having a shared memory and private cache virtual memory hierarchy, the method comprising:
classifying cache line data as private or shared;
using the private or shared data classification to select from amongst dynamic write policies in a private cache; and
using the dynamic write policies to maintain coherence traffic in a single request response direction from one or more private data caches to a global cache using only virtual to physical address translations for cache lines.
21. The method of claim 20, wherein classifying data as private or shared further includes classifying a cache line as shared when the cache line has two or more generations overlapping in separate local cache memories, and as private when the cache line has a single generation or non-overlapping generations in the local cache memories.
22. The method of claims 20 or 21, wherein the method of classifying cache line data further includes classifying a cache line as private or shared a-posteriori in the global cache after a private cache miss for a requested cache line.
23. The method of claims 20, 21 or 22, wherein following classification of a cache line as private or shared in the global cache, the method further includes transmitting the classification along with the global cache response to a requesting private cache.
24. The method of claims 20, 21, 22 or 23, wherein the data classification as private or shared is used by a request response protocol to select between a write-back of cache line data to a private cache, or a write-through of cache line data to the global cache.
25. The method of claims 20, 21, 22, 23 or 24, wherein a cache line address is translated from a virtual address to a physical address at a single translation lookaside buffer located at the interface of the private cache hierarchy and the global cache.
26. The method of claims 20, 21, 22, 23, 24 or 25, wherein coherence actions are strictly local to the private caches or one-directional from a private cache to the global cache.
PCT/IB2013/054755 2012-06-11 2013-06-10 System and method for data classification and efficient virtual cache coherence without reverse translation WO2013186694A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261657929P 2012-06-11 2012-06-11
US61/657,929 2012-06-11

Publications (2)

Publication Number Publication Date
WO2013186694A2 true WO2013186694A2 (en) 2013-12-19
WO2013186694A3 WO2013186694A3 (en) 2014-07-31

Family

ID=48985792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2013/054755 WO2013186694A2 (en) 2012-06-11 2013-06-10 System and method for data classification and efficient virtual cache coherence without reverse translation

Country Status (1)

Country Link
WO (1) WO2013186694A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140101390A1 (en) * 2012-10-08 2014-04-10 Wisconsin Alumni Research Foundation Computer Cache System Providing Multi-Line Invalidation Messages
WO2016003544A1 (en) * 2014-06-30 2016-01-07 Intel Corporation Data distribution fabric in scalable gpus
US10127153B1 (en) 2015-09-28 2018-11-13 Apple Inc. Cache dependency handling
US10482016B2 (en) 2017-08-23 2019-11-19 Qualcomm Incorporated Providing private cache allocation for power-collapsed processor cores in processor-based systems
US10769076B2 (en) 2018-11-21 2020-09-08 Nvidia Corporation Distributed address translation in a multi-node interconnect fabric
CN115114192A (en) * 2021-03-23 2022-09-27 北京灵汐科技有限公司 Memory interface, functional core, many-core system and memory data access method
US11567791B2 (en) * 2020-06-26 2023-01-31 Intel Corporation Technology for moving data between virtual machines without copies

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4006445B2 (en) * 2002-11-21 2007-11-14 富士通株式会社 Cache control method and processor system
US8341357B2 (en) * 2010-03-16 2012-12-25 Oracle America, Inc. Pre-fetching for a sibling cache

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A. ROS; S. KAXIRAS: "Complexity-Effective Multicore Coherence", 21st Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT), September 2012, pages 241-252

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140101390A1 (en) * 2012-10-08 2014-04-10 Wiscosin Alumni Research Foundation Computer Cache System Providing Multi-Line Invalidation Messages
US9223717B2 (en) * 2012-10-08 2015-12-29 Wisconsin Alumni Research Foundation Computer cache system providing multi-line invalidation messages
US10346946B2 (en) 2014-06-30 2019-07-09 Intel Corporation Data distribution fabric in scalable GPUs
US9330433B2 (en) 2014-06-30 2016-05-03 Intel Corporation Data distribution fabric in scalable GPUs
CN106462939A (en) * 2014-06-30 2017-02-22 英特尔公司 Data distribution fabric in scalable GPU
WO2016003544A1 (en) * 2014-06-30 2016-01-07 Intel Corporation Data distribution fabric in scalable gpus
US10580109B2 (en) 2014-06-30 2020-03-03 Intel Corporation Data distribution fabric in scalable GPUs
US10127153B1 (en) 2015-09-28 2018-11-13 Apple Inc. Cache dependency handling
US10482016B2 (en) 2017-08-23 2019-11-19 Qualcomm Incorporated Providing private cache allocation for power-collapsed processor cores in processor-based systems
US10769076B2 (en) 2018-11-21 2020-09-08 Nvidia Corporation Distributed address translation in a multi-node interconnect fabric
US11327900B2 (en) 2018-11-21 2022-05-10 Nvidia Corporation Securing memory accesses in a virtualized environment
US11567791B2 (en) * 2020-06-26 2023-01-31 Intel Corporation Technology for moving data between virtual machines without copies
CN115114192A (en) * 2021-03-23 2022-09-27 北京灵汐科技有限公司 Memory interface, functional core, many-core system and memory data access method

Also Published As

Publication number Publication date
WO2013186694A3 (en) 2014-07-31

Similar Documents

Publication Publication Date Title
US6647466B2 (en) Method and apparatus for adaptively bypassing one or more levels of a cache hierarchy
US6370622B1 (en) Method and apparatus for curious and column caching
US5710907A (en) Hybrid NUMA COMA caching system and methods for selecting between the caching modes
US5893144A (en) Hybrid NUMA COMA caching system and methods for selecting between the caching modes
US7925840B2 (en) Data processing apparatus and method for managing snoop operations
US7237068B2 (en) Computer system employing bundled prefetching and null-data packet transmission
US6105113A (en) System and method for maintaining translation look-aside buffer (TLB) consistency
US6826651B2 (en) State-based allocation and replacement for improved hit ratio in directory caches
US7434007B2 (en) Management of cache memories in a data processing apparatus
WO2013186694A2 (en) System and method for data classification and efficient virtual cache coherence without reverse translation
US9372803B2 (en) Method and system for shutting down active core based caches
US20050005074A1 (en) Multi-node system in which home memory subsystem stores global to local address translation information for replicating nodes
US20170185515A1 (en) Cpu remote snoop filtering mechanism for field programmable gate array
JP2010517184A (en) Snoop filtering using snoop request cache
US7380068B2 (en) System and method for contention-based cache performance optimization
US20030115402A1 (en) Multiprocessor system
CN113853589A (en) Cache size change
US6721856B1 (en) Enhanced cache management mechanism via an intelligent system bus monitor
EP3399422A1 (en) Efficient early ordering mechanism
US8473686B2 (en) Computer cache system with stratified replacement
US9442856B2 (en) Data processing apparatus and method for handling performance of a cache maintenance operation
Sembrant et al. A split cache hierarchy for enabling data-oriented optimizations
US7543112B1 (en) Efficient on-chip instruction and data caching for chip multiprocessors
Mallya et al. Simulation based performance study of cache coherence protocols
Shafiee et al. Using partial tag comparison in low-power snoop-based chip multiprocessors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13748370

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 13748370

Country of ref document: EP

Kind code of ref document: A2