US20160147658A1 - Configuration based cache coherency protocol selection - Google Patents
Configuration based cache coherency protocol selection
- Publication number
- US20160147658A1
- US14/548,988
- Authority
- US
- United States
- Prior art keywords
- node
- nodes
- protocol
- cache coherency
- cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0817—Cache consistency protocols using directory methods
- G06F12/0831—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
- G06F12/084—Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1032—Reliability improvement, data loss prevention, degraded operation etc
- G06F2212/31—Providing disk cache in a specific location of a storage system
- G06F2212/314—In storage network, e.g. network attached cache
- G06F2212/60—Details of cache memory
- G06F2212/601—Reconfiguration of cache memory
- G06F2212/62—Details of cache specific to multiprocessor cache arrangements
- G06F2212/622—State-only directory, i.e. not recording identity of sharing or owning nodes
Definitions
- the present invention is related to computer systems and more particularly to dynamically selecting cache coherency protocols of multi-processor systems.
- Modern computer packaging technology provides for a modular design that may be used in a variety of computer system products in conjunction with a computer memory.
- multiple processor cores may be packaged on a single module or chip die.
- multiple chip cores may be packaged with storage control function on a single module or chip die.
- the multiple processor cores employ a cache hierarchy on the module or chip die.
- only a single core of the module or chip die may be used in a computer system.
- only a single such module or chip die may be used in a computer system.
- multiple such modules or chip dies may be used in a computer system.
- Each embodiment may require a different cache coherency protocol to efficiently perform desired function and performance.
- one of a plurality of cache coherency protocols is selected and performed in a computer system comprising a first node of a configuration comprising one or more communicatively coupled nodes, each node comprising one or more communicatively coupled computer elements, each computer element consisting of at least one of a cluster or a storage controller (SC) function, each cluster comprising one or more processors, the SC function comprising a shared cache, the shared cache being shared by processors of each cluster, the method comprising: initiating communication, by each element of the configuration, to other elements of the configuration to determine configuration topology; based on a result of the initiated communication, storing a coherency value by each element; based on the coherency value, determining, by elements of the node, a cache coherency protocol of the plurality of cache coherency protocols to be used by the node; and using the determined cache coherency protocol to handle cache coherency requests.
- the configuration topology consists of the first node, wherein the plurality of cache coherency protocols comprise a first protocol that does not use an SC function to maintain cache coherency of the first node and a second protocol that uses an SC function to maintain cache coherency of the first node.
- the configuration topology consists of the first node, the using the selected cache coherency protocol further comprising accessing an inclusive directory of the SC function of the first node to handle the cache coherency requests.
- the configuration topology further comprises one or more second nodes, the first node communicatively coupled to said one or more second nodes, the using the selected cache coherency protocol further comprises: accessing, by the first node, an inclusive directory of the SC function of the first node to handle the cache coherency requests; based on determining, by the first node, that the cache coherency operation can be performed solely by the first node, performing the cache coherency operation, by the first node, without accessing caches of the one or more second nodes; and based on determining, by the first node, that the cache coherency operation must access said one or more second nodes, performing the cache coherency operation by accessing caches of said one or more second nodes.
- the plurality of cache coherency protocols comprise a third protocol and a fourth protocol for maintaining cache coherency of the node wherein the third protocol is the MESI protocol and the fourth protocol is the MOESI protocol.
- the configuration topology further comprises one or more second nodes, the first node communicatively coupled to said one or more second nodes, wherein the SC function of the first node determines that one of the first protocol and second protocol is to be used in local cache coherency operations within the first node and that a different one of the first protocol and second protocol is to be used in global cache coherency operations involving said one or more second nodes.
- FIG. 1 illustrates one example of components of a node
- FIG. 2 illustrates one example of components of a drawer
- FIG. 3 illustrates one example of components of two drawers
- FIG. 4 illustrates one example of components of four drawers
- FIG. 5 depicts components of an example central processor (CP);
- FIG. 6 depicts components of an example SC
- FIG. 7 depicts an example single CP configuration
- FIG. 8 depicts an example multi-CP configuration
- FIG. 9 depicts an example CP interconnection
- FIGS. 10-12 depict example operations of embodiments.
- a hierarchy of caches is used to buffer memory cache lines in order to reduce the access time overhead of data in memory.
- a coherency mechanism is employed to assure that one processor's cached value is coherent. For example, if multiple processors had the ability to modify the same cache line at the same time, neither one would be assured that its copy was correct.
- the coherency mechanism gives all processors a cached view of memory that would be consistent were there no cache at all, only a single main store.
- Such cache coherency is provided in an SMP (Symmetric multi-processor) computer system.
- a “cluster” 500 of one or more processing units 501 to 506 (comprising one or more processor cores on a chip die for example) is the building block of different system structures.
- Each cluster may include a hierarchy of caches, for example local (level 1) caches (L1/L2) associated with respective processing units 501 to 506 , and a cluster shared cache (L3), shared by all processing units 501 to 506 of the cluster.
- Each cache system of a cluster may be configurable to manage coherency within the cluster 500 , or to participate with other clusters of the system to manage coherency among multiple clusters.
- Each cluster may be packaged as a single chip die, a module comprising multiple chip dies, or other packaging apparatus known in the art.
- Each cluster may be an identical configuration, or may be different configurations.
- each cluster may comprise six processor cores 501 to 506 .
- one cluster may comprise six processor cores and another cluster may comprise four processor cores 501 to 504 .
- a system may consist of a single cluster 700 .
- the cluster may be communicatively coupled 113 to memory subsystem 108 and an I/O subsystem 107 for attachment to peripheral or network devices.
- the single cluster cache coherency is managed by the single cluster.
- cache coherence is the consistency of shared resource data that ends up stored in multiple local caches.
- Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
- Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.
- In both level 2 behavior and level 3 behavior, a program can observe stale data. Recently, computer designers have come to realize that the programming discipline required to deal with level 2 behavior is sufficient to deal also with level 3 behavior. Therefore, at some point only level 1 and level 3 behavior will be seen in machines.
- Coherence defines the behavior of reads and writes to the same memory location.
- the coherence of caches is obtained if the following conditions are met:
- a write by processor P 1 may not be seen by a read from processor P 2 if the read is made within a very small time after the write has been made.
- the memory consistency model defines when a written value must be seen by a following read instruction made by the other processors.
- the data being shared is placed in a common directory that maintains the coherence between caches.
- the directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
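- As a minimal sketch of the directory mechanism just described, assuming an invalidate-on-write policy: the class name `Directory` and its methods are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Tracks which caches hold each line (illustrative names only)."""
    sharers: dict = field(default_factory=dict)  # line address -> set of cache ids

    def load(self, cache_id, addr):
        # A cache asks permission to load a line; the directory records it as a sharer.
        self.sharers.setdefault(addr, set()).add(cache_id)

    def write(self, cache_id, addr):
        # On a change, every other cache holding the line is invalidated.
        # (An update-based directory would push the new data instead.)
        others = self.sharers.get(addr, set()) - {cache_id}
        self.sharers[addr] = {cache_id}
        return sorted(others)  # caches that must be sent an invalidate
```

Only the caches recorded as sharers receive a message, which is what distinguishes this approach from a broadcast.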
- Snarfing is a mechanism where a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies a location in main memory.
- the cache controller updates its own copy of the snarfed memory location with the new data.
- Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems.
- Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors.
- the drawback is that snooping isn't scalable. Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow.
- Directories tend to have longer latencies (with a three hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast. For this reason, many of the larger systems (>64 processors) use this type of cache coherence.
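- The bandwidth trade-off above can be illustrated with a rough count. This sketch assumes that only request-path messages are counted (data transfers and snoop responses are ignored); the function name `coherence_messages` is hypothetical.

```python
def coherence_messages(n_caches, scheme):
    """Approximate messages needed to service one miss.

    A snooped request is seen by every other cache, while a directory
    uses three point-to-point hops: request, forward, respond.
    """
    if scheme == "snooping":
        return n_caches - 1
    if scheme == "directory":
        return 3
    raise ValueError(scheme)
```

Under these assumptions the snooping cost grows with the number of caches while the directory cost stays constant, which matches the scalability argument above.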
- a snoop filter reduces the snooping traffic by maintaining a plurality of entries, each representing a cache line that may be owned by one or more nodes.
- the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.
- a temporal or other type of algorithm is used to refine the selection if more than one cache line is owned by the fewest number of nodes.
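- The replacement choice described above might be sketched as follows. The tuple layout and the least-recently-used tie-break are assumptions made for illustration; the text only says that a temporal or other algorithm refines the selection.

```python
def select_victim(entries):
    """Pick the snoop filter entry to replace.

    entries: list of (presence_vector, last_used) tuples, where each set bit
    of presence_vector marks one node that may own the line.
    Returns the index of the entry owned by the fewest nodes; ties go to the
    entry with the oldest (smallest) last_used timestamp.
    """
    def owner_count(vector):
        return bin(vector).count("1")

    return min(
        range(len(entries)),
        key=lambda i: (owner_count(entries[i][0]), entries[i][1]),
    )
```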
- the MESI protocol (known also as Illinois protocol due to its development at the University of Illinois at Urbana-Champaign) is a widely used cache coherence and memory coherence protocol. It is the most common protocol which supports write-back cache.
- Every cache line is marked with one of the four following states (coded in two additional bits):
- Modified: the cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory.
- the cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.
- the write-back changes the line to the Exclusive state.
- Exclusive: the cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
- Shared: the cache line may be stored in other caches and is clean; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
- Invalid: the cache line is invalid (unused).
- In a typical system, several caches share a common bus to main memory. Each also has an attached central processing unit (CPU) which issues read and write requests.
- the caches' collective goal is to minimize the use of the shared main memory.
- a cache may satisfy a read from any state except Invalid.
- An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.
- a write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Request For Ownership (RFO).
- a cache may discard a non-Modified line at any time, changing to the Invalid state.
- a Modified line must be written back first.
- a cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other caches in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. retry later), then writing the data to main memory and changing the cache line to the Shared state.
- a cache that holds a line in the Shared state must listen for invalidate or request-for-ownership broadcasts from other caches, and discard the line (by moving it into Invalid state) on a match.
- a cache that holds a line in the Exclusive state must also snoop all read transactions from all other caches, and move the line to Shared state on a match.
- the Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system.
- the Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to Exclusive state. Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.
- Exclusive state is an opportunistic optimization: If the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
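- The MESI rules above can be exercised with a toy model of a single cache line shared by several caches on one bus. This is a simplified sketch: data movement and write-back timing are omitted, and the class name `MesiBus` is hypothetical.

```python
class MesiBus:
    """Toy snooping model of one cache line across n caches (illustrative only)."""

    def __init__(self, n):
        self.state = ["I"] * n  # every cache starts with the line Invalid

    def read(self, c):
        if self.state[c] != "I":
            return  # a read is satisfied from any state except Invalid
        others = [i for i, s in enumerate(self.state) if i != c and s != "I"]
        for i in others:
            # A Modified holder writes back, Exclusive and Shared holders
            # observe the read; all of them end up Shared.
            self.state[i] = "S"
        self.state[c] = "S" if others else "E"

    def write(self, c):
        if self.state[c] not in ("M", "E"):
            # Request For Ownership: invalidate every other copy first.
            for i in range(len(self.state)):
                if i != c:
                    self.state[i] = "I"
        self.state[c] = "M"
```

A write from the Exclusive state reaches Modified without touching the other caches, which is the bus transaction that the Exclusive state saves.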
- a variation of the MESI protocol is the MOESI protocol that includes an “owned” state: This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line.
- owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory.
- the cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory. Owned cache lines must respond to a snoop request with data.
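- The difference the Owned state makes can be sketched as the reaction of a cache holding the line when another cache reads it. The function name `on_snooped_read` and the returned tuple are illustrative assumptions.

```python
def on_snooped_read(state, protocol):
    """Return (new_state, wrote_back) for the holder of a line that another
    cache reads. protocol is "MESI" or "MOESI"."""
    if state == "M":
        if protocol == "MOESI":
            return "O", False  # keep the dirty data and supply it to the reader
        return "S", True       # MESI: write back to main memory, then share
    if state == "O":
        return "O", False      # the owner keeps answering snoops with data
    if state == "E":
        return "S", False      # clean line, no write-back needed
    return state, False        # Shared and Invalid holders are unaffected
```

Under MOESI the modified line is shared without a main memory update, which is the dirty sharing described above.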
- a system may consist of multiple clusters 800 to 802; the multiple clusters 800 to 802 may be communicatively coupled 113 to a shared memory 108 and I/O 107.
- the multiple clusters may include a shared L3 cache in one embodiment. Cache coherency may be managed cooperatively by the multiple clusters of the system.
- the system may comprise a plurality of clusters 109 to 111 and a storage control (SC) function 112 .
- the clusters are interconnected 100, 101, 105 with each other and communicatively coupled 102, 103, 104 with the SC 112.
- the SC 112 may include a higher level cache (L4).
- Clusters 109 to 111 may be communicatively coupled 113 with a shared memory subsystem 108 and an I/O subsystem 107 .
- an interconnected plurality of clusters 109 to 111 and SC 112 may constitute a “node”.
- the SC 112 includes an inclusive cache directory such that the SC 112 has knowledge of all lines of the node.
- the inclusive cache directory function is distributed amongst the processors and may be implemented by the combined non-inclusive cache directories of all clusters of the node.
- the SC 112 may determine whether a cache access of the node can be handled entirely within the node (such as a processor of the node requesting ownership of a cache line already owned by a processor of the node), or that a cache access of the node must interrogate other nodes (such as a processor of the node requesting ownership of a cache line, not currently owned by the node).
- the SC 112 function may be provided by cooperative logic of the processors 109 to 111 of the node.
- the processors cooperate to determine if an access to an external node is required, and the requesting processor may then control coherency accesses to other nodes on behalf of the node.
- the SC 112 function may be provided by dedicated logic, perhaps in a separate chip die of the node for example.
- a configuration may include computer storage media 114 .
- a plurality of nodes 201 , 202 may be packaged as a “drawer” 203 and interconnected by an S-Bus 200 .
- a system may comprise a plurality of drawers 309 , 310 , each node of a drawer 309 , 310 may be communicatively coupled by a respective S-Bus 301 , 302 within respective drawer.
- each node 305, 306 of a drawer 309 is communicatively coupled with a node 307, 308 of another drawer 310 by way of an A-Bus.
- Inter-node coherency communication may be performed in an embodiment using pass-thru and a combination of S-Bus and A-Bus links.
- nodes 424, 426, 428, 430 are communicatively coupled to other node(s) 425, 427, 429, 431 of the respective drawer via a corresponding on-drawer S-Bus 406, 407, 408, 419.
- nodes of a drawer may act as pass-thru paths for drawer to drawer communication.
- only one set of drawer-to-drawer A-Busses 400, 401, 402 is needed; the function of a second set of A-Busses 403, 404, 405 is provided by a node 424, 426, 428, 430 passing communications from the single set of A-Busses 400, 401, 402 of each drawer through the respective S-Busses 406, 407, 408, 419 to the other node(s) 425, 427, 429, 431.
- buses used to communicatively couple elements are provided for direct communication.
- each element has a direct link to each other element.
- one or more elements are communicatively coupled by providing a pass-thru function at an element to reduce the number of busses required.
- Buses may be unidirectional pairs ( FIG. 9 , CP 1 900 REC-X DRV-X pair for example), common bidirectional protocol or multi-drop busses for example.
- FIG. 9 depicts an example interconnection of clusters 900 , 901 , 902 and an SC 903 .
- Each element (chip die) has three pairs (X, Y and Z) of unidirectional buses (REC-X DRV-X, REC-Y DRV-Y and REC-Z DRV-Z).
- cache coherency amongst processors is performed across the cache hierarchy at the lowest level, to improve performance and minimize interference.
- only cache coherency operations of caches of a cluster that need access to caches of other clusters may access other clusters
- only cache coherency operations of caches of a node that need access to caches of other nodes may access other nodes
- only cache coherency operations of caches of a drawer that need access to caches of other nodes may access other nodes, for example.
- cache lines that are owned or shared by processors of an entity at a level of coherency are candidates to be handled at the level of coherency.
- if a processor of the entity requests ownership or sharing of a line that is already owned by a processor of the entity, the entity need not access other entities to handle the request coherently.
- likewise, if a processor of an entity requests sharing of a line owned or shared by a processor of the entity, the entity need not access other entities to handle the request.
- a request, for example by a processor of Node A (FIG. 4, 424), for sharing of a cache line is examined by the SC function to determine if the line is owned or shared by a processor of Node A.
- if it is, the SC handles the request within the node without accessing any other node. If the line is neither owned nor shared by a processor of Node A, the SC of Node A performs a cache coherency operation with other nodes 425 to 431.
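- The local-versus-global decision made by the SC function might be sketched as a lookup in the node's inclusive directory. The dictionary layout and the name `route_request` are hypothetical; the treatment of an ownership request for a line that is only shared in the node (sent to other nodes) is an assumption consistent with the cases listed above.

```python
def route_request(node_directory, line, kind):
    """Decide where a coherency request is handled.

    node_directory: dict mapping a line address to "owned" or "shared" for
                    lines held by processors of this node (inclusive view).
    kind:           "ownership" or "sharing".
    Returns "local" if the node can handle the request alone, else "global".
    """
    held = node_directory.get(line)  # None means no processor of the node holds it
    if kind == "sharing":
        return "local" if held in ("owned", "shared") else "global"
    if kind == "ownership":
        return "local" if held == "owned" else "global"
    raise ValueError(kind)
```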
- certain Node coherency operations may speculatively broadcast the request to other nodes before determining that the line is shared or owned by a processor of Node A. In that case, the other nodes may examine their coherency state for the cache line for example before being required to actually participate in the coherency operation. The speculative request may later be cancelled or time out if the requesting node discovers it is not needed.
- topology of a configuration is discovered by each cluster of a configuration, for example at configuration initialization time.
- signals may be transmitted over cluster interfaces in order to determine whether other clusters or support elements (SCs for example) are communicatively coupled to the signaling cluster. This may be performed by micro-code (firmware) of the cluster.
- the cluster may record a configuration value that represents the cluster's view of the topology.
- the cluster may record a bit-significant value wherein each bit represents the state of a corresponding interface.
- the configuration value represents the protocol to be used by the cluster.
- a single cluster configuration (FIG. 7, 700) would record a configuration value indicating no such interconnections are active.
- a cluster 800 (FIG. 8) of a multi-cluster configuration (800, 801, 802) would record a configuration value indicating interconnections 100, 101 to other clusters 801, 802 are active.
- a cluster ( FIG. 1, 109 ) of a multi-cluster configuration having an SC function 112 would record a configuration value indicating interconnections 100 , 101 , 102 to other clusters 110 , 111 and an SC 112 are active.
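- A bit-significant configuration value of the kind described above could be modeled as follows. The bit assignments and the names `Link` and `record_configuration` are assumptions for illustration; the text only states that each bit represents the state of a corresponding interface.

```python
from enum import IntFlag


class Link(IntFlag):
    # Hypothetical bit positions, one per off-cluster interface.
    CLUSTER_A = 0b001
    CLUSTER_B = 0b010
    SC = 0b100


def record_configuration(active_links):
    """Fold the interfaces that answered the discovery signal into one value."""
    value = Link(0)
    for link in active_links:
        value |= link
    return value
```

With these assumed bit positions, a single-cluster system records 0, a multi-cluster system without an SC records `0b011`, and a node with an SC records `0b111`.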
- an SC function ( FIG. 6, 600 ) of a node FIGS.
- a node 424 of a drawer 420 may determine that an S-Bus 406 is active to another node 425 of the drawer, and that three node A-Bus interfaces 400, 401, 402 are active to corresponding nodes 426, 428, 430 of three corresponding other drawers 421, 422, 423.
- a node 424 may also similarly determine that other nodes 427 , 429 , 431 of other drawers are communicatively coupled by a pass-thru function for example utilizing S-Bus 406 , 407 , 408 , 419 .
- for example, a signal sent from a node 424 via an S-Bus 406 to another node 425 of the drawer 420 may return a configuration value of the other node 425 indicating that other nodes 427, 429, 431 of respective other drawers 421, 422, 423 are communicatively coupled to the other node 425 and are accessible by a pass-thru function of the other node 425.
- the resulting configuration value of a node determines whether coherency of the node is to be determined by a first coherency protocol of a configuration having no SC function ( FIG. 8 ) or a second coherency protocol of a configuration having an SC function ( FIG. 1 ), the first coherency protocol being controlled solely by clusters of the node, the second coherency protocol being controlled entirely within the node by an SC 112 of the node.
- the resulting configuration value of the node may determine whether coherency of a node 305 is to be determined by a third coherency protocol of a configuration having another node(s) 306 in the same drawer, or a fourth coherency protocol of a configuration having other drawer(s) 310 having other node(s) 307 , 308 .
- a modern processor chip (silicon die) can therefore find itself being used in a variety of system configurations and it would be advantageous for the chip to be able to optimize the cache coherency protocol that is used based on the configuration it is in.
- a processor chip is described, but embodiments would be equally valuable in other packages, for example, a multi-chip module providing a processor function.
- a chip is capable of detecting the system configuration it is in and customizing the cache coherency protocol used in that configuration. This allows a variety of system structures, based on the same chip, to be built to better meet the requirements of an individual system.
- a fully configured SMP system may consist of four “drawers” 420 to 423, each of which may comprise up to two nodes (424 and 425, 426 and 427, 428 and 429, 430 and 431).
- Each node in the high-end configuration may consist of three CP chips 109 to 111 and one SC chip 112 as shown in FIG. 1 .
- An example of the L3 (FIG. 5) and L4 (FIG. 6) cache hierarchy, cache controllers and their key components and interactions is shown in FIGS. 5 and 6.
- the CP chip 500 may contain six processor cores 501 to 506, a shared L3 cache 508, ports to access Main Memory by way of a memory control system (MCS) 507 and IO ports by way of respective I/O complexes 519, 520.
- each CP chip 500 may include X-Bus interconnections A, B and C for direct access to up to two other CP chips 500 and an SC chip 600 (FIG. 6).
- the SC chip 600 may contain a fully shared L4 cache 601 and interconnections A, B and C for interconnection with three CP chips.
- the CP chips 500 and SC chip 600 on the same node communicate via the X-Bus interface (FIG. 1, 100 to 105).
- a node may consist of up to four chips connected in a point-to-point fully connected X-Bus topology. The number and type of chips connected to the X-Bus varies depending upon the system configuration.
- When the SC chip 600 is in the configuration, it may serve as the coherency point. However, because the CP chip contains all necessary connections and coherency maintenance functionality, the SC chip is not required to be installed in Low End configurations.
- the CP chip must support the coherency protocol.
- the same CP chip design could find itself in any of several possible system configurations:
- the chips detect the configuration of the system at startup. When the chips are initially powered on, each chip has no available off-chip interfaces to any other chip. As the hardware is being initialized, for example by micro-code (firmware used internally to perform tasks on behalf of hardware), the off-chip interfaces on the CP chips may be calibrated by firmware to allow for chip to chip communication. When an interface is successfully calibrated, firmware sets a status bit to indicate that the interface is now valid and available for chip to chip communication, and provides additional information as to what type of chip is connected to each valid interface. By analyzing the collective state of these valid indications, the L3 cache controller may determine which interfaces have active connections and the topology of the system, i.e. whether the CP chip is interfaced to other CP chips and, in an embodiment, an SC chip.
- the L3 cache controller may provide the coherency protocol and mechanism that is used for chip to chip communication.
- the chip may be the only one in the system (no other chips return a response) and so it only needs to manage coherency between the cores on the chip.
- the protocol may be provided so that one of the CP chips serves as the point of coherency in the system. The point of coherency may be determined on an individual line basis, based on the state of the cache line in each chip (the point of coherency for a particular cache line may be managed by one CP chip, and that for another cache line by another CP chip).
- the protocol may be provided for a multi-tiered protocol with the SC serving as a global point of coherency for the system and the CP chips serving as a local point of coherency for the node.
- a CP determines via the analysis of the interface valid indications that it is connected only to other CP chips, it knows that the system topology is that of a relatively small SMP structure. It may determine then that off-chip communication should be optimized for system performance and choose to use a snooping process for the coherency mechanism and a protocol such as MESI for managing the coherency between CP chips.
- the analysis of the interface valid indications indicates that the CP is connected to an SC chip, it knows that it is a part of a large SMP structure. In such a case a more scalable coherency mechanism, like a directory based mechanism, is more appropriate, and using a different protocol such as MOESI may be a better choice for managing the coherency between caches.
- each cluster determines at initialization time, whether the node of the cluster includes other clusters or an SC and records a corresponding coherency value and uses the recorded corresponding coherency value during normal operation to determine which interfaces are to be utilized. Furthermore, in an embodiment, the cluster utilizes an MESI protocol if no SC is present, but utilizes an MOESI protocol if an SC is present.
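- The selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the hardware described here: `ChipType`, `CoherencyChoice`, and `select_coherency` are hypothetical names, and the input is assumed to be the chip type that firmware reported for each successfully calibrated interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ChipType(Enum):
    CP = "CP"
    SC = "SC"


@dataclass(frozen=True)
class CoherencyChoice:
    """Hypothetical stand-in for the recorded coherency value."""
    topology: str
    mechanism: Optional[str]  # None when only on-chip coherency is needed
    protocol: Optional[str]


def select_coherency(valid_interfaces: List[ChipType]) -> CoherencyChoice:
    """Pick a mechanism and protocol from the valid-interface indications."""
    if not valid_interfaces:
        # No other chip responded: manage coherency between on-chip cores only.
        return CoherencyChoice("single chip", None, None)
    if ChipType.SC in valid_interfaces:
        # An SC chip is attached: part of a large SMP structure.
        return CoherencyChoice("large SMP", "directory", "MOESI")
    # Only other CP chips are attached: relatively small SMP structure.
    return CoherencyChoice("small SMP", "snooping", "MESI")
```

For example, `select_coherency([ChipType.CP, ChipType.CP])` yields the snooping/MESI choice, while any list containing `ChipType.SC` yields the directory/MOESI choice.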
- the SC function determines at initialization time whether the node of the SC includes other clusters and whether the node is communicatively coupled to other nodes (and therefore the configuration topology).
- the SC function of the node may use its recorded corresponding coherency value during normal operation to determine which interfaces are to be utilized, which local cache coherency (fabric) protocol should be used, and which global cache coherency (fabric) protocol should be used.
- analysis of the interface valid indications by an SC chip determines whether other nodes exist. If the node of the SC chip is the only node in the system, the overhead of inter-node coherency operations may be eliminated.
- coherency for a line that only affects the on-chip processing units (PUs) 501 to 506 of CPs 109 , 110 , 111 of a node may be managed without interfering with other nodes, greatly improving system performance. For example, if a PU of a node “owns” a cache line and another PU of the same node requests ownership, the request and exchange of ownership can be accomplished without perturbing other nodes.
- PUs on-chip processing units
- the physical X-Bus may be used for transmitting memory commands and memory data and may be divided into two logical buses.
- the first bus is a shared address and data bus (i.e. the primary bus).
- the second bus is the response sideband bus.
- Full chip to chip X-Bus connectivity is shown in FIG. 9 .
- Each chip of the node has a Drive, Receive pair of busses dedicated to another chip of the node.
- the X-Bus protocol sequence can be broken down into several discrete steps. All sequences on an example X-Bus may begin with these initial four steps:
- each remote chip may make a snoop pipe pass to check the state of the requested line in its cache.
- the snoop pipe pass may also check for resource availability and address contention with other requests.
- the results of this pipe pass, called the partial response, may then be broadcast to all other chips on the node.
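- The snoop pipe pass and partial response can be pictured with the following sketch. The structure is an assumption made for illustration: `PartialResponse`, `snoop_pipe_pass`, and `broadcast_targets` are hypothetical names, cache states are plain strings, and resource availability is reduced to a single counter.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class PartialResponse:
    chip: str
    line_state: str            # state of the requested line, "I" if not cached
    address_contention: bool   # another request is active on this address
    resources_available: bool  # the chip can accept the operation


def snoop_pipe_pass(chip: str, address: int, directory: Dict[int, str],
                    busy_addresses: Set[int], free_resources: int) -> PartialResponse:
    """One pipe pass on a remote chip: check line state, contention, resources."""
    return PartialResponse(
        chip=chip,
        line_state=directory.get(address, "I"),
        address_contention=address in busy_addresses,
        resources_available=free_resources > 0,
    )


def broadcast_targets(presp: PartialResponse, node_chips: List[str]) -> List[str]:
    """The partial response goes to every other chip on the node."""
    return [c for c in node_chips if c != presp.chip]
```

How a chip reacts to contention (for example, whether it signals a reject) depends on the configuration and is deliberately left out of this sketch.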
- the XBUS protocol may be different in the High End (HE) configuration which may contain an SC chip ( FIG. 1 ) and the Low End (LE) configuration ( FIG. 8 ) which may contain only CP chips.
- HE High End
- LE Low End
- the Local Intervention Master (LIM) L3 is preferably the primary serialization and coherency point on the node.
- the LIM L3 may be the L3 on the CP chip that will source data.
- the L3 on remote CP chips may send reject Partial Responses over the XBUS based on address compares detected during the snoop pipe pass on the remote CP chip. Upon receipt of these reject Partial Responses the requesting CP chip may at a later time retry the command broadcast onto the XBUS.
- the SC Chip 112 may be the serialization and coherency point on the node.
- the L3 of the CPs preferably will not reject any X-Bus requests in the HE configuration; its PRESP preferably will only contain the L3 directory results.
- the L3 preferably must wait for a non-reject CRESP before it can invalidate the L3 directory. If there is an address or resource conflict on the snoop, the L4 may send the SC reject PRESP.
- the L4 controller will preferably wait for the conflict to resolve and then send an FRESP to cause the requesting L3 to rebroadcast its request.
- the first request to the SC may load an address into a controller in the SC which may reject subsequent requests to the same address.
- the address may be protected by the L4 controller, which may remain valid for the duration of the operation.
- the L4 may reject requests from other L3s as well as protect the line from global fabric requests.
- the L3 controller may notify the L4 controller that it is safe to reset.
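- The address protection performed by the L4 controller can be sketched as a small table of protected addresses. This is a simplified model under assumptions: `AddressProtection`, `request`, and `reset` are hypothetical names, and the "accept"/"reject" strings merely stand in for the bus responses.

```python
class AddressProtection:
    """Toy model of a controller that protects one address per operation."""

    def __init__(self):
        self._owner = {}  # address -> requester currently being served

    def request(self, address, requester):
        """The first request loads the address; later requests are rejected."""
        if address in self._owner:
            return "reject"
        self._owner[address] = requester
        return "accept"

    def reset(self, address, requester):
        """Release protection once the owning requester says it is safe."""
        if self._owner.get(address) == requester:
            del self._owner[address]
            return True
        return False
```

A rejected requester would retry later; in this model a retry succeeds only after the owning requester has called `reset`.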
- requests for data may be sourced by the Local Intervention Master (LIM) chip on the node.
- LIM Local Intervention Master
- the L4 may not be fully inclusive, however it may keep track of all lines on the node.
- the L4 Non Inclusive Cache (NIC) Directory may be used to keep track of lines in the L3 that are not also in the L4. Therefore, in HE configurations, the LIM chip could be the SC or one of the CP chips, but the L4 is always the coherency point.
- local and/or remote L3 controllers may set LIM (local intervention master) pending and LMM (local memory master) pending to protect the address and enable address compares in order to protect line coherency.
- LIM local intervention master
- LMM local memory master
- the determination as to which CP chip will set the LIM pending may be determined by the local hit state and the combined hit states of the other two CP chips as determined by the Cresp (combination of the Presp states from other CP chips) and may be determined in parallel on all CP chips as a result of the fully connected X-Bus protocol.
- the determination as to which CP chip will set the LMM pending may be determined by which chip contains the target line's main memory location.
- the L3 controllers will not set LIM or LMM pending in the HE configuration; coherency may be maintained by address compares in the L4 cache controller.
- Another difference between the LE and HE protocol preferably involves the conditions for blocking the ingate of the local L3 miss controller. Ingate of this controller may only be allowed if all necessary remote resources are available. Only remote resources that are actually present in the configuration may be factored into the local L3 miss controller blocking logic. Therefore, if there is no SC chip in the configuration, the requesting CP chip (L3) does not need to check or wait for resource availability on the SC chip. Based on this indication in the configuration setting, the CP L3 X-Bus command launch hardware arbitration will skip the logic that performs the SC resource checking and possible transitions to wait states for SC resources.
- the CP L3 X-Bus command launch hardware will check the indication for whether or not there are one or two other CP chips in the system, and if not, skip the logic that performs the remote CP resource checking and possible transitions to wait states for remote CP resources.
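- The configuration-dependent ingate check can be sketched as follows. This is an illustrative model only: `may_ingate` is a hypothetical name, chips are identified by strings, and resource availability is reduced to one boolean per chip.

```python
from typing import Dict, Set


def may_ingate(present_remote_chips: Set[str], resources_free: Dict[str, bool]) -> bool:
    """Allow ingate only if every remote chip that is present has resources.

    Chips that are not part of the configuration are never consulted, so a
    missing SC chip (or missing CP chips) cannot block the miss controller.
    """
    return all(resources_free.get(chip, False) for chip in present_remote_chips)
```

With no remote chips present the check passes trivially, which mirrors skipping the resource-checking logic and its wait states altogether.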
- the broadcast rate onto the X-Bus may have a required restriction in High End configurations due to pipelining availability in the L4. This broadcast rate restriction can also be dynamically removed once it is determined that the current configuration does not include an L4/SC chip.
- one of a plurality of cache coherency protocols is performed in a computer system comprising one or more communicatively coupled computer elements 109 to 112 of a first node, each computer element 109 to 112 consisting of at least one of a cluster 500 or a storage controller (SC) function 600 , each cluster 500 comprising one or more processors 501 to 506 , the SC function comprising a shared cache, the shared cache being shared by processors of each cluster, the first node configured to perform a method, the method comprising: initiating ( FIG.
- the configuration topology consists of the first node 424 wherein the plurality of cache coherency protocols 1002 comprise 1007 a first protocol that does not use an SC function 600 to maintain cache coherency of the node 424 and a second protocol that uses an SC function 600 to maintain cache coherency of the first node 424 .
- the configuration topology consists of the first node 424 , the using 1004 the selected cache coherency protocol further comprising accessing an inclusive directory 1105 of the SC function 600 of the first node 424 to handle the cache coherency requests.
- the configuration topology further comprises one or more second nodes 425 to 427 , the first node 424 communicatively coupled to said one or more second nodes 425 , the using 1004 the selected cache coherency protocol further comprises:
- the plurality of cache coherency protocols 1007 comprise a third protocol and a fourth protocol to maintain cache coherency of the node 424 wherein the third protocol is the MESI protocol 1007 and the fourth protocol is the MOESI protocol 1007 .
- the SC function 600 determines 1202 that one of the first protocol 1007 or second protocol 1007 is to be used in local cache coherency operations within the node and that a different one of the first protocol 1007 or second protocol 1007 is to be used in global cache coherency operations.
- the configuration topology comprises a plurality of nodes 424 to 431 , wherein a first plurality of nodes 424 , 425 are communicatively coupled by an S-Bus 406 , the first plurality of nodes 424 , 425 communicatively coupled to other nodes 426 to 431 of the plurality of nodes by one or more A-Buses 400 to 405 wherein coherency operations are broadcast 1203 to all nodes by way of the S-Bus 406 and the one or more A-buses 400 to 405 .
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- RAM random access memory
- ROM read-only memory
- EPROM or Flash memory erasable programmable read-only memory
- SRAM static random access memory
- CD-ROM compact disc read-only memory
- DVD digital versatile disk
- memory stick a floppy disk
- a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the Figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Description
- The present invention is related to computer systems and more particularly to dynamically selecting cache coherency protocols of multi-processor systems.
- Modern computer packaging technology provides for a modular design that may be used in a variety of computer system products in conjunction with a computer memory. In one example, multiple processor cores may be packaged on a single module or chip die. In another example, multiple chip cores may be packaged with storage control function on a single module or chip die. In an embodiment, the multiple processor cores employ a cache hierarchy on the module or chip die. In one embodiment, only a single core of the module or chip die may be used in a computer system. In one embodiment, only a single such module or chip die may be used in a computer system. In one embodiment, multiple such modules or chip dies may be used in a computer system. Each embodiment may require a different cache coherency protocol to efficiently perform desired function and performance.
- U.S. Pat. No. 8,423,736 “MAINTAINING CACHE COHERENCE IN A MULTI-NODE, SYMMETRIC MULTIPROCESSING COMPUTER”, filed 2010 Jun. 16 and incorporated by reference herein, teaches “Maintaining cache coherence in a multi-node, symmetric multiprocessing computer, the computer composed of a plurality of compute nodes, including, broadcasting upon a cache miss by a first compute node a request for a cache line; transmitting from each of the other compute nodes to all other nodes the state of the cache line on that node, including transmitting from any compute node having a correct copy to the first node the correct copy of the cache line; and updating by each node the state of the cache line in each node, in dependence upon one or more of the states of the cache line in all the nodes.”
- U.S. Pat. No. 8,402,225 “METHOD FOR PERFORMING CACHE COHERENCY IN A COMPUTER SYSTEM”, filed 2010 Sep. 21 and incorporated by reference herein, teaches “in a computing system, cache coherency is performed by selecting one of a plurality of coherency protocols for a first memory transaction. Each of the plurality of coherency protocols has a unique set of cache states that may be applied to cached data for the first memory transaction. Cache coherency is performed on appropriate caches in the computing system by applying the set of cache states of the selected one of the plurality of coherency protocols.”
- U.S. Pat. No. 8,010,716 “METHODS AND APPARATUS FOR SUPPORTING MULTIPLE CONFIGURATIONS IN A MULTI-PROCESSOR SYSTEM”, filed 2010 Aug. 18 and incorporated by reference herein, teaches “methods and apparatus provide for interconnecting one or more multiprocessors and one or more external devices through one or more configurable interface circuits, which are adapted for operation in: (i) a first mode to provide a coherent symmetric interface; or (ii) a second mode to provide a non-coherent interface.”
- U.S. Patent Application Publication No 2004/0044850 titled “Method and Apparatus for the Synchronization of Distributed Caches” filed 2002 Aug. 28, incorporated herein by reference in its entirety, teaches a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output hub.
- The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a mechanism to detect the topology of a computer system and, based on the topology, instantiate a selected cache coherency protocol of a plurality of available coherency protocols.
- In an embodiment, one of a plurality of cache coherency protocols is selected and performed in a computer system comprising a first node of a configuration comprising one or more communicatively coupled nodes, each node comprising one or more communicatively coupled computer elements, each computer element consisting of at least one of a cluster or a storage controller (SC) function, each cluster comprising one or more processors, the SC function comprising a shared cache, the shared cache being shared by processors of each cluster, the method comprising: initiating communication, by each element of the configuration, to other elements of the configuration to determine configuration topology; based on a result of the initiated communication, storing a coherency value by each element; based on the coherency value, determining, by elements of the node, a cache coherency protocol of the plurality of cache coherency protocols to be used by the node; and using the determined cache coherency protocol to handle cache coherency requests.
- In an embodiment, the configuration topology consists of the first node, wherein the plurality of cache coherency protocols comprise a first protocol that does not use an SC function to maintain cache coherency of the first node and a second protocol that uses an SC function to maintain cache coherency of the first node.
- In an embodiment, the configuration topology consists of the first node, the using the selected cache coherency protocol further comprising accessing an inclusive directory of the SC function of the first node to handle the cache coherency requests.
- In an embodiment, the configuration topology further comprises one or more second nodes, the first node communicatively coupled to said one or more second nodes, the using the selected cache coherency protocol further comprises: accessing, by the first node, an inclusive directory of the SC function of the first node to handle the cache coherency requests; based on determining, by the first node, that the cache coherency operation can be performed solely by the first node, performing the cache coherency operation, by the first node, without accessing caches of the one or more second nodes; and based on determining, by the first node, that the cache coherency operation must access said one or more second nodes, performing the cache coherency operation by accessing caches of said one or more second nodes.
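- The local-versus-global decision in this embodiment can be sketched as a directory lookup. The rule used below is an assumption for illustration only: the text does not say how the node decides, so this sketch treats a line held in state "M" or "E" in the node's inclusive directory as resolvable locally. `nodes_to_access` and the node labels are hypothetical.

```python
from typing import Dict, List


def nodes_to_access(address: int, inclusive_directory: Dict[int, str],
                    second_nodes: List[str]) -> List[str]:
    """Return the nodes whose caches take part in the coherency operation."""
    state = inclusive_directory.get(address, "I")
    # Assumed rule: exclusive or modified ownership within the first node
    # means no other node can hold the line, so the operation stays local.
    if state in ("M", "E") or not second_nodes:
        return ["first"]
    return ["first"] + list(second_nodes)
```

When no second nodes exist the function always returns only the first node, which corresponds to the single-node topology of the preceding embodiment.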
- In an embodiment, the plurality of cache coherency protocols comprise a third protocol and a fourth protocol for maintaining cache coherency of the node wherein the third protocol is the MESI protocol and the fourth protocol is the MOESI protocol.
- In an embodiment, the configuration topology further comprises one or more second nodes, the first node communicatively coupled to said one or more second nodes, wherein the SC function of the first node determines that one of the first protocol and second protocol is to be used in local cache coherency operations within the first node and that a different one of the first protocol and second protocol is to be used in global cache coherency operations involving said one or more second nodes.
- System and computer program products corresponding to the above-summarized methods are also described and claimed herein.
- Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
- The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
- FIG. 1 illustrates one example of components of a node;
- FIG. 2 illustrates one example of components of a drawer;
- FIG. 3 illustrates one example of components of two drawers;
- FIG. 4 illustrates one example of components of four drawers;
- FIG. 5 depicts components of an example central processor (CP);
- FIG. 6 depicts components of an example SC;
- FIG. 7 depicts an example single CP configuration;
- FIG. 8 depicts an example multi-CP configuration;
- FIG. 9 depicts an example CP interconnection; and
- FIGS. 10-12 depict example operations of embodiments.
- The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
- In modern multi-processor computer systems, a hierarchy of caches is used to buffer memory cache lines in order to reduce the access time overhead of data in memory. In order to give all processors access to cached data, a coherency mechanism is employed to assure that each processor's cached value is coherent. For example, if multiple processors had the ability to modify the same cache line at the same time, neither one would be assured that its copy was correct. Thus, the coherency mechanism gives all processors a cached view of memory that would be consistent were there no cache at all, only a single main store. Such cache coherency is provided in an SMP (symmetric multi-processor) computer system. There are many possible protocols that can be used to perform cache coherency in a large SMP system. Depending on the SMP structure and system configuration, some protocols may be better tailored than others with regard to bus utilization and system performance. Different SMP structures may also be better suited for different system packaging or customer capacity requirements.
- Processor Cluster
- In embodiments (FIG. 5), a “cluster” 500 of one or more processing units 501 to 506 (comprising one or more processor cores on a chip die for example) is the building block of different system structures. Each cluster may include a hierarchy of caches, for example local (level 1) caches (L1/L2) associated with respective processing units 501 to 506, and a cluster shared cache (L3), shared by all processing units 501 to 506 of the cluster. Each cache system of a cluster may be configurable to manage coherency within the cluster 500, or to participate with other clusters of the system to manage coherency among multiple clusters. Each cluster may be packaged as a single chip die, a module comprising multiple chip dies, or other packaging apparatus known in the art. Each cluster may be an identical configuration, or may be different configurations. For a first example, each cluster may comprise six processor cores 501 to 506. In a second example, one cluster may comprise six processor cores and another cluster may comprise four processor cores 501 to 504.
- In an embodiment (FIG. 7), a system may consist of a single cluster 700. The cluster may be communicatively coupled 113 to memory subsystem 108 and an I/O subsystem 107 for attachment to peripheral or network devices. The single cluster cache coherency is managed by the single cluster.
- In computing, cache coherence is the consistency of shared resource data that ends up stored in multiple local caches.
- When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of central processing units (CPUs) in a multiprocessing system. For example, if a first client has a copy of a memory block from a previous read and a second client changes that memory block, the first client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
- In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.
- There are three distinct levels of cache coherence:
- 1. every write operation appears to occur instantaneously;
- 2. all processors see exactly the same sequence of changes of values for each separate operand; and
- 3. different processors may see an operation and assume different sequences of values; this is considered to be a non-coherent behavior.
- In both level 2 behavior and level 3 behavior, a program can observe stale data. Recently, computer designers have come to realize that the programming discipline required to deal with level 2 behavior is sufficient to deal also with level 3 behavior. Therefore, at some point only level 1 and level 3 behavior will be seen in machines.
- Coherence defines the behavior of reads and writes to the same memory location. The coherence of caches is obtained if the following conditions are met:
- 1. In a read made by a processor P to a location X that follows a write by the same processor P to X, with no writes of X by another processor occurring between the write and the read instructions made by P, X must always return the value written by P. This condition is related to program order preservation, and it must hold even in uniprocessor architectures.
- 2. A read made by a processor P1 to location X that happens after a write by another processor P2 to X must return the value written by P2 if no other writes to X made by any processor occur between the two accesses and the read and write are sufficiently separated. This condition defines the concept of a coherent view of memory. If processors can still read the old value after the write made by P2, the memory is incoherent.
- 3. Writes to the same location must be sequenced. In other words, if location X received two different values A and B, in this order, from any two processors, the processors can never read location X as B and then read it as A. The location X must be seen with values A and B in that order.
- These conditions are defined supposing that the read and write operations are made instantaneously. However, this doesn't happen in computer hardware given memory latency and other aspects of the architecture. A write by processor P1 may not be seen by a read from processor P2 if the read is made within a very small time after the write has been made. The memory consistency model defines when a written value must be seen by a following read instruction made by the other processors.
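- The third condition (write serialization) lends itself to a small check. The sketch below is illustrative only: `respects_write_order` is a hypothetical helper, it assumes every written value is distinct, and it assumes the reads it is given all returned one of the written values.

```python
from typing import List, Sequence


def respects_write_order(write_order: Sequence[str], observed: List[str]) -> bool:
    """Check one processor's reads of location X against the write order.

    write_order lists the written values in their serialized order;
    observed lists the values the processor read, in program order.
    The reads are acceptable if they never step back to an earlier write.
    """
    position = {value: i for i, value in enumerate(write_order)}
    latest_seen = -1
    for value in observed:
        index = position[value]
        if index < latest_seen:
            return False  # an older value was seen after a newer one
        latest_seen = index
    return True
```

With writes serialized as A then B, a processor that reads A, A, B passes the check, whereas one that reads B and then A violates the condition.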
- Rarely, and especially in algorithms, coherence can instead refer to the locality of reference.
- Directory-Based
- In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
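- A directory of this kind can be sketched as a map from each address to the set of caches holding it. This is a minimal illustration that implements only the invalidate variant; `Directory`, `load`, and `write` are hypothetical names.

```python
class Directory:
    """Toy directory tracking which caches hold each address."""

    def __init__(self):
        self.sharers = {}  # address -> set of cache ids holding the entry

    def load(self, cache_id, address):
        """A cache asks permission to load an entry and is recorded as a sharer."""
        self.sharers.setdefault(address, set()).add(cache_id)

    def write(self, cache_id, address):
        """On a change, return the other caches that must be invalidated."""
        others = self.sharers.get(address, set()) - {cache_id}
        self.sharers[address] = {cache_id}  # the writer is now the only holder
        return sorted(others)
```

An update variant would instead send the new data to the caches returned by `write` and keep them registered as sharers.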
- Snooping
- This is a process where the individual caches monitor address lines for accesses to memory locations that they have cached. It is called a write invalidate protocol when a write operation is observed to a location that a cache has a copy of and the cache controller invalidates its own copy of the snooped memory location.
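- The write-invalidate behavior can be sketched with caches that watch a shared bus. The model is deliberately simplified and its names (`SnoopingCache`, `Bus`, `snoop_write`) are hypothetical; real snooping happens in hardware on the address lines.

```python
class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def snoop_write(self, address):
        """Invalidate the local copy when another master writes the address."""
        self.lines.pop(address, None)


class Bus:
    def __init__(self, caches):
        self.caches = caches

    def write(self, writer, address, value):
        """Every cache except the writer snoops the write and invalidates."""
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value
```

After a write, only the writer still holds the line; the other caches must fetch it again on their next access.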
- Snarfing
- Snarfing is a mechanism where a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies a location in main memory. When a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snarfed memory location with the new data.
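- Snarfing differs from write-invalidate snooping in that the observing cache takes the new data instead of dropping its copy. A minimal sketch, with hypothetical names (`SnarfingCache`, `bus_write`) and main memory modeled as a dictionary:

```python
class SnarfingCache:
    def __init__(self):
        self.lines = {}  # address -> cached value

    def snarf(self, address, value):
        """Update the local copy, but only if this cache already holds the line."""
        if address in self.lines:
            self.lines[address] = value


def bus_write(memory, caches, address, value):
    """A master writes main memory; watching caches snarf address and data."""
    memory[address] = value
    for cache in caches:
        cache.snarf(address, value)
```

Caches that do not hold the location are left untouched, so snarfing never allocates new lines in this model.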
- Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems.
- The two most common mechanisms of ensuring coherency are snooping and directory-based, each having its own benefits and drawbacks. Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors. The drawback is that snooping isn't scalable. Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow. Directories, on the other hand, tend to have longer latencies (with a three hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast. For this reason, many of the larger systems (>64 processors) use this type of cache coherence.
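- As a rough illustration of the scaling argument above, the sketch below counts messages per cache miss. The counting model is an assumption, not taken from the text: a snooping request goes to each of the other nodes and each replies once, while a directory request always takes the three hops named above. The function names are hypothetical.

```python
def snoop_messages(n_nodes: int) -> int:
    """Messages for one miss under the assumed broadcast model."""
    others = n_nodes - 1
    return 2 * others  # one request to, and one response from, each other node


def directory_messages() -> int:
    """Messages for one miss via a directory: request, forward, respond."""
    return 3


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, snoop_messages(n), directory_messages())
```

Under these assumptions the snooping count grows linearly with the number of nodes while the directory count stays constant, which is the bandwidth trade-off the paragraph describes.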
- For the snooping mechanism, a snoop filter reduces the snooping traffic by maintaining a plurality of entries, each representing a cache line that may be owned by one or more nodes. When replacement of one of the entries is required, the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries. A temporal or other type of algorithm is used to refine the selection if more than one cache line is owned by the fewest number of nodes.
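- The replacement policy described above can be sketched as follows. This is a minimal illustration, assuming the presence vector is an integer bit mask and the temporal refinement is a least-recently-used tie-break; the names `FilterEntry`, `owner_count` and `select_victim` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FilterEntry:
    tag: int        # address tag of the tracked cache line (illustrative)
    presence: int   # presence vector: bit i set means node i may own the line
    last_used: int  # assumed timestamp used only for the temporal tie-break


def owner_count(entry: FilterEntry) -> int:
    """Number of nodes marked in the presence vector."""
    return bin(entry.presence).count("1")


def select_victim(entries: List[FilterEntry]) -> FilterEntry:
    """Pick the entry owned by the fewest nodes; break ties by oldest use."""
    return min(entries, key=lambda e: (owner_count(e), e.last_used))
```

Evicting the entry with the fewest owners keeps the number of caches that must be invalidated on replacement small.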
- The MESI protocol (known also as Illinois protocol due to its development at the University of Illinois at Urbana-Champaign) is a widely used cache coherence and memory coherence protocol. It is the most common protocol which supports write-back cache.
- Every cache line is marked with one of the four following states (coded in two additional bits):
- Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
- Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
- Shared: Indicates that this cache line may be stored in other caches of the machine and is “clean”; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time.
- Invalid: Indicates that this cache line is invalid (unused).
- In a typical system, several caches share a common bus to main memory. Each also has an attached central processing unit (CPU) which issues read and write requests. The caches' collective goal is to minimize the use of the shared main memory.
- A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.
- A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation known as Request For Ownership (RFO).
- A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.
- A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other caches in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. retry later), then writing the data to main memory and changing the cache line to the Shared state.
- A cache that holds a line in the Shared state must listen for invalidate or request-for-ownership broadcasts from other caches, and discard the line (by moving it into Invalid state) on a match.
- A cache that holds a line in the Exclusive state must also snoop all read transactions from all other caches, and move the line to Shared state on a match.
- The Modified and Exclusive states are always precise: i.e. they match the true cache line ownership situation in the system. The Shared state may be imprecise: if another cache discards a Shared line, this cache may become the sole owner of that cache line, but it will not be promoted to Exclusive state. Other caches do not broadcast notices when they discard cache lines, and this cache could not use such notifications without maintaining a count of the number of shared copies.
- In that sense the Exclusive state is an opportunistic optimization: If the CPU wants to modify a cache line that is in state S, a bus transaction is necessary to invalidate all other cached copies. State E enables modifying a cache line with no bus transaction.
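- The MESI rules listed above can be summarized as a small state machine. The sketch below is illustrative only: the function names are hypothetical, and the handling of a write to an Invalid line and of an ownership request that hits a Modified or Exclusive line are assumptions, since the text does not spell them out.

```python
from enum import Enum
from typing import Tuple


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def on_local_read(state: State, others_have_copy: bool) -> State:
    """A read is satisfied in any state except Invalid; a miss fetches to S or E."""
    if state is State.INVALID:
        return State.SHARED if others_have_copy else State.EXCLUSIVE
    return state


def on_local_write(state: State) -> Tuple[State, bool]:
    """Return (next_state, needs_rfo). M and E write with no bus transaction;
    S must invalidate other copies first. Treating I like S is an assumption."""
    needs_rfo = state in (State.SHARED, State.INVALID)
    return State.MODIFIED, needs_rfo


def on_snooped_read(state: State) -> State:
    """Another cache reads the line: M (after write-back) and E move to S."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        return State.SHARED
    return state


def on_snooped_rfo(state: State) -> Tuple[State, bool]:
    """Another cache requests ownership: discard the copy.
    Returns (next_state, writeback_needed); the M case is an assumption."""
    return State.INVALID, state is State.MODIFIED
```

For example, `on_local_write(State.EXCLUSIVE)` needs no Request For Ownership, which is the optimization the Exclusive state provides.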
- A variation of the MESI protocol is the MOESI protocol that includes an “owned” state: This cache is one of several with a valid copy of the cache line, but has the exclusive right to make changes to it. It must broadcast those changes to all other caches sharing the line. The introduction of owned state allows dirty sharing of data, i.e., a modified cache block can be moved around various caches without updating main memory. The cache line may be changed to the Modified state after invalidating all shared copies, or changed to the Shared state by writing the modifications back to main memory. Owned cache lines must respond to a snoop request with data.
- Processor Clusters
- In an embodiment (FIG. 8), a system may consist of multiple clusters 800 to 802. The multiple clusters 800 to 802 may be communicatively coupled 113 to a shared memory 108 and I/O 107. The multiple clusters may include a shared L3 cache in one embodiment. Cache coherency may be managed cooperatively by the multiple clusters of the system.
- In an embodiment (FIG. 1), the system may comprise a plurality of clusters 109 to 111 and a storage control (SC) function 112. The clusters may be interconnected 100, 101, 105 with each other and communicatively coupled 102, 103, 104 with the SC 112. The SC 112 may include a higher level cache (L4). Clusters 109 to 111 may be communicatively coupled 113 with a shared memory subsystem 108 and an I/O subsystem 107. In an embodiment, an interconnected plurality of clusters 109 to 111 and SC 112 may constitute a “node”. In one embodiment the SC 112 includes an inclusive cache directory such that the SC 112 has knowledge of all lines of the node. In an embodiment, the inclusive cache directory function is distributed amongst the processors and may be implemented by the combined non-inclusive cache directories of all clusters of the node. The SC 112 may determine whether a cache access of the node can be handled entirely within the node (such as a processor of the node requesting ownership of a cache line already owned by a processor of the node), or whether a cache access of the node must interrogate other nodes (such as a processor of the node requesting ownership of a cache line not currently owned by the node). In an embodiment, the SC 112 function may be provided by cooperative logic of the processors 109 to 111 of the node. In this embodiment, the processors cooperate to determine if an access to an external node is required, and the requesting processor may then control coherency accesses to other nodes on behalf of the node. In another embodiment the SC 112 function may be provided by dedicated logic, perhaps in a separate chip die of the node for example.
- A configuration (FIG. 1) may include computer storage media 114.
- In an embodiment (FIG. 2), a plurality of nodes Bus 200.
- In an embodiment (FIG. 3), a system may comprise a plurality of drawers drawer Bus drawer node drawer 310 by way of an A-Bus. Intra node coherency communication may be performed in an embodiment using pass-thru and a combination of S-Bus and A-Bus links. Preferably (FIG. 4), separate A-Buses 400 to 405 and 411 to 414 are provided between nodes 424 to 431 of each drawer of a configuration, and on drawer nodes Bus 406 407, 408, 419. In an embodiment, nodes of a drawer may act as pass-thru paths for drawer to drawer communication. In such an embodiment, only one set of drawer to drawer A-Busses are needed 400, 401, 402 and function provided by a second set of A-Busses node A-Busses Busses
- Preferably, buses used to communicatively couple elements (clusters, nodes, drawers) are provided for direct communication. Thus, each element has a direct link to each other element. In another embodiment, one or more elements are communicatively coupled by providing a pass-thru function at an element to reduce the number of busses required. Buses may be unidirectional pairs (FIG. 9, CP1 900 REC-X DRV-X pair for example), common bidirectional protocol or multi-drop busses for example.
- FIG. 9 depicts an example interconnection of clusters SC 903. Each element (chip die) has three pairs (X, Y and Z) of unidirectional buses (REC-X DRV-X, REC-Y DRV-Y and REC-Z DRV-Z). In this cluster configuration, each element has a direct connection with each of the other elements of the cluster.
- In an embodiment, cache coherency amongst processors is performed across the cache hierarchy at the lowest level, to improve performance and minimize interference. Thus, when possible, only cache coherency operations of caches of a cluster that need access to caches of other clusters may access other clusters; only cache coherency operations of caches of a node that need access to caches of other nodes may access other nodes; and only cache coherency operations of caches of a drawer that need access to caches of other nodes may access other nodes, for example.
- In an embodiment, cache lines that are owned or shared by processors of an entity at a level of coherency (cluster, node, drawer) are candidates to be handled at the level of coherency. Thus, if a processor of the entity requests ownership or sharing of a line that is already owned by a processor of the entity, the entity need not access other entities to handle the request coherently. Furthermore, if a processor of an entity requests sharing of a line owned or shared by a processor of the entity, the entity need not access other entities to handle the request. A request, for example, by a processor of node A (FIG. 4, 424), for sharing of a cache line is examined by the SC function to determine if the line is owned or shared by a processor of Node A. If it is already owned or shared, the SC handles the request within the node without accessing any other node. If the line is neither owned nor shared by a processor of Node A, the SC of Node A performs a cache coherency operation with other nodes 425 to 431. In an embodiment, in such a scenario, certain Node coherency operations may speculatively broadcast the request to other nodes before determining that the line is shared or owned by a processor of Node A. In that case, the other nodes may examine their coherency state for the cache line for example before being required to actually participate in the coherency operation. The speculative request may later be cancelled or time out if the requesting node discovers it is not needed.
- In an embodiment, the topology of a configuration is discovered by each cluster of a configuration, for example at configuration initialization time. When a cluster is powered on, signals may be transmitted over cluster interfaces in order to determine whether other clusters or support elements (SCs for example) are communicatively coupled to the signaling cluster. This may be performed by micro-code (firmware) of the cluster. Based on the signaling, the cluster may record a configuration value that represents the cluster's view of the topology. In one example, the cluster may record a bit-significant value wherein each bit represents the state of a corresponding interface. In another embodiment, the configuration value represents the protocol to be used by the cluster. Thus, a single cluster configuration (FIG. 7, 700) would record a configuration value indicating no such interconnections are active. A cluster (FIG. 8) of a multi-cluster configuration (800, 801, 802) would record a configuration value indicating interconnections other clusters FIG. 1, 109 ) of a multi-cluster configuration having an SC function 112 would record a configuration value indicating interconnections other clusters SC 112 are active. In an embodiment, an SC function (FIG. 6, 600) of a node (FIG. 1, 109 to 112) may record a configuration value indicating whether interconnections to other nodes (on-drawer SC chip 406) or to nodes of other drawers (off-drawer SC chips FIG. 4 ) having 2 nodes drawer 420, a node 424 of a drawer 420 may determine that an S-Bus node 425 of the drawer, and that three node A-Bus interfaces 400, 401, 402 are active to corresponding nodes other drawers node 424 may also similarly determine that other nodes Bus node 424 via an S-Bus 406 to another node 425 of the drawer 420 that returns a configuration value of the other node 425 indicating other nodes other drawers other node 425, and accessible by a pass-thru function of the other node 425.
- In an embodiment, the resulting configuration value of a node determines whether coherency of the node is to be determined by a first coherency protocol of a configuration having no SC function (FIG. 8) or a second coherency protocol of a configuration having an SC function (FIG. 1), the first coherency protocol being controlled solely by clusters of the node, the second coherency protocol being controlled entirely within the node by an SC 112 of the node.
- In an embodiment, the resulting configuration value of the node (FIG. 2, 305) may determine whether coherency of a node 305 is to be determined by a third coherency protocol of a configuration having another node(s) 306 in the same drawer, or a fourth coherency protocol of a configuration having other drawer(s) 310 having other node(s) 307, 308.
- A modern processor chip (silicon die) can therefore find itself being used in a variety of system configurations and it would be advantageous for the chip to be able to optimize the cache coherency protocol that is used based on the configuration it is in. As used herein, a processor chip is described, but embodiments would be equally valuable in other packages, for example, a multi-chip module providing a processor function.
- In embodiments, a chip is capable of detecting the system configuration it is in and customizing the cache coherency protocol that is used in that configuration as a result. This allows a variety of system structures to be built from the same chip, to better meet the requirements of an individual system.
- Referring to FIG. 4, in an embodiment, a fully configured SMP system may consist of four “drawers” 420 to 423, each of which may comprise up to two nodes 424 425, 426 427, 428 429, 430 431. Each node in the high-end configuration may consist of three CP chips 109 to 111 and one SC chip 112 as shown in FIG. 1.
- An example diagram of the L3 (FIG. 5) and L4 (FIG. 6) cache hierarchy, cache controllers and their key components and interactions may be provided, which may interact as shown in FIGS. 5 and 6. As shown for example, the CP chip 500 may contain six processor cores 501 to 506, a shared L3 cache 508, ports to access Main Memory by way of a memory control system (MCS) 507 and IO ports by way of respective I/O complexes. The CP chip 500 may include an X-Bus interconnection A B C for direct access to up to two other CP chips 500 and an SC chip (FIG. 6) 600 by way of an X-Bus 100 to 105. The SC chip 600 may contain a fully shared L4 cache 601 and interconnections A B and C for interconnection with three CP chips. The CP 500 chips and SC 600 chip on the same node communicate via the X-Bus interface (FIG. 1) 100 to 105. A node may consist of up to four chips connected in a point-to-point fully connected X-Bus topology. The number and type of chips connected to the X-Bus varies depending upon the system configuration. When the SC chip 600 is in the configuration, it may serve as the coherency point. However, because the CP chip contains all necessary connections and coherency maintenance functionality, the SC chip is not required to be installed in Low End configurations.
- If an SC chip is not present in the system, the CP chip must support the coherency protocol. In an embodiment, the same CP chip design could find itself in any of several possible system configurations:
-
- 1. The CP chip (FIG. 5) 500 could be the only chip in the system, as shown in FIG. 7, 700. In this configuration, cache lines that miss in the L3 cache may be fetched directly from Main Memory via the memory interface on the single CP chip 700 and lines that age out of the L3 cache 508 may be stored directly to the Main Memory in a similar manner. The on-chip L3 cache 508 may act as the coherency point and may manage ownership of the cache line.
- 2. The CP chip (FIG. 5) 500 could be in a system (FIG. 8) with one or more peer CP chips X-Bus interface other CP chips X-Bus
- 3. The CP chip (FIG. 5) 500 could be in a large system with 1 or more peer CP chips SC chip 112 and shown in FIG. 1. In this configuration, fetches that miss in the local L3 may be broadcast to the other CP chip(s) 109, 110, 111 and the SC chip 112. The SC chip contains a large L4 cache shared by all CP chips in the system. The L4 cache 601 may not be fully inclusive of all L3 caches on a node; however, the L4 cache controller (system coherency manager 601) may keep track of all lines on the node via the utilization of an L4 Non Inclusive Cache Directory, which maintains the state of lines in the L3 that are not also in the L4. In this configuration, on-chip L3 misses may be broadcast off chip to the other (1 or 2) CP chips 109, 110, 111 and the SC chip 112 over the X-Bus interface 100 to 105. The X-Bus interface provides direct point-to-point communication between the SC chip and each CP chip in the node. When an SC is present it acts as the coherency point for the system, thereby enabling effective coherency management across additional nodes and/or drawers. In such a system the L3 may still manage coherency across the local node, but it may rely on the SC to provide information about the state of the target line in the local L4 and (in multi-node/drawer configurations) the state of the target line in other L3 and L4 caches on other nodes and/or drawers. As a result, the L3 behavior and management sequences are different in multi-chip configurations that contain an SC chip than in multi-chip configurations that do not contain an SC chip.
- In an embodiment, the chips detect the configuration of the system at startup. When the chips are initially powered on, each chip has no available off-chip interfaces to any other chip. As the hardware is being initialized, by micro-code for example (firmware used internally to perform tasks on behalf of hardware), the off-chip interfaces on the CP chips may be calibrated by firmware to allow for chip to chip communication. When an interface is successfully calibrated, firmware sets a status bit to indicate that the interface is now valid and available for chip to chip communication, and provides additional information as to what type of chip is connected to each valid interface. By analyzing the collective state of these valid indications, the L3 cache controller may determine which interfaces have active connections and the topology of the system, i.e. whether the CP chip is interfaced to other CP chips and, in an embodiment, an SC chip.
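- The detection step described above might be sketched as follows. This is a minimal illustration under stated assumptions: the per-interface record (`Interface`), the chip-type and topology enumerations, and the function name `detect_topology` are all hypothetical, standing in for the status bits and chip-type information that firmware is said to provide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ChipType(Enum):
    CP = "CP"
    SC = "SC"


@dataclass
class Interface:
    valid: bool                       # status bit set after successful calibration
    peer: Optional[ChipType] = None   # type of chip on the far end, when valid


class Topology(Enum):
    SINGLE_CP = "single CP chip"
    PEER_CPS = "CP chips only"
    WITH_SC = "SC chip present"


def detect_topology(interfaces: List[Interface]) -> Topology:
    """Classify the configuration from the collective interface-valid state."""
    peers = [i.peer for i in interfaces if i.valid]
    if ChipType.SC in peers:
        return Topology.WITH_SC
    if ChipType.CP in peers:
        return Topology.PEER_CPS
    return Topology.SINGLE_CP
```

An interface that was never calibrated contributes nothing, so a chip with no valid interfaces classifies itself as the only chip in the system.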
- Based on this information the L3 cache controller may provide the coherency protocol and mechanism that is used for chip to chip communication. In the simplest case, the chip may be the only one in the system (no other chips return a response) and so it only needs to manage coherency between the cores on the chip. In the peer CP only case (more than one CP chip but no SC chip), based on the interface valid setting, the protocol may be provided so that one of the CP chips serves as the point of coherency in the system. The point of coherency may be determined on an individual line basis, based on the state of the cache line in each chip (the point of coherency for one cache line may be managed by one CP chip, and that for another cache line by another CP chip). In the systems with both CP and SC chips, a multi-tiered protocol may be provided, with the SC serving as a global point of coherency for the system and the CP chips serving as a local point of coherency for the node.
- For example, when a CP determines via the analysis of the interface valid indications that it is connected only to other CP chips, it knows that the system topology is that of a relatively small SMP structure. It may determine then that off-chip communication should be optimized for system performance and choose to use a snooping process for the coherency mechanism and a protocol such as MESI for managing the coherency between CP chips. Alternatively, if the analysis of the interface valid indications indicates that the CP is connected to an SC chip, it knows that it is a part of a large SMP structure. In such a case a more scalable coherency mechanism, like a directory based mechanism, is more appropriate, and using a different protocol such as MOESI may be a better choice for managing the coherency between caches.
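- The selection in this example can be sketched as a simple mapping from the detected topology to a mechanism and protocol. This is an illustration only: the function name, the string labels, and the way the single-chip case is encoded are assumptions, and the text presents MESI and MOESI as possible choices rather than fixed ones.

```python
from typing import NamedTuple, Optional


class CoherencyChoice(NamedTuple):
    mechanism: str
    protocol: Optional[str]


def choose_coherency(peer_cp_count: int, sc_present: bool) -> CoherencyChoice:
    """Pick an off-chip coherency mechanism from the interface-valid analysis."""
    if sc_present:
        # Part of a large SMP structure: prefer the more scalable mechanism.
        return CoherencyChoice("directory", "MOESI")
    if peer_cp_count > 0:
        # Relatively small SMP structure: optimize for performance.
        return CoherencyChoice("snooping", "MESI")
    # Single chip: only on-chip coherency between cores is needed (assumed encoding).
    return CoherencyChoice("on-chip only", None)
```

For instance, `choose_coherency(2, False)` yields the snooping mechanism with MESI, while any configuration with an SC chip yields the directory-based mechanism.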
- In an embodiment, each cluster determines at initialization time, whether the node of the cluster includes other clusters or an SC and records a corresponding coherency value and uses the recorded corresponding coherency value during normal operation to determine which interfaces are to be utilized. Furthermore, in an embodiment, the cluster utilizes an MESI protocol if no SC is present, but utilizes an MOESI protocol if an SC is present.
- In an embodiment, the SC function, if present in the node, determines at initialization time whether the node of the SC includes other clusters and whether the node is communicatively coupled to other nodes (and therefore the configuration topology). The SC function of the node may use its recorded corresponding coherency value during normal operation to determine which interfaces are to be utilized, which local cache coherency (fabric) protocol should be used, and which global cache coherency (fabric) protocol should be used.
- In an embodiment, analysis of the interface valid indications by an SC chip determines whether other nodes exist. If the node of the SC chip is the only node in the system, the overhead of inter-node coherency operations may be eliminated.
- Thus, coherency for a line that only affects the on-chip processing units (PUs) 501 to 506 of CPs
- In an embodiment, the physical X-Bus may be used for transmitting memory commands and memory data and may be divided into two logical buses. The first bus is a shared address and data bus (i.e. the primary bus). The second bus is the response sideband bus. Full chip to chip X-Bus connectivity is shown in FIG. 9. Each chip of the node has a Drive, Receive pair of busses dedicated to another chip of the node. The X-Bus protocol sequence can be broken down into several discrete steps. All sequences on an example X-Bus may begin with these initial four steps:
- 1. The requesting chip initiates the sequence by broadcasting the command and address to all remote chips on the node. Any chip on the node may be the requesting chip for a particular operation.
- 2. The command/address broadcast is received by each of the remote chips and a snoop pipe pass is made. The snoop pipe pass may generate a partial response (PRESP) and load snoop information in a controller on the remote chip.
- 3. Each remote chip may broadcast its PRESP to every other chip on the node (i.e. both the requesting chip and the other remote chips).
- 4. All chips on the node may collect all the PRESPs and then generate a combined response (CRESP) for the controller that is processing the sequence. Based on the CRESP, one of the chips on the node is established as the point of coherency for the line and is responsible for protecting the line for the duration of the sequence.
- At this point the sequence of responses varies depending on the type of operation and the CRESP. The example case of a fetch sent by the requesting chip with data sourced by a remote chip is described here:
-
- 5. The remote chip that sources the data may send a data response (DRESP) to the requesting chip. The remote chip may speculatively send the DRESP before the CRESP.
- 6. All remote chips may make any necessary directory updates.
- 7. All remote CP chips may send a final response (FRESP) to the requesting chip.
- 8. If the requesting chip is a CP, it may send a reset response (RRESP) to the SC to reset the SC controller.
- Upon receiving a command/address request, each remote chip may make a snoop pipe pass to check the state of the requested line in its cache. The snoop pipe pass may also check for resource availability and address contention with other requests. The result of this pipe pass, called the partial response, may then be broadcast to all other chips on the node.
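- The first four steps of the sequence (broadcast, snoop pipe pass, PRESP, CRESP) can be sketched as below. The response values and the combining rule are assumptions made for illustration, since the text does not enumerate them: a PRESP is taken to be one of miss, hit or reject, and the CRESP is reject if any chip rejects, otherwise hit if any chip hits. The `Chip` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical partial-response values; the source does not list them.
MISS, HIT, REJECT = "miss", "hit", "reject"


@dataclass
class Chip:
    cached: Set[int] = field(default_factory=set)  # addresses held in this chip's cache
    busy: Set[int] = field(default_factory=set)    # addresses under contention

    def snoop(self, addr: int) -> str:
        """Step 2: snoop pipe pass that yields a partial response (PRESP)."""
        if addr in self.busy:
            return REJECT
        return HIT if addr in self.cached else MISS


def combine(presps: Dict[str, str]) -> str:
    """Step 4: fold all PRESPs into a combined response (CRESP)."""
    values = list(presps.values())
    if REJECT in values:
        return REJECT
    return HIT if HIT in values else MISS


def broadcast(requester: str, chips: Dict[str, Chip], addr: int) -> str:
    """Steps 1 and 3: send the address to every remote chip and gather PRESPs."""
    presps = {name: chip.snoop(addr)
              for name, chip in chips.items() if name != requester}
    return combine(presps)
```

Because every chip sees every PRESP on a fully connected bus, each chip could run `combine` independently and reach the same CRESP, which is the property the sequence relies on.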
- The X-Bus protocol may be different in the High End (HE) configuration, which may contain an SC chip (FIG. 1), and the Low End (LE) configuration (FIG. 8), which may contain only CP chips.
- In the Low End (LE) configuration, the Local Intervention Master (LIM) L3 is preferably the primary serialization and coherency point on the node. The LIM L3 may be the L3 on the CP chip that will source data. In this configuration, the L3 on remote CP chips may send reject Partial Responses over the X-Bus based on address compares detected during the snoop pipe pass on the remote CP chip. Upon receipt of these reject Partial Responses the requesting CP chip may at a later time retry the command broadcast onto the X-Bus.
- In the High End (HE) configuration (FIG. 1), the SC chip 112 may be the serialization and coherency point on the node. The L3 of the CPs preferably will not reject any X-Bus requests in the HE configuration; its PRESP preferably will only contain the L3 directory results. The L3 preferably must wait for a non-reject CRESP before it can invalidate the L3 directory. If there is an address or resource conflict on the snoop, the L4 may send the SC reject PRESP. The L4 controller will preferably wait for the conflict to resolve and then send an FRESP to cause the requesting L3 to rebroadcast its request. The first request to the SC may load an address into a controller in the SC which may reject subsequent requests to the same address. The address may be protected by the L4 controller, which may remain valid for the duration of the operation. The L4 may reject requests from other L3s as well as protect the line from global fabric requests. At the end of the sequence, the L3 controller may notify the L4 controller that it is safe to reset.
- In both HE and LE configurations, requests for data may be sourced by the Local Intervention Master (LIM) chip on the node. There may always be a LIM chip on the node. The L4 may not be fully inclusive; however, it may keep track of all lines on the node. The L4 Non Inclusive Cache (NIC) Directory may be used to keep track of lines in the L3 that are not also in the L4. Therefore, in HE configurations, the LIM chip could be the SC or one of the CP chips, but the L4 is always the coherency point.
- In the LE configuration, local and/or remote L3 controllers may set LIM (local intervention master) pending and LMM (local memory master) pending to protect the address and enable address compares in order to protect line coherency. Which CP chip sets the LIM pending may be determined by the local hit state and the combined hit states of the other two CP chips, as given by the CRESP (the combination of the PRESP states from the other CP chips), and may be determined in parallel on all CP chips as a result of the fully connected X-Bus protocol. Which CP chip sets the LMM pending may be determined by which chip contains the target line's main memory location.
- The L3 controllers will not set LIM or LMM pending in the HE configuration; coherency may instead be maintained by address compares in the L4 cache controller.
- Another difference between the LE and HE protocol preferably involves the conditions for blocking the ingate of the local L3 miss controller. Ingate of this controller may only be allowed if all necessary remote resources are available. Only remote resources that are actually present in the configuration may be factored into the local L3 miss controller blocking logic. Therefore, if there is no SC chip in the configuration, the requesting CP chip (L3) does not need to check or wait for resource availability on the SC chip. Based on this indication in the configuration setting, the CP L3 X-Bus command launch hardware arbitration will skip the logic that performs the SC resource checking and possible transitions to wait states for SC resources. In a similar manner, the CP L3 X-Bus command launch hardware will check the indication for whether or not there are one or two other CP chips in the system, and if not, skip the logic that performs the remote CP resource checking and possible transitions to wait states for remote CP resources. Additionally, the broadcast rate onto the X-Bus may have a required restriction in High End configurations due to pipelining availability in the L4. This broadcast rate restriction can also be dynamically removed once it is determined that the current configuration does not include an L4/SC chip.
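- The blocking rule described above, in which only remote chips actually present in the configuration are factored into the ingate decision, can be sketched as follows. The function name, the chip labels and the dictionary of availability flags are hypothetical; real hardware would implement this as arbitration logic rather than software.

```python
from typing import Dict, Set


def can_ingate(present: Set[str], available: Dict[str, bool]) -> bool:
    """Allow the local L3 miss controller to ingate only when every remote
    chip that is present in the configuration reports resources available.
    Chips absent from the configuration are skipped entirely."""
    return all(available.get(chip, False) for chip in present)
```

With no SC chip in `present`, the state of SC resources never blocks a request, which mirrors the skipped checking logic in Low End configurations.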
- In an embodiment, one of a plurality of cache coherency protocols is performed in a computer system comprising one or more communicatively coupled
computer elements 109 to 112 of a first node, each computer element 109 to 112 consisting of at least one of a cluster 500 or a storage controller (SC) function 600, each cluster 500 comprising one or more processors 501 to 506, the SC function comprising a shared cache, the shared cache being shared by processors of each cluster, the first node configured to perform a method, the method comprising: initiating (FIG. 10) 1000 communication, by each element of the configuration, to other elements of the configuration to determine configuration topology; based on a result of the initiated communication, storing 1001 a coherency value 1006 by each element 109 to 112; based on the coherency value 1006, selecting 1003, by elements 109 to 112 of a node 424, a cache coherency protocol of the plurality of cache coherency protocols 1002 to be used by the node 424; and using 1004 the selected cache coherency protocol to handle cache coherency requests 1100.
- In an embodiment, the configuration topology consists of the first node 424 wherein the plurality of cache coherency protocols 1002 comprise 1007 a first protocol that does not use an SC function 600 to maintain cache coherency of the node 424 and a second protocol that uses an SC function 600 to maintain cache coherency of the first node 424.
- In an embodiment, the configuration topology consists of the first node 424, the using 1004 the selected cache coherency protocol further comprising accessing an inclusive directory 1105 of the SC function 600 of the first node 424 to handle the cache coherency requests.
- In an embodiment, the configuration topology further comprises one or more second nodes 425 to 427, the first node 424 communicatively coupled to said one or more second nodes 425, the using 1004 the selected cache coherency protocol further comprises:
- accessing 1103, by the first node 424, an inclusive directory 1005 of the SC function 600 of the first node 424 to handle the cache coherency requests;
- based on determining, by the first node, that the cache coherency operation can be performed solely by the first node, performing 1102 the cache coherency operation, by the first node, without accessing caches of the one or more second nodes; and
- based on determining, by the first node, that the cache coherency operation must access said one or more second nodes, performing 1103 the cache coherency operation by accessing caches of said one or more second nodes.
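The method steps above (topology discovery 1000, storing a coherency value 1001, protocol selection 1003, and request handling 1004) can be sketched in Python. This is an illustrative sketch only, not the claimed implementation: the `Node` class, the `Protocol` labels, the two-bit encoding of the coherency value, and the rule that a hit in the inclusive directory means the operation can complete inside the first node are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Protocol(Enum):
    # Hypothetical labels for the "first" and "second" protocols in the text.
    WITHOUT_SC = "does not use the SC function"
    WITH_SC = "uses the SC function and its inclusive directory"


@dataclass
class Node:
    name: str
    has_sc: bool                                    # is an SC function present?
    directory: dict = field(default_factory=dict)   # line address -> holding cluster
    peers: list = field(default_factory=list)       # other nodes found by discovery
    coherency_value: int = 0
    protocol: Protocol = Protocol.WITHOUT_SC

    def discover(self, responders):
        """Steps 1000/1001: record responders and store a coherency value."""
        self.peers = [n for n in responders if n != self.name]
        # Assumed encoding: bit 0 = SC function present, bit 1 = multi-node.
        self.coherency_value = int(self.has_sc) | (int(bool(self.peers)) << 1)

    def select_protocol(self):
        """Step 1003: pick a protocol from the stored coherency value."""
        uses_sc = bool(self.coherency_value & 1)
        self.protocol = Protocol.WITH_SC if uses_sc else Protocol.WITHOUT_SC
        return self.protocol

    def handle_request(self, address):
        """Step 1004: stay local on a directory hit, otherwise go to peers."""
        if self.protocol is Protocol.WITH_SC and address in self.directory:
            return ("local", self.directory[address])
        if self.peers:
            return ("remote", list(self.peers))
        return ("miss", None)
```

A short usage example under the same assumptions: a node with an SC function that discovers one peer stores the value 3, selects the SC-based protocol, resolves a line found in its directory locally, and forwards any other line to the peer.

```python
node = Node("node424", has_sc=True, directory={0x100: "cluster0"})
node.discover(["node424", "node425"])
node.select_protocol()
node.handle_request(0x100)   # resolved inside the first node
node.handle_request(0x200)   # must access the second node
```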
- In an embodiment, the plurality of cache coherency protocols 1007 comprise a third protocol and a fourth protocol to maintain cache coherency of the node 424 wherein the third protocol is the MESI protocol 1007 and the fourth protocol is the MOESI protocol 1007.
- In an embodiment, the SC function 600 determines 1202 that one of the first protocol 1007 or second protocol 1007 is to be used in local cache coherency operations within the node and that a different one of the first protocol 1007 or second protocol 1007 is to be used in global cache coherency operations.
- In an embodiment, the configuration topology comprises a plurality of nodes 424 to 431, wherein a first plurality of nodes Bus 406, the first plurality of nodes other nodes 426 to 431 of the plurality of nodes by one or more A-Buses 400 to 405 wherein coherency operations are broadcast 1203 to all nodes by way of the S-Bus 406 and the one or more A-buses 400 to 405.
- Referring to
FIG. 1, the present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions. These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Claims (15)
Priority Applications (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/548,988 US9886382B2 (en) | 2014-11-20 | 2014-11-20 | Configuration based cache coherency protocol selection |
US14/816,636 US9898407B2 (en) | 2014-11-20 | 2015-08-03 | Configuration based cache coherency protocol selection |
GB1709198.4A GB2547397B (en) | 2014-11-20 | 2015-11-13 | Configuration based cache coherency protocol selection |
DE112015004336.5T DE112015004336B4 (en) | 2014-11-20 | 2015-11-13 | Configuration-based selection of a cache coherence protocol |
PCT/IB2015/058778 WO2016079646A1 (en) | 2014-11-20 | 2015-11-13 | Configuration based cache coherency protocol selection |
US15/860,283 US10402328B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
US15/860,353 US10394712B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
US16/298,683 US10824565B2 (en) | 2014-11-20 | 2019-03-11 | Configuration based cache coherency protocol selection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/548,988 US9886382B2 (en) | 2014-11-20 | 2014-11-20 | Configuration based cache coherency protocol selection |
Related Child Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/816,636 Continuation US9898407B2 (en) | 2014-11-20 | 2015-08-03 | Configuration based cache coherency protocol selection |
US15/860,353 Continuation US10394712B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
US15/860,283 Continuation US10402328B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
Publications (2)
Publication Number | Publication Date |
---|---|
US20160147658A1 (en) | 2016-05-26 |
US9886382B2 (en) | 2018-02-06 |
Family ID: 56010348
Family Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/548,988 Expired - Fee Related US9886382B2 (en) | 2014-11-20 | 2014-11-20 | Configuration based cache coherency protocol selection |
US14/816,636 Expired - Fee Related US9898407B2 (en) | 2014-11-20 | 2015-08-03 | Configuration based cache coherency protocol selection |
US15/860,353 Expired - Fee Related US10394712B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
US15/860,283 Expired - Fee Related US10402328B2 (en) | 2014-11-20 | 2018-01-02 | Configuration based cache coherency protocol selection |
US16/298,683 Active 2035-01-02 US10824565B2 (en) | 2014-11-20 | 2019-03-11 | Configuration based cache coherency protocol selection |
Country Status (4)
Country | Link |
---|---|
US (5) | US9886382B2 (en) |
DE (1) | DE112015004336B4 (en) |
GB (1) | GB2547397B (en) |
WO (1) | WO2016079646A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9892043B2 (en) | 2014-11-20 | 2018-02-13 | International Business Machines Corporation | Nested cache coherency protocol in a tiered multi-node computer system |
CN107894914A (en) * | 2016-09-30 | 2018-04-10 | 华为技术有限公司 | Buffer consistency treating method and apparatus |
WO2018074933A1 (en) * | 2016-10-18 | 2018-04-26 | Numascale As | Programmable cache coherent node controller |
WO2018203754A1 (en) * | 2017-05-02 | 2018-11-08 | Numascale As | Cache coherent node controller for scale-up shared memory systems |
US20180341586A1 (en) * | 2017-05-26 | 2018-11-29 | International Business Machines Corporation | Dual clusters of fully connected integrated circuit multiprocessors with shared high-level cache |
US20180365070A1 (en) * | 2017-06-16 | 2018-12-20 | International Business Machines Corporation | Dynamic throttling of broadcasts in a tiered multi-node symmetric multiprocessing computer system |
US20190220410A1 (en) * | 2018-01-17 | 2019-07-18 | International Business Machines Corporation | Remote node broadcast of requests in a multinode data processing system |
US10394712B2 (en) | 2014-11-20 | 2019-08-27 | International Business Machines Corporation | Configuration based cache coherency protocol selection |
US10628312B2 (en) * | 2018-09-26 | 2020-04-21 | Nxp Usa, Inc. | Producer/consumer paced data transfer within a data processing system having a cache which implements different cache coherency protocols |
US20210311878A1 (en) * | 2020-04-03 | 2021-10-07 | Alibaba Group Holding Limited | Cache coherency for host-device systems |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10157133B2 (en) * | 2015-12-10 | 2018-12-18 | Arm Limited | Snoop filter for cache coherency in a data processing system |
US9900260B2 (en) | 2015-12-10 | 2018-02-20 | Arm Limited | Efficient support for variable width data channels in an interconnect network |
US9747222B1 (en) * | 2016-03-31 | 2017-08-29 | EMC IP Holding Company LLC | Dynamic ingestion throttling of data log |
US9990292B2 (en) | 2016-06-29 | 2018-06-05 | Arm Limited | Progressive fine to coarse grain snoop filter |
US10310982B2 (en) | 2016-12-15 | 2019-06-04 | International Business Machines Corporation | Target cache line arbitration within a processor cluster |
US10042766B1 (en) | 2017-02-02 | 2018-08-07 | Arm Limited | Data processing apparatus with snoop request address alignment and snoop response time alignment |
US10482015B2 (en) | 2017-05-18 | 2019-11-19 | International Business Machines Corporation | Ownership tracking updates across multiple simultaneous operations |
US11194753B2 (en) * | 2017-09-01 | 2021-12-07 | Intel Corporation | Platform interface layer and protocol for accelerators |
US10891228B2 (en) * | 2018-02-12 | 2021-01-12 | International Business Machines Corporation | Cache line states identifying memory cache |
US11010210B2 (en) | 2019-07-31 | 2021-05-18 | International Business Machines Corporation | Controller address contention assumption |
CN110704694B (en) * | 2019-09-29 | 2021-12-31 | 哈尔滨工业大学(威海) | Organization hierarchy dividing method based on network representation learning and application thereof |
US11163449B2 (en) | 2019-10-17 | 2021-11-02 | EMC IP Holding Company LLC | Adaptive ingest throttling in layered storage systems |
US11966330B2 (en) * | 2020-06-05 | 2024-04-23 | Intel Corporation | Link affinitization to reduce transfer latency |
US11449424B2 (en) | 2020-10-26 | 2022-09-20 | International Business Machines Corporation | Server recovery from a change in storage control chip |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080155145A1 (en) * | 2006-12-22 | 2008-06-26 | Siliconstor, Inc. | Discovery of a Bridge Device in a SAS Communication System |
Family Cites Families (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5752264A (en) * | 1995-03-31 | 1998-05-12 | International Business Machines Corporation | Computer architecture incorporating processor clusters and hierarchical cache memories |
US5895487A (en) * | 1996-11-13 | 1999-04-20 | International Business Machines Corporation | Integrated processing and L2 DRAM cache |
US6330643B1 (en) | 1998-02-17 | 2001-12-11 | International Business Machines Corporation | Cache coherency protocols with global and local posted operations |
US6192451B1 (en) * | 1998-02-17 | 2001-02-20 | International Business Machines Corporation | Cache coherency protocol for a data processing system including a multi-level memory hierarchy |
JP4123621B2 (en) | 1999-02-16 | 2008-07-23 | 株式会社日立製作所 | Main memory shared multiprocessor system and shared area setting method thereof |
US6697919B2 (en) | 2000-06-10 | 2004-02-24 | Hewlett-Packard Development Company, L.P. | System and method for limited fanout daisy chaining of cache invalidation requests in a shared-memory multiprocessor system |
US6751721B1 (en) | 2000-08-31 | 2004-06-15 | Hewlett-Packard Development Company, L.P. | Broadcast invalidate scheme |
US6918012B2 (en) | 2001-08-28 | 2005-07-12 | Hewlett-Packard Development Company, L.P. | Streamlined cache coherency protocol system and method for a multiple processor single chip device |
US7546422B2 (en) | 2002-08-28 | 2009-06-09 | Intel Corporation | Method and apparatus for the synchronization of distributed caches |
US7111128B2 (en) | 2002-12-19 | 2006-09-19 | Intel Corporation | Hierarchical virtual model of a cache hierarchy in a multiprocessor system |
US6954829B2 (en) | 2002-12-19 | 2005-10-11 | Intel Corporation | Non-speculative distributed conflict resolution for a cache coherency protocol |
US7272688B1 (en) * | 2003-04-28 | 2007-09-18 | Newisys, Inc. | Methods and apparatus for providing cache state information |
US7802058B1 (en) * | 2003-04-30 | 2010-09-21 | Silicon Graphics International | Method for performing cache coherency in a computer system |
US6988173B2 (en) | 2003-05-12 | 2006-01-17 | International Business Machines Corporation | Bus protocol for a switchless distributed shared memory computer system |
US7085898B2 (en) | 2003-05-12 | 2006-08-01 | International Business Machines Corporation | Coherency management for a “switchless” distributed shared memory computer system |
US7436775B2 (en) * | 2003-07-24 | 2008-10-14 | Alcatel Lucent | Software configurable cluster-based router using stock personal computers as cluster nodes |
US7085895B2 (en) | 2003-09-05 | 2006-08-01 | International Business Machines Corporation | Apparatus, system, and method flushing data from a cache to secondary storage |
US20050216637A1 (en) * | 2004-03-23 | 2005-09-29 | Smith Zachary S | Detecting coherency protocol mode in a virtual bus interface |
US7822929B2 (en) | 2004-04-27 | 2010-10-26 | Intel Corporation | Two-hop cache coherency protocol |
US20050240734A1 (en) | 2004-04-27 | 2005-10-27 | Batson Brannon J | Cache coherence protocol |
WO2006041218A2 (en) | 2004-10-15 | 2006-04-20 | Sony Computer Entertainment Inc. | Methods and apparatus for supporting multiple configurations in a multi-processor system |
US7404046B2 (en) * | 2005-02-10 | 2008-07-22 | International Business Machines Corporation | Cache memory, processing unit, data processing system and method for filtering snooped operations |
US7415030B2 (en) * | 2005-02-10 | 2008-08-19 | International Business Machines Corporation | Data processing system, method and interconnect fabric having an address-based launch governor |
US7395376B2 (en) * | 2005-07-19 | 2008-07-01 | International Business Machines Corporation | Method, apparatus, and computer program product for a cache coherency protocol state that predicts locations of shared memory blocks |
US7536514B2 (en) * | 2005-09-13 | 2009-05-19 | International Business Machines Corporation | Early return indication for read exclusive requests in shared memory architecture |
US8990501B1 (en) * | 2005-10-12 | 2015-03-24 | Azul Systems, Inc. | Multiple cluster processor |
US7904676B2 (en) | 2007-04-30 | 2011-03-08 | Hewlett-Packard Development Company, L.P. | Method and system for achieving varying manners of memory access |
US8539164B2 (en) | 2007-04-30 | 2013-09-17 | Hewlett-Packard Development Company, L.P. | Cache coherency within multiprocessor computer system |
US7669013B2 (en) | 2007-07-26 | 2010-02-23 | International Business Machines Corporation | Directory for multi-node coherent bus |
US7925857B2 (en) | 2008-01-24 | 2011-04-12 | International Business Machines Corporation | Method for increasing cache directory associativity classes via efficient tag bit reclaimation |
US8560776B2 (en) * | 2008-01-29 | 2013-10-15 | International Business Machines Corporation | Method for expediting return of line exclusivity to a given processor in a symmetric multiprocessing data processing system |
US8135916B1 (en) * | 2008-04-25 | 2012-03-13 | Marvell International Ltd. | Method and apparatus for hardware-configurable multi-policy coherence protocol |
US8055847B2 (en) | 2008-07-07 | 2011-11-08 | International Business Machines Corporation | Efficient processing of data requests with the aid of a region cache |
WO2010100679A1 (en) | 2009-03-06 | 2010-09-10 | 富士通株式会社 | Computer system, control method, recording medium and control program |
US8423736B2 (en) | 2010-06-16 | 2013-04-16 | International Business Machines Corporation | Maintaining cache coherence in a multi-node, symmetric multiprocessing computer |
US9477600B2 (en) | 2011-08-08 | 2016-10-25 | Arm Limited | Apparatus and method for shared cache control including cache lines selectively operable in inclusive or non-inclusive mode |
US8954639B2 (en) * | 2011-09-06 | 2015-02-10 | International Business Machines Corporation | Integrated link calibration and multi-processor topology discovery |
FR2992446B1 (en) * | 2012-06-22 | 2015-04-10 | Commissariat Energie Atomique | SYSTEM AND METHOD FOR DATA PROCESSING HAS COHERENCE MANAGEMENT IN A NETWORK OF PROCESSORS HAVING HIDDEN MEMORY. |
US20150186277A1 (en) * | 2013-12-30 | 2015-07-02 | Netspeed Systems | Cache coherent noc with flexible number of cores, i/o devices, directory structure and coherency points |
US9727464B2 (en) | 2014-11-20 | 2017-08-08 | International Business Machines Corporation | Nested cache coherency protocol in a tiered multi-node computer system |
US9886382B2 (en) | 2014-11-20 | 2018-02-06 | International Business Machines Corporation | Configuration based cache coherency protocol selection |
- 2014-11-20: US14/548,988 (US9886382B2), not active, Expired - Fee Related
- 2015-08-03: US14/816,636 (US9898407B2), not active, Expired - Fee Related
- 2015-11-13: PCT/IB2015/058778 (WO2016079646A1), active, Application Filing
- 2015-11-13: GB1709198.4A (GB2547397B), active
- 2015-11-13: DE112015004336.5T (DE112015004336B4), active
- 2018-01-02: US15/860,353 (US10394712B2), not active, Expired - Fee Related
- 2018-01-02: US15/860,283 (US10402328B2), not active, Expired - Fee Related
- 2019-03-11: US16/298,683 (US10824565B2), active
Non-Patent Citations (1)
Title |
---|
Nowatzyk et al. The s3.mp scalable shared memory multiprocessor. 1994. P. 144-153. Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences. IEEE. Website: <http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=323149>. * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10824565B2 (en) | 2014-11-20 | 2020-11-03 | International Business Machines Corporation | Configuration based cache coherency protocol selection |
US9892043B2 (en) | 2014-11-20 | 2018-02-13 | International Business Machines Corporation | Nested cache coherency protocol in a tiered multi-node computer system |
US10394712B2 (en) | 2014-11-20 | 2019-08-27 | International Business Machines Corporation | Configuration based cache coherency protocol selection |
US10402328B2 (en) | 2014-11-20 | 2019-09-03 | International Business Machines Corporation | Configuration based cache coherency protocol selection |
CN107894914A (en) * | 2016-09-30 | 2018-04-10 | 华为技术有限公司 | Buffer consistency treating method and apparatus |
WO2018074933A1 (en) * | 2016-10-18 | 2018-04-26 | Numascale As | Programmable cache coherent node controller |
US11157405B2 (en) | 2016-10-18 | 2021-10-26 | Numascale As | Programmable cache coherent node controller |
US20200050547A1 (en) * | 2016-10-18 | 2020-02-13 | Numascale As | Programmable cache coherent node controller |
WO2018203754A1 (en) * | 2017-05-02 | 2018-11-08 | Numascale As | Cache coherent node controller for scale-up shared memory systems |
US10956329B2 (en) | 2017-05-02 | 2021-03-23 | Numascale As | Cache coherent node controller for scale-up shared memory systems having interconnect switch between a group of CPUS and FPGA node controller |
US20180341586A1 (en) * | 2017-05-26 | 2018-11-29 | International Business Machines Corporation | Dual clusters of fully connected integrated circuit multiprocessors with shared high-level cache |
US10628314B2 (en) | 2017-05-26 | 2020-04-21 | International Business Machines Corporation | Dual clusters of fully connected integrated circuit multiprocessors with shared high-level cache |
US10628313B2 (en) * | 2017-05-26 | 2020-04-21 | International Business Machines Corporation | Dual clusters of fully connected integrated circuit multiprocessors with shared high-level cache |
US20180365070A1 (en) * | 2017-06-16 | 2018-12-20 | International Business Machines Corporation | Dynamic throttling of broadcasts in a tiered multi-node symmetric multiprocessing computer system |
US10713169B2 (en) * | 2018-01-17 | 2020-07-14 | International Business Machines Corporation | Remote node broadcast of requests in a multinode data processing system |
US20190220410A1 (en) * | 2018-01-17 | 2019-07-18 | International Business Machines Corporation | Remote node broadcast of requests in a multinode data processing system |
US10628312B2 (en) * | 2018-09-26 | 2020-04-21 | Nxp Usa, Inc. | Producer/consumer paced data transfer within a data processing system having a cache which implements different cache coherency protocols |
US20210311878A1 (en) * | 2020-04-03 | 2021-10-07 | Alibaba Group Holding Limited | Cache coherency for host-device systems |
US11188471B2 (en) * | 2020-04-03 | 2021-11-30 | Alibaba Group Holding Limited | Cache coherency for host-device systems |
Also Published As
Publication number | Publication date |
---|---|
US20190205251A1 (en) | 2019-07-04 |
GB2547397A (en) | 2017-08-16 |
WO2016079646A1 (en) | 2016-05-26 |
US9886382B2 (en) | 2018-02-06 |
US10402328B2 (en) | 2019-09-03 |
US10824565B2 (en) | 2020-11-03 |
GB2547397B (en) | 2017-11-29 |
DE112015004336T5 (en) | 2017-06-14 |
US20160147661A1 (en) | 2016-05-26 |
GB201709198D0 (en) | 2017-07-26 |
US20180121359A1 (en) | 2018-05-03 |
DE112015004336B4 (en) | 2022-03-10 |
US20180121358A1 (en) | 2018-05-03 |
US9898407B2 (en) | 2018-02-20 |
US10394712B2 (en) | 2019-08-27 |
Similar Documents
Publication | Title |
---|---|
US10824565B2 (en) | Configuration based cache coherency protocol selection |
US9892043B2 (en) | Nested cache coherency protocol in a tiered multi-node computer system |
US8806148B2 (en) | Forward progress mechanism for stores in the presence of load contention in a system favoring loads by state alteration |
EP3140749B1 (en) | In-memory lightweight coherency |
KR100318104B1 (en) | Non-uniform memory access (numa) data processing system having shared intervention support |
KR100308323B1 (en) | Non-uniform memory access (numa) data processing system having shared intervention support |
US11586542B2 (en) | Reducing cache transfer overhead in a system |
US8656115B2 (en) | Extending a cache coherency snoop broadcast protocol with directory information |
US8793442B2 (en) | Forward progress mechanism for stores in the presence of load contention in a system favoring loads |
US8762651B2 (en) | Maintaining cache coherence in a multi-node, symmetric multiprocessing computer |
US6266743B1 (en) | Method and system for providing an eviction protocol within a non-uniform memory access system |
CN115203071A (en) | Application of default shared state cache coherency protocol |
US10901902B2 (en) | Efficient inclusive cache management |
US10489292B2 (en) | Ownership tracking updates across multiple simultaneous operations |
US9558119B2 (en) | Main memory operations in a symmetric multiprocessing computer |
GB2500964A (en) | Forward progress mechanism for stores in the presence of load contention in a system favouring loads by state alteration. |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AMBROLADZE, EKATERINA M;SONNELITTER, ROBERT J, III;BERGER, DEANNA P;AND OTHERS;SIGNING DATES FROM 20141030 TO 20141110;REEL/FRAME:034226/0945 |
AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AMBROLADZE, EKATERINA M;SONNELITTER, ROBERT J, III;BURGER, DEANNA P;AND OTHERS;SIGNING DATES FROM 20141030 TO 20141110;REEL/FRAME:036345/0513 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20220206 |