KR101979697B1 - Scalably mechanism to implement an instruction that monitors for writes to an address - Google Patents

Scalably mechanism to implement an instruction that monitors for writes to an address

Info

Publication number
KR101979697B1
KR101979697B1 (application KR1020167005327A)
Authority
KR
South Korea
Prior art keywords
cache
core
address
monitor
processor
Prior art date
Application number
KR1020167005327A
Other languages
Korean (ko)
Other versions
KR20160041950A (en)
Inventor
Yen-Cheng Liu
Bahaa Fahim
Eric G. Hallnor
Jeffrey D. Chamberlain
Stephen R. Van Doren
Antonio Juan
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation
Priority claimed from PCT/US2014/059130 external-priority patent/WO2015048826A1/en
Publication of KR20160041950A
Application granted
Publication of KR101979697B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0831Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0833Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means in combination with broadcast means (e.g. for invalidation or updating)
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0846Cache with multiple tag or data arrays being simultaneously accessible
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A processor includes a cache-side address monitor unit corresponding to a first cache portion of a distributed cache and having a total number of cache-side address monitor storage locations that is less than the total number of logical processors of the processor. Each cache-side address monitor storage location is to store an address to be monitored. A core-side address monitor unit corresponds to a first core and has the same number of core-side address monitor storage locations as the number of logical processors of the first core. Each core-side address monitor storage location is to store an address and a monitor state for a different corresponding logical processor of the first core. A cache-side address monitor storage overflow unit corresponds to the first cache portion and enforces an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an address to be monitored.


Description

SCALABLY MECHANISM TO IMPLEMENT AN INSTRUCTION THAT MONITORS FOR WRITES TO AN ADDRESS

The embodiments described herein relate to processors. In particular, the embodiments described herein generally relate to processors operable to perform instructions that monitor for writes to an address.

Advances in semiconductor processing and logic design have allowed an increase in the amount of logic that can be included in processors and other integrated circuit devices. As a result, many processors now have multiple cores monolithically integrated on a single integrated circuit or die. The multiple cores generally allow multiple software threads or other workloads to be executed concurrently, which generally helps to increase execution throughput.

One challenge in such multi-core processors is that greater demands are often placed on the caches used to cache data and/or instructions from memory. For one thing, there is an ever-increasing demand for higher interconnect bandwidth to access data in these caches. One technique to help increase the interconnect bandwidth to a cache involves using a distributed cache. The distributed cache may include multiple physically separate or distributed cache slices or other cache portions. Such a distributed cache may allow parallel access to the different distributed portions of the cache through a shared interconnect.

Another challenge in such multi-core processors is providing thread synchronization with respect to shared memory. Operating systems commonly implement idle loops to handle thread synchronization with respect to shared memory. For example, there may be several busy loops that use a set of memory locations. A first thread may wait in a loop and poll a corresponding memory location. For example, the memory location may represent a work queue of the first thread, and the first thread may poll the work queue to determine whether there is available work to perform. In a shared memory configuration, exits from the busy loop often occur due to a state change associated with the memory location. These state changes are typically triggered by writes to the memory location by another component (e.g., another thread or core). For example, another thread or core may write to the work queue at the memory location to provide work to be performed by the first thread.

Certain processors (e.g., those available from Intel Corporation of Santa Clara, Calif.) may use MONITOR and MWAIT instructions to achieve thread synchronization with respect to shared memory. A hardware thread or other logical processor may use the MONITOR instruction to set up a linear address range to be monitored by a monitor unit and to arm or activate the monitor unit. The address may be provided through a general-purpose register. The address range is generally of the write-back caching type. The monitor unit will monitor and detect stores/writes to addresses within the address range, which will trigger the monitor unit.

The MWAIT instruction may follow the MONITOR instruction in program order and serve as a hint that allows the hardware thread or other logical processor to halt instruction execution and enter an implementation-dependent state. For example, the logical processor may enter a power-saving state. The logical processor may remain in that state until one of a set of qualifying events associated with the MONITOR instruction is detected. A write/store to an address in the address range armed by the preceding MONITOR instruction is one such qualifying event. In such cases, the logical processor may exit the state and resume execution at the instruction following the MWAIT instruction in program order.
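
As a minimal sketch of the busy-loop replacement described above, the following C fragment uses the _mm_monitor/_mm_mwait intrinsics (declared in <pmmintrin.h>) that compile to the MONITOR and MWAIT instructions. Note that MONITOR/MWAIT ordinarily require ring 0, so this pattern normally appears in operating-system idle or synchronization code; the work_queue variable and the re-check after arming are illustrative assumptions, not details taken from the patent text.

    #include <pmmintrin.h>   /* _mm_monitor, _mm_mwait (SSE3) */
    #include <stdint.h>

    volatile uint64_t work_queue;   /* shared location written by another thread */

    static void wait_for_work(void)
    {
        while (work_queue == 0) {
            /* Arm the monitor on the cache line containing work_queue. */
            _mm_monitor((const void *)&work_queue, 0 /* extensions */, 0 /* hints */);
            /* Re-check after arming to avoid missing a write that raced ahead. */
            if (work_queue != 0)
                break;
            /* Halt this logical processor until a write to the monitored
               line (or another qualifying event) triggers the monitor. */
            _mm_mwait(0 /* extensions */, 0 /* hints */);
        }
    }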

The invention may best be understood by reference to the following description and the accompanying drawings that are used to illustrate embodiments. In these drawings:
FIG. 1 is a block diagram of one embodiment of a processor.
FIG. 2 is a block diagram of one embodiment of a cache agent.
FIG. 3 is a diagram illustrating states of an embodiment of a monitor finite state machine.
FIG. 4 is a block diagram of one embodiment of overflow avoidance logic that can be operated to reuse a single cache-side address monitor storage location for multiple hardware threads and/or cores when monitor requests indicate the same address.
FIG. 5 is a block flow diagram of one embodiment of a method of attempting to identify a stale/outdated cache-side address monitor storage location, and of entering an overflow mode when no such stale/outdated storage location is found.
FIG. 6 is a block diagram of one embodiment of an overflow structure.
FIG. 7A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline in accordance with embodiments of the present invention.
FIG. 7B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the present invention.
FIG. 8A is a block diagram of a single processor core, along with its connection to an on-die interconnect network and its local subset of a level 2 (L2) cache, in accordance with embodiments of the present invention.
FIG. 8B is an expanded view of part of the processor core of FIG. 8A in accordance with embodiments of the present invention.
FIG. 9 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics, in accordance with embodiments of the present invention.
FIG. 10 is a block diagram of a system in accordance with an embodiment of the present invention.
FIG. 11 is a block diagram of a first, more specific exemplary system in accordance with an embodiment of the present invention.
FIG. 12 is a block diagram of a second, more specific exemplary system in accordance with an embodiment of the present invention.
FIG. 13 is a block diagram of an SoC in accordance with an embodiment of the present invention.
FIG. 14 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set in accordance with embodiments of the present invention.

Methods, apparatus, and systems for scalably implementing instructions to monitor for writes to addresses are disclosed herein. In the following description, numerous specific details are set forth (e.g., specific instructions, instruction functionality, processor configurations, microarchitectural details, sequences of operations, etc.). However, embodiments may be practiced without these specific details. In other instances, well known circuits, structures, and techniques have not been shown in detail in order to avoid obscuring the understanding of this description.

FIG. 1 is a block diagram of one embodiment of a processor 100. The processor may represent a physical processor, an integrated circuit, or a die. In some embodiments, the processor may be a general purpose processor (e.g., a general purpose microprocessor of the type used in desktop, laptop, and similar computers). Alternatively, the processor may be a special purpose processor. Examples of suitable special purpose processors include, but are not limited to, network processors, communication processors, cryptographic processors, graphics processors, co-processors, embedded processors, digital signal processors, and controllers (e.g., microcontrollers). The processor may be any of various complex instruction set computing (CISC) processors, various reduced instruction set computing (RISC) processors, various very long instruction word (VLIW) processors, various hybrids thereof, or other types of processors.

The processor is a multi-core processor having multiple processor cores 102. In the illustrated exemplary embodiment, the processor has eight cores, including core 0 (102-0) through core 7 (102-7) (collectively, cores 102). In other embodiments, the processor may include any other desired number of cores, for example, from two to hundreds, often from two to dozens (e.g., from about 5 to about 100). Each of the cores may have a single hardware thread or multiple hardware threads, or some cores may have a single hardware thread while other cores have multiple hardware threads. In one exemplary embodiment, each of the cores may have at least two hardware threads, although the scope of the invention is not so limited.

The term core refers to logic located on an integrated circuit that is capable of maintaining an independent architectural state (e.g., an execution state), in which the independently maintained architectural state is associated with dedicated execution resources. In contrast, the term hardware thread refers to logic located on an integrated circuit that is capable of maintaining an independent architectural state, in which the independently maintained architectural state shares access to the execution resources it uses. The boundary between a core and a hardware thread is less clear when certain resources are shared by an architectural state while others are dedicated to that architectural state. Nonetheless, cores and hardware threads are often viewed by an operating system as individual processing elements or logical processors. The operating system can generally schedule operations on each of the cores, hardware threads, or other logical processors or processing elements individually. In other words, a processing element or logical processor, in one embodiment, may represent any on-die processor logic that can be independently associated with code, such as a software thread, operating system, application, or other code, whether the execution resources are dedicated, shared, or some combination thereof. In addition to hardware threads and cores, other examples of logical processors or processing elements include, but are not limited to, thread units, thread slots, processing units, contexts, and/or any other logic capable of holding state and being independently associated with code.

The cores 102 are connected together by one or more on-die interconnects 112. Such an interconnect may be used to transfer messages and data between cores. It will be appreciated that many different types of interconnects are suitable. In one embodiment, a ring interconnect may be used. In alternate embodiments, a mesh, a torus, a crossbar, a hypercube, another interconnect structure, or a hybrid or combination of such interconnects may be used.

Each core may include local instruction and/or data storage, such as, for example, one or more lower levels of cache (not shown). For example, each core may include a corresponding lowest-level or level 1 (L1) cache closest to the core and, optionally, a next-closest mid-level or level 2 (L2) cache. The one or more lower levels of cache are referred to as lower level because they are physically and/or logically closer to their corresponding cores than the higher level cache(s). Each of the one or more levels of cache may cache data and/or instructions.

The cores 102 may also share a distributed top level cache 108. The distributed top level cache may represent physically distributed memories or cache portions. In the illustrated exemplary embodiment, the distributed cache includes a plurality (in this case, eight) of physically distributed cache portions 108-0 through 108-7 (collectively, cache portions 108), which are often referred to as cache slices. In other embodiments, the distributed cache may include fewer or more cache portions (e.g., the same number of distributed cache portions as the number of cores of the processor). The distributed cache portions may be shared by the different cores and/or threads. As shown, each cache portion may be more closely associated with one core and/or may optionally be physically located (e.g., co-located) closer on die to that core. For example, cache portion 108-0 may be more closely associated with core 0 (102-0) and/or may be physically located (e.g., co-located) closer on die to core 0.

In some embodiments, each cache portion may correspond to, or be mapped to, a mutually exclusive or non-overlapping range of memory addresses. For example, cache portion 108-0 may have an associated first set of addresses, cache portion 108-1 may have a different, second set of associated addresses, and so on. The address ranges may be partitioned or apportioned among the different cache portions of the distributed cache in a variety of different ways (e.g., using different hash functions or other algorithms). In some embodiments, the higher level shared cache may represent a last level cache (LLC) operable to store data and/or instructions, although this is not required. In some embodiments, the distributed cache (e.g., the LLC) may be inclusive of all lower levels of cache in the cache hierarchy, or of the next-highest level of cache in the cache hierarchy, although this is not required. In some embodiments, the cores may initially check the one or more lower level caches for data and/or instructions. If the requested data and/or instructions are not found in the one or more lower level caches, the cores may then proceed to check the shared distributed higher level cache.
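
Since the mapping of addresses to cache slices matters for the monitor mechanism described later, a minimal C sketch of such a mapping follows. The fold-and-modulo hash used here is purely an illustrative assumption; the text only says that some hash function or other algorithm selects a single slice per address, and real slice hashes are implementation specific.

    #include <stdint.h>
    #include <stdio.h>

    /* Map a physical address to one of num_slices cache slices. */
    static unsigned slice_for_address(uint64_t phys_addr, unsigned num_slices)
    {
        uint64_t line = phys_addr >> 6;                        /* drop 64-byte line offset */
        uint64_t folded = line ^ (line >> 12) ^ (line >> 24);  /* spread the bits */
        return (unsigned)(folded % num_slices);
    }

    int main(void)
    {
        /* Two nearby lines usually land on different slices. */
        printf("%u\n", slice_for_address(0x1000, 8));
        printf("%u\n", slice_for_address(0x1040, 8));
        return 0;
    }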

As shown, in some embodiments, a core interface (I/F) unit 104 may be coupled with each corresponding core 102. Each core interface unit may also be coupled with the interconnect 112. Each core interface unit may be operable to serve as an intermediary between the corresponding core and the distributed cache portions, as well as between the corresponding core and the other cores. As further shown, in some embodiments, a corresponding cache control unit 106 may be associated with each cache slice or other cache portion 108. In some embodiments, each cache control unit may be physically located substantially co-located with the corresponding cache slice and the corresponding core. Each cache control unit may be coupled with the interconnect 112. Each cache control unit may be operable to control and assist in providing cache coherency for the corresponding distributed cache portion. Each corresponding pair of a core interface unit 104 and a cache control unit 106 may be regarded as a core-cache portion interface unit operable to interface the corresponding core and the corresponding cache portion with the interconnect and/or with each other. The core interface units and cache control units may be implemented in hardware (e.g., integrated circuits, circuits, transistors, etc.), firmware (e.g., instructions stored in non-volatile memory), software, or a combination thereof.

The processor includes a first cache coherency aware memory controller 110-1 to couple the processor with a first memory (not shown), and a second cache coherency aware memory controller 110-2 to couple the processor with a second memory (not shown). In some embodiments, each cache coherency aware memory controller may include home agent logic operable to perform cache coherency and memory controller logic operable to interact with the memory. For simplicity, in the present description, such home agent and memory controller functionality is referred to as a cache coherency aware memory controller. Other embodiments may include fewer or more cache coherency aware memory controllers. Moreover, while in the illustrated embodiment the cache coherency aware memory controllers are on-die or on-processor, in other embodiments they may instead be off-die or off-processor (e.g., as one or more chipset components).

It should be understood that a processor may also include other components not required to understand various embodiments herein. For example, the processor may optionally include one or more of an interface to an input and / or output device, a system interface, a socket-to-socket interconnect, and the like.

As noted above, certain processors (such as those available from Intel Corporation) may use the MONITOR and MWAIT instructions to achieve thread synchronization with respect to shared memory. A hardware thread or other logical processor may use the MONITOR instruction to set up a linear address range to be monitored by a monitor unit and to arm or activate the monitor unit. The address may be provided through a general-purpose register (e.g., EAX). The address range is generally of the write-back caching type. The monitor unit will monitor and detect stores/writes to addresses within the address range, which will trigger the monitor unit. Other general-purpose registers (e.g., ECX and EDX) may be used to communicate other information to the monitor unit. The MWAIT instruction may follow the MONITOR instruction in program order and serve as a hint that allows the hardware thread or other logical processor to halt instruction execution and enter an implementation-dependent state. For example, the logical processor may enter a sleep state, a power-saving C-state, or another idle state. The logical processor may remain in that state until one of a set of qualifying events associated with the MONITOR instruction is detected. A write/store to an address in the address range armed by the preceding MONITOR instruction is one such qualifying event. In such cases, the logical processor may exit the state and resume execution at the instruction following the MWAIT instruction in program order. General-purpose registers (e.g., EAX and ECX) may be used to communicate other information (e.g., information about the state to be entered) to the monitor unit.

FIG. 2 is a block diagram of one embodiment of a cache agent 216. In some embodiments, the cache agent may be used in the processor of FIG. 1. It should be understood, however, that the cache agent of FIG. 2 can also be used with processors different from the processor of FIG. 1.

The cache agent 216 includes a core 202 and a cache portion 208. In some embodiments, the core may be one of multiple cores of a multi-core processor. In some embodiments, the cache portion may be one of multiple cache slices or other cache portions of a distributed cache (e.g., a distributed LLC). The cache agent also includes a core interface unit 204 and a cache portion control unit 206. The core is coupled with the interconnect 212 through the core interface unit. The cache portion is coupled with the interconnect through the cache portion control unit. The core interface unit is coupled between the core and the cache portion control unit. The cache portion control unit is coupled between the core interface unit and the cache portion. The core, cache portion, core interface unit, and cache portion control unit may optionally be similar to, or the same as, the correspondingly named components of FIG. 1. In this particular example, the core is a multi-threaded core including a first hardware thread 218-1 and a second hardware thread 218-2, although the scope of the invention is not so limited. In other embodiments, the core may be single-threaded or may have more than two hardware threads.

The cache agent 216 includes a monitor mechanism operable to implement monitor instructions (e.g., the MONITOR instruction) used to monitor for writes to one or more addresses (e.g., the address range indicated by the MONITOR instruction). The mechanism may utilize or leverage an existing cache coherency mechanism (e.g., it may use the communication of an intent to write to an address conveyed over the cache coherency mechanism). In the illustrated embodiment, the monitor mechanism includes a cache-side address monitor unit 226, a core-side address monitor unit 220, a core-side trigger unit 234, and a cache-side storage overflow unit 236. The term "core-side," as used herein, refers to being located on the same side of the interconnect 212 as the core 202, and/or being disposed between the core and the interconnect, and/or being closest to the core. Likewise, the term "cache-side" refers to being located on the same side of the interconnect 212 as the cache portion 208, and/or being disposed between the cache portion and the interconnect, and/or being closest to the cache portion.

In the illustrated embodiment, the cache-side address monitor unit 226 and the cache-side storage overflow unit 236 are both implemented in the cache portion control unit 206, but this is not required. In other embodiments, one or more of these units may be implemented as a separate cache-side component (e.g., coupled to a cache control unit and / or cache portion). Similarly, in the illustrated embodiment, both the core-side address monitor unit 220 and the core-side trigger unit 234 are implemented in the core interface unit 204, but this is not required. In other embodiments, one or more of these units may be implemented as separate core-side components (e.g., coupled to the core interface unit and / or core).

The cache-side address monitor unit 226 corresponds to the cache portion 208, which is a slice or other portion of the distributed cache. The cache-side address monitor unit has a number of different cache-side address monitor storage locations 228. As shown, each cache-side address monitor storage location may be used to store an address 230 to be monitored for writes. In some embodiments, each cache-side address monitor storage location may also store an indication of the core(s) associated with the address (e.g., a core identifier, a core mask with a different bit corresponding to each different core, etc.). For example, the storage locations may represent different entries in a hardware-implemented table. As shown, in the illustrated embodiment, there may be a first cache-side address monitor storage location 228-1 through an Nth cache-side address monitor storage location 228-N, where N may be any number appropriate for the particular implementation.

In some embodiments, the total number of cache-side address monitor storage locations in the cache-side address monitor unit corresponding to the cache portion may depend on the total number of hardware threads (or other logical processors) of the processor and/or socket. In some embodiments, each hardware thread (or other logical processor) may use a monitor instruction (e.g., the MONITOR instruction) to monitor a single address or a single range of addresses. In some cases, after using such a monitor instruction, the hardware thread may be put to sleep or placed in another power saving state. One possible approach would be to provide enough cache-side address monitor storage locations 228 to store the address to be monitored for every hardware thread (or other logical processor). However, when a distributed cache is used, each address may hash or otherwise map to only a single corresponding cache slice or other cache portion. For example, a hash of the address may select, according to a particular hash function, the single corresponding cache slice for that address. Thus, when such a distributed cache is used, there is a chance, although generally a very small chance, that all of the addresses to be monitored for all of the hardware threads (or other logical processors) will hash or otherwise map to the same single cache slice or other cache portion.

To account for this possibility, one possible approach would be to provide, for each cache slice or other cache portion, a number of cache-side address monitor storage locations 228 equal to the total number of hardware threads of the processor and/or socket. For example, in an eight-core processor in which each core has two hardware threads, a total of 16 cache-side address monitor storage locations (i.e., the number of cores multiplied by the number of hardware threads per core) may be provided for each cache slice or other cache portion. For example, a hardware-implemented table may be included having a number of entries equal to the total number of hardware threads. In some cases, each storage location may have a fixed mapping or assignment to a corresponding hardware thread. This allows an address to be monitored to be stored for every hardware thread, and accommodates the possibility that all of these addresses may map to the same cache portion and therefore need to be stored locally for that cache portion. This approach provisions for what is essentially a worst-case scenario, which is generally quite unlikely but so far could not be ignored, because otherwise the mechanism could not handle the scenario if it actually occurred.

One drawback of this approach is that it tends to be relatively unscalable as the number of hardware threads (or other logical processors) and/or the number of cache portions increases. Increasing the number of hardware threads increases the number of storage locations needed for each cache portion. Moreover, increasing the number of cache portions adds a further set of these storage locations for each additional cache portion. Processors may have, for example, more than 32 threads, 36 threads, 40 threads, 56 threads, 128 threads, or 256 threads. It is easy to see that the amount of storage can become significant when such large numbers of threads are used. Such significant amounts of storage tend to increase the manufacturing cost of the processor, the amount of on-die area needed to provide the storage, and/or the power consumed by the storage.
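
To make the scaling concern concrete, the short C program below computes the fully provisioned (worst-case) storage for a few thread counts in the range mentioned above. The assumptions that each core has two hardware threads and that the number of cache slices equals the number of cores are illustrative only.

    #include <stdio.h>

    int main(void)
    {
        unsigned threads_per_core = 2;
        unsigned core_counts[] = { 8, 28, 64, 128 };   /* 16 to 256 threads */

        for (unsigned i = 0; i < 4; i++) {
            unsigned cores   = core_counts[i];
            unsigned threads = cores * threads_per_core;
            unsigned slices  = cores;                  /* assume one slice per core */
            /* Fully provisioned: one entry per hardware thread, per slice. */
            unsigned long total_entries = (unsigned long)threads * slices;
            printf("%3u threads: %6lu entries total across %3u slices\n",
                   threads, total_entries, slices);
        }
        return 0;
    }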

As an alternative approach, in some embodiments, the total number of cache-side address monitor storage locations 228 in the cache-side address monitor unit 226 corresponding to the cache portion 208 may be less than the total number of hardware threads (or other logical processors) of the processor and/or socket. That is, there may be fewer address monitor storage locations than strictly needed to completely avoid the possibility of an address monitor storage overflow. In some embodiments, each cache portion may have a corresponding number of address monitor storage locations sufficient to avoid overflow most of the time, but not sufficient to completely prevent overflow in all instances. In some embodiments, the total number of cache-side address monitor storage locations per cache portion may be significantly less than the total number of hardware threads of a processor having a relatively large number of hardware threads. In some embodiments, the processor may have more than about 40 hardware threads, and the total number of cache-side address monitor storage locations per cache portion may be less than 40 (e.g., from about 20 to about 38). In some embodiments, the processor may have more than 50 hardware threads, and the total number of cache-side address monitor storage locations per cache portion may be less than about 50 (e.g., from about 20 to about 45, from about 25 to about 40, or from about 30 to about 40). In some embodiments, instead of assigning or allocating cache-side address monitor storage locations to specific hardware threads, the storage locations may not correspond to any particular hardware thread; rather, any storage location may be used by any hardware thread. Advantageously, having a total number of cache-side address monitor storage locations in the cache-side address monitor unit that is less than the total number of hardware threads (or other logical processors) of the processor and/or socket can potentially help provide a more scalable solution for implementing monitor instructions (e.g., the MONITOR instruction). It should be understood, however, that the embodiments disclosed herein have utility regardless of the number of hardware threads and/or cores, and/or the total amount of storage, whether large or small.

Referring again to FIG. 2, the cache agent includes a core-side address monitor unit 220 that corresponds to the core 202. The core-side address monitor unit has the same number of core-side address monitor storage locations as the number of one or more hardware threads of the corresponding core. In the illustrated embodiment, the first core-side address monitor storage location 221-1 has a fixed correspondence to the first hardware thread 218-1, and the second core-side address monitor storage location 221-2 has a fixed correspondence to the second hardware thread 218-2. In other embodiments, other numbers of threads and storage locations may be used. Each core-side address monitor storage location is operable to store an address 222-1, 222-2 to be monitored for the corresponding hardware thread 218-1, 218-2 of the corresponding core. When such a fixed correspondence exists, storing the address in the storage location associates the address with the corresponding hardware thread. In other embodiments, if there is no fixed correspondence between storage locations and hardware threads, each storage location may store an indication (e.g., a hardware thread identifier) of the hardware thread corresponding to the address to be monitored. In some embodiments, each core-side address monitor storage location is also operable to store a monitor state 224-1, 224-2 for the corresponding hardware thread 218-1, 218-2 of the corresponding core. In some embodiments, each monitor state may represent a monitor finite state machine (FSM). In some embodiments, in the case of the MONITOR instruction, the monitor state may be any one of an idle state, a speculative (e.g., monitor-loaded) state, and a ready-to-trigger (e.g., wait2trigger) state, although the scope of the invention is not limited in this respect.
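
A minimal C sketch of the two kinds of storage locations just described is shown below. The field names, widths, and the 64-core limit implied by a 64-bit core mask are illustrative assumptions, not details taken from the text.

    #include <stdint.h>

    /* Monitor FSM state kept per core-side entry (see FIG. 3). */
    enum monitor_state { MON_IDLE, MON_LOADED, MON_WAIT2TRIGGER };

    /* Core-side entry: one per hardware thread of the core, with a fixed
       correspondence between entry index and hardware thread. */
    struct core_side_entry {
        uint64_t monitored_addr;     /* address (range) being monitored       */
        enum monitor_state state;    /* idle / monitor-loaded / wait2trigger  */
    };

    /* Cache-side entry: a pool of N entries per cache slice, fewer than the
       total number of hardware threads; any entry can serve any thread. */
    struct cache_side_entry {
        uint64_t monitored_addr;     /* address being monitored for writes    */
        uint64_t core_mask;          /* bit i set => core i has a pending
                                        monitor request for this address      */
        int      valid;              /* entry currently in use                */
    };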

In some embodiments, the cache-side address monitor unit 226 and the core-side address monitor unit 220 may work together or cooperate to monitor for writes to one or more addresses (e.g., addresses in the address range indicated by the MONITOR instruction). To further illustrate certain concepts, consider an example of how the monitor mechanism may carry out the MONITOR and MWAIT instructions. The first hardware thread 218-1 may perform a MONITOR instruction. The MONITOR instruction may indicate the address to be monitored for writes. The first hardware thread issues a corresponding monitor request for the indicated monitor address. The monitor request may cause the core-side address monitor unit 220 to store the indicated monitor address 222-1 in the first core-side address monitor storage location 221-1. The monitor state 224-1 may be set to a speculative or monitor-loaded state. The monitor request may also be routed on the interconnect 212 to the appropriate distributed cache portion 208 that is responsible for storing data corresponding to the indicated monitor address. Note that, depending on the particular indicated monitor address, this may be any of the distributed cache portions, based on the hash function or other algorithm used for the mapping. The cache-side address monitor unit may store the indicated monitor address as an address 230 in a cache-side address monitor storage location 228 (e.g., in any available one of the locations 228-1 through 228-N). A core identifier (ID) that identifies the core 202 having the first hardware thread 218-1 may also be stored in the cache-side address monitor storage location. In some embodiments, the core identifier may be a set of bits identifying one of the cores. In other embodiments, a core mask may optionally be used so that a single storage location may be shared by multiple cores for the same monitored address.

The first hardware thread 218-1 may subsequently perform an MWAIT instruction that may also indicate the monitored address. The first hardware thread issues a corresponding MWAIT signal for the indicated monitor address. In response to the MWAIT signal, the core-side address monitor unit 220 may set the monitor state 224-1 to a ready-to-trigger state (e.g., a wait-to-trigger state). The first hardware thread may optionally be placed in a different state, such as, for example, a sleep or other idle state. Typically, if the thread must go to sleep, the first hardware thread may save its state in a context and then go to sleep.

Subsequently, when there is an intent to write to the indicated monitor address (e.g., a read for ownership request, a snoop invalidation indicating the indicated monitor address, a state transition associated with the address changing from a shared state to an exclusive state, etc.), the cache-side address monitor unit may detect this intent to write to the address. The address may match one of the addresses in one of its storage locations. The one or more cores corresponding to the storage location may be determined, for example, from the core identifier or core mask stored in the cache-side address monitor storage location. The cache-side address monitor unit may clear the cache-side address monitor storage location used to store the indicated monitor address. It may signal the corresponding core(s), for example, by sending a snoop invalidation to the corresponding core(s). The cache-side address monitor unit may act as a kind of filter that helps to selectively notify, of the intent to write to the address (e.g., conveyed through a request for ownership or a snoop invalidation), only the one or more cores known to be monitoring that address. These notifications may represent hints that are selectively provided only to the subset of cores monitoring the address. Advantageously, this may help avoid notifying cores that are not monitoring the address, which may help avoid false wakeups and/or reduce traffic on the interconnect.

The core-side address monitor unit 220 at the signaled core(s) may receive the signal and compare the address indicated in the signal (e.g., in the snoop invalidation) with the monitor addresses in its core-side address monitor storage locations. It may determine that the address in the signal matches the monitor address 222-1 in the first core-side monitor address storage location 221-1 corresponding to the first hardware thread 218-1. The core-side address monitor unit thereby knows that the first hardware thread corresponds to the monitored address. The core-side address monitor unit may signal to the core-side trigger unit 234 that an intent to write to the monitored address has been observed. It may also clear the first core-side address monitor storage location and change the monitor state 224-1 to idle. The core-side trigger unit may be operable to provide a trigger signal (e.g., an alert, notification, or wake signal) to the first hardware thread. In the illustrated embodiment, the trigger unit is core-side logic, which may merely be convenient, although it may optionally be provided on the cache side instead. The first hardware thread may be woken up if it was asleep.
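
A minimal C sketch of the cache-side filtering step described above follows, reusing the cache_side_entry layout sketched earlier. The snoop_invalidate_core() helper is a hypothetical stand-in for the interconnect signaling; a real design would of course not use printf.

    #include <stdint.h>
    #include <stdio.h>

    struct cache_side_entry {
        uint64_t monitored_addr;
        uint64_t core_mask;
        int      valid;
    };

    /* Hypothetical stand-in for sending a snoop invalidation to one core. */
    static void snoop_invalidate_core(unsigned core, uint64_t addr)
    {
        printf("snoop-invalidate core %u for address 0x%llx\n",
               core, (unsigned long long)addr);
    }

    /* Called by the cache slice when an intent to write to 'addr' is observed
       (e.g., a read for ownership or snoop invalidation). Notifies only the
       cores whose bit is set in the matching entry's core mask, then clears
       the entry. */
    void on_write_intent(struct cache_side_entry *entries, unsigned n, uint64_t addr)
    {
        for (unsigned i = 0; i < n; i++) {
            if (!entries[i].valid || entries[i].monitored_addr != addr)
                continue;
            for (unsigned core = 0; core < 64; core++) {
                if (entries[i].core_mask & (1ULL << core))
                    snoop_invalidate_core(core, addr);
            }
            entries[i].valid = 0;
            entries[i].core_mask = 0;
        }
    }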

In some embodiments, it is possible for the cache-side address monitor storage locations to overflow. For example, a new monitor request may be received at the cache-side address monitor unit while all of the cache-side address monitor storage locations are currently in use, so that no empty/available cache-side address monitor storage location exists. As shown, in some embodiments, the cache-side address monitor unit may be coupled with a cache-side address monitor storage overflow unit 236 corresponding to the cache portion. In some embodiments, the cache-side address monitor storage overflow unit may be operable to enforce or implement an address monitor storage overflow policy when there is no empty/available/unused cache-side address monitor storage location capable of storing the address of the new monitor request.

As noted, in some embodiments, the core-side address monitor unit may have the same number of core-side address monitor storage locations as the number of hardware threads of its corresponding core. Similarly, in some embodiments, the core-side address monitor units of the other cores may have the same numbers of core-side address monitor storage locations as the numbers of hardware threads of their corresponding cores. Collectively, these core-side address monitor storage locations may represent a number of core-side address monitor storage locations equal to the total number of hardware threads (or other logical processors) of the processor. Advantageously, even when there is an overflow of the cache-side address monitor storage locations, the core-side address monitor units are able to provide sufficient core-side address monitor storage locations for all of the hardware threads (or other logical processors) of the processor.

FIG. 3 is a diagram illustrating states of one embodiment of a monitor finite state machine (FSM) 347 suitable for implementing the MONITOR and MWAIT instructions. Upon receiving a monitor request for an address from an executing thread, the monitor FSM may make a transition 343 from the idle state 340 to the speculative state 341. While the monitor FSM is in the speculative state, if the cache portion that is to store the data corresponding to the address receives a write request that matches the address, or if a monitor clear request is provided from the executing thread, the monitor FSM may make a transition back to the idle state 340. If another monitor request is provided from that same executing thread, the monitor FSM may make a transition 343 back to the speculative state 341, and the monitored address may be adjusted as appropriate. On the other hand, if an MWAIT request is provided from the executing thread while in the speculative state 341, the monitor FSM may make a transition 345 to the wait-to-trigger state 342. This speculative state may be helpful in tracking the address from the time the monitor request is received, even before the MWAIT request is received, while ensuring that monitor-wake events are sent only for the most recently monitored address. While the monitor FSM is in the wait-to-trigger state, a monitor-wake event may be sent to the executing thread when the cache portion that is to store the data corresponding to the address receives a write request that matches the monitored address. Alternatively, a monitor clear request may be provided from the executing thread while the monitor FSM is in the wait-to-trigger state 342. In this case, the monitor request may be cleared for that executing thread, and no monitor-wake event needs to be sent to the executing thread. In either of these two cases, the monitor FSM may make a transition 346 back to the idle state 340.
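
The following is a compact C sketch of the monitor FSM just described. The enum and event names are paraphrases of the states and events in the text rather than names from the patent, and the transition-number bookkeeping is omitted.

    #include <stdbool.h>

    enum fsm_state { FSM_IDLE, FSM_SPECULATIVE, FSM_WAIT2TRIGGER };
    enum fsm_event { EV_MONITOR_REQUEST, EV_MWAIT_REQUEST,
                     EV_MONITOR_CLEAR, EV_MATCHING_WRITE };

    /* Returns the next state; *wake is set when a monitor-wake event should
       be sent to the executing thread. */
    enum fsm_state monitor_fsm_step(enum fsm_state s, enum fsm_event e, bool *wake)
    {
        *wake = false;
        switch (s) {
        case FSM_IDLE:
            return (e == EV_MONITOR_REQUEST) ? FSM_SPECULATIVE : FSM_IDLE;
        case FSM_SPECULATIVE:
            if (e == EV_MONITOR_REQUEST) return FSM_SPECULATIVE;  /* re-arm, possibly new address */
            if (e == EV_MWAIT_REQUEST)   return FSM_WAIT2TRIGGER;
            if (e == EV_MATCHING_WRITE || e == EV_MONITOR_CLEAR) return FSM_IDLE;
            return s;
        case FSM_WAIT2TRIGGER:
            if (e == EV_MATCHING_WRITE) { *wake = true; return FSM_IDLE; }
            if (e == EV_MONITOR_CLEAR)  return FSM_IDLE;
            return s;
        }
        return s;
    }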

FIG. 4 is a block diagram of one embodiment of overflow avoidance logic 460 that may be operable to reuse a single cache-side address monitor storage location 428 for multiple hardware threads and/or cores when monitor requests indicate the same address. The logic includes a cache-side address monitor storage location reuse unit 464 coupled with a cache-side address monitor storage location 428. The cache-side address monitor storage location reuse unit may receive monitor requests 462 from different hardware threads and/or cores indicating the same address. One possible approach would be to store different copies of this same address in different cache-side address monitor storage locations (e.g., in different entries of a hardware-implemented table). However, this may consume several, or in some cases many, cache-side address monitor storage locations.

As an alternative approach, in some embodiments, a single cache-side address monitor storage location 428 may be used to store the address 430 to be monitored that is indicated by monitor requests from different hardware threads and/or cores. In some embodiments, a structure 432 that can associate multiple cores with the address to be monitored may also be stored in the cache-side address monitor storage location 428. In one example, such a structure may include a core mask structure 432. The core mask may have the same number of bits as the total number of cores in the processor, and each bit of the core mask may have a fixed correspondence to a different core. According to one possible convention, each bit may have a first value (e.g., be cleared to binary zero) to indicate that the corresponding core does not have a pending monitor request for the address, or a second value (e.g., be set to binary one) to indicate that the corresponding core has a pending monitor request for the address. The opposite convention is also possible. The bit for a given core may be set when a monitor request for the address stored in the cache-side address monitor storage location is received from that core, and may be cleared when a write to the address is observed and reported to the core-side logic. Note that the cache-side address monitor storage locations are tracked by address, not by thread identifier. Advantageously, in this manner, monitor requests for the same address from different cores may be collapsed into the same single cache-side address monitor storage location. This reuse of a storage location for multiple requests from different threads/cores can help avoid cache-side address monitor storage location overflows.
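
A minimal C sketch of collapsing monitor requests for the same address into a single entry's core mask follows, using the same cache_side_entry layout sketched earlier. The linear search and first-free allocation policy are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    struct cache_side_entry {
        uint64_t monitored_addr;
        uint64_t core_mask;
        int      valid;
    };

    /* Record a monitor request from 'core' for 'addr'. Requests for an address
       that is already tracked share the existing entry; otherwise a free entry
       is claimed. Returns false on overflow (no entry available). */
    bool record_monitor_request(struct cache_side_entry *entries, unsigned n,
                                uint64_t addr, unsigned core)
    {
        struct cache_side_entry *free_slot = 0;
        for (unsigned i = 0; i < n; i++) {
            if (entries[i].valid && entries[i].monitored_addr == addr) {
                entries[i].core_mask |= 1ULL << core;   /* collapse into same entry */
                return true;
            }
            if (!entries[i].valid && !free_slot)
                free_slot = &entries[i];
        }
        if (!free_slot)
            return false;                               /* overflow: apply overflow policy */
        free_slot->monitored_addr = addr;
        free_slot->core_mask = 1ULL << core;
        free_slot->valid = 1;
        return true;
    }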

As noted above, in some cases it is possible to overflow a limited number of cache-side address monitor storage locations. In some embodiments, an overflow mode or set of policies may be provided to allow the monitor mechanism to operate correctly in the event of an overflow.

FIG. 5 is a block flow diagram of one embodiment of a method 570 of attempting to identify a stale/outdated cache-side address monitor storage location, and of entering an overflow mode when no such stale/outdated storage location is found. In some embodiments, the operations and/or method of FIG. 5 may be performed by and/or within the processor of FIG. 1 and/or the cache agent of FIG. 2. The components, features, and specific optional details described herein for the processor of FIG. 1 and the cache agent of FIG. 2 also optionally apply to the operations and/or method of FIG. 5. Alternatively, the operations and/or method of FIG. 5 may be performed by and/or within a similar or different processor and/or cache agent. Moreover, the processor of FIG. 1 and/or the cache agent of FIG. 2 may perform operations and/or methods the same as, similar to, or different from those of FIG. 5.

The method includes determining, at block 571, that there is no available/unused cache-side address monitor storage location to handle a received monitor request. For example, a monitor request may be received at a cache-side address monitor unit (e.g., cache-side address monitor unit 226), and the cache-side address monitor unit may determine that there is no available/unused cache-side address monitor storage location. For example, all of the cache-side address monitor storage locations may currently be storing addresses that are to be monitored.

The method optionally includes, at block 572, determining whether a stale/outdated cache-side address monitor storage location exists that may be used to handle the recently received monitor request. In some embodiments, the cache-side address monitor unit may select an entry with an address to check whether it is stale/outdated. A stale/outdated address represents an address that is still stored in a storage location even though no valid pending monitor request for that address currently exists. For example, there may be instances of stale monitor requests, such as from a monitor that was set up but is no longer armed. The entry may be selected randomly, based on the age of the entry, based on a prediction of validity, or in other ways. In some embodiments, to check whether the storage location is stale/outdated, the cache-side address monitor unit may send a snoop request for the associated address to the one or more cores associated with that storage location (e.g., based on the core identifier or core mask stored in the storage location). The one or more core-side address monitor unit(s) of the core(s) receiving such a snoop request may check their corresponding core-side address monitor storage locations to determine whether the address is stored there. Each of the one or more core-side address monitor unit(s) may then send back to the cache-side address monitor unit a response indicating whether the address is still valid (e.g., still corresponds to a valid monitor request from the corresponding core). If the responses from the one or more core-side address monitor units indicate that a valid pending monitor request for that address still exists, then the address and/or storage location may be determined not to be stale/outdated. Otherwise, if no core-side address monitor unit still reports a valid pending monitor request for that address, then the address and/or storage location may be determined to be stale/outdated. In some embodiments, only a single storage location and/or address may be checked using this approach. Alternatively, multiple storage locations and/or addresses may be checked using this approach.
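
A minimal C sketch of the stale-entry check described above follows, again using the cache_side_entry layout sketched earlier. The core_still_monitoring() helper is a hypothetical stand-in for the snoop/response exchange with the core-side address monitor units.

    #include <stdint.h>
    #include <stdbool.h>

    struct cache_side_entry {
        uint64_t monitored_addr;
        uint64_t core_mask;
        int      valid;
    };

    /* Hypothetical stand-in: snoop one core and return whether it still has a
       valid pending monitor request for 'addr'. */
    bool core_still_monitoring(unsigned core, uint64_t addr);

    /* Ask each core in the entry's mask whether it still cares about the
       address; drop cores that do not. Returns true if the entry is stale
       (no core is still monitoring it) and can be reclaimed. */
    bool entry_is_stale(struct cache_side_entry *e)
    {
        uint64_t still_valid = 0;
        for (unsigned core = 0; core < 64; core++) {
            if ((e->core_mask & (1ULL << core)) &&
                core_still_monitoring(core, e->monitored_addr))
                still_valid |= 1ULL << core;
        }
        e->core_mask = still_valid;
        return still_valid == 0;
    }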

Referring again to FIG. 5, if it is determined at block 572 that such a stale/outdated cache-side address monitor storage location exists and can be used to handle the recently received monitor request (i.e., the determination at block 572 is "yes"), the method may optionally proceed to block 573. At block 573, the stale/outdated cache-side address monitor storage location may optionally be used to handle the recently received monitor request. Advantageously, in this case the overflow mode can be avoided, at least for the time being, by reusing the stale/outdated storage location.

Alternatively, if it is determined at block 572 that no such stale/outdated cache-side address monitor storage location exists (i.e., the determination at block 572 is "no"), the method may proceed to block 574. At block 574, the method may enter an overflow mode. Entering the overflow mode may include enforcing or implementing overflow policies. In the overflow mode, performance may be somewhat degraded. Often, however, the overflow mode is needed only rarely, and usually only for relatively short periods of time until the overflow condition is relieved.

As one overflow policy, at block 575, the method may include forcing all read transactions to use the shared cache coherency state. Conceptually, this can be viewed as treating all read transactions as monitor requests. Upon entering the overflow mode, the cache-side address monitor unit is no longer able to track monitor requests/addresses in dedicated storage. Accordingly, no core may be allowed to have an exclusive copy of a cache line. For example, any read operation received by the cache-side address monitor unit may be handled with a shared-state response. Forcing these read transactions to use the shared state may help ensure that an intent to write to the corresponding address will cause a snoop or broadcast to be provided to all cores that may have cached the address.

As another overflow policy, at block 576, the method includes sending an invalidation request to all cores that may have pending monitor requests. In some embodiments, this may be accomplished by snooping, with invalidation, all cores of the processor and/or in the same socket that may have pending monitor requests whenever any invalidation request is detected (e.g., through detection of a read for ownership, a snoop invalidation request, etc.). Upon entering the overflow mode, the cache-side address monitor unit is no longer able to track monitor requests/addresses in dedicated storage. Therefore, all cores that may have pending monitor requests should be informed of all invalidation requests. These snoops can reach the core-side address monitor units of all of these cores and provide monitor triggers, when appropriate, for any cores with valid pending monitor requests for the associated address.

It is worth noting that it is not strictly required to notify all cores of the processor, but only all cores that may have pending monitor requests. In some embodiments, a structure may optionally be used to keep track of all cores that may have pending monitor requests when an overflow occurs. One example of such a structure is an optional overflow structure. The overflow structure can indicate which cores may have pending monitor requests when an overflow occurs. In one example, the overflow structure may have the same number of bits as the total number of cores in the processor, and each bit may have a fixed correspondence to a different corresponding core. According to one possible convention, each bit may have a first value (e.g., be set to binary one) to indicate that the corresponding core may have a pending monitor request when the overflow occurs, or a second value (e.g., be cleared to binary zero) to indicate that the corresponding core does not have a pending monitor request when the overflow occurs.

In one embodiment, the overflow structure by itself may reflect all of the cores that may have pending monitor requests when an overflow occurs. For example, when an overflow occurs, the overflow structure may be modified to reflect all cores corresponding to any of the one or more addresses currently stored in the cache-side address monitor storage locations. In another embodiment, the overflow structure in combination with the cache-side address monitor storage locations may reflect all of the cores that may have pending monitor requests when an overflow occurs. For example, after an overflow occurs, each time a cache-side address monitor storage location is overwritten or consumed by a newly received monitor request, the cores associated with the address being overwritten or consumed may be reflected in the overflow structure. That is, the overflow structure may be updated each time a storage location is overwritten, to capture information about cores that may have pending monitor requests. In such embodiments, the information about which cores may have pending monitor requests when an overflow occurs is partitioned between the cache-side address monitor storage locations and the overflow structure.
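
A minimal C sketch of the second embodiment just described, in which the overflow structure accumulates the core masks of entries that are overwritten after an overflow, is shown below. It again reuses the cache_side_entry layout sketched earlier, and the helper names are illustrative assumptions.

    #include <stdint.h>

    struct cache_side_entry {
        uint64_t monitored_addr;
        uint64_t core_mask;
        int      valid;
    };

    /* Overflow structure: bit i set => core i may have a pending monitor
       request that is no longer tracked by a dedicated cache-side entry. */
    static uint64_t overflow_mask;

    /* Overwrite an existing entry with a newly received monitor request while
       in overflow mode, folding the displaced cores into the overflow mask. */
    void overwrite_entry_in_overflow(struct cache_side_entry *victim,
                                     uint64_t new_addr, unsigned requesting_core)
    {
        overflow_mask |= victim->core_mask;      /* remember displaced cores */
        victim->monitored_addr = new_addr;
        victim->core_mask = 1ULL << requesting_core;
        victim->valid = 1;
    }

    /* A core that reports it no longer has a valid pending monitor request can
       be cleared from the overflow mask; once the mask and any stale entries
       are cleaned up, the overflow mode can eventually be exited. */
    void clear_core_from_overflow(unsigned core)
    {
        overflow_mask &= ~(1ULL << core);
    }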

In embodiments in which such an overflow structure or related structure is used, it is not required to send each received invalidation request to all cores, but rather only to the cores indicated by the overflow structure and/or the cache-side address monitor storage locations as possibly having pending monitor requests. Some cores may not be indicated in the overflow structure and/or the storage locations, and therefore cannot have had any pending monitor requests when the overflow occurred, so no invalidation requests need to be sent to them. However, the use of such an overflow structure is optional and not required.

Referring again to FIG. 5, the overflow mode may be continued by repeating blocks 575 and 576 as needed, as long as no available storage locations exist. Over time, however, the stale/outdated addresses and/or storage locations may be actively cleaned up as a result of the snoops or other invalidation requests sent at block 576 to all cores that may have pending monitor requests. If the core-side address monitor units do not have valid pending monitor requests for the addresses of these snoops or invalidation requests, they can report this back, which allows the cache-side address monitor unit to indicate (e.g., by updating the core mask) that the core is no longer interested in monitoring the address, or to clear the storage location entirely if no other core is interested in the address. In various embodiments, the removal of stale/outdated storage locations may be performed on the basis of a particular address, a particular cache portion, a particular core, and so on. The overflow mask may also be modified to reflect the cleanup of stale/outdated storage locations or addresses. For example, cores that no longer have pending monitor requests can be updated to zeros instead of ones in the overflow mask. In this manner, the snoops or invalidation requests at block 576 may help clean up stale/outdated storage locations or addresses over time, so that the overflow mode may be exited at some point. As shown at block 577, the overflow mode may be exited.

This is just one exemplary embodiment. Many variations on this embodiment are contemplated. For example, the determination at block 572 is optional and not required. In other embodiments, the overflow mode may be entered automatically without checking for possibly stale entries/addresses.

FIG. 6 is a block diagram of one embodiment of an overflow structure 680. This overflow structure can be used, either alone or in combination with the cache-side address monitor storage locations, to indicate which cores are likely to have pending monitor requests when an overflow occurs. In this embodiment, the overflow structure includes N+1 bits, each having a fixed correspondence to a different one of N+1 cores (e.g., core 0 through core N). According to one possible convention, each bit may have a first value (e.g., be set to binary one) to indicate that the corresponding core is likely to have a pending monitor request when an overflow occurs, or a second value (e.g., be cleared to binary zero) to indicate that the corresponding core is not likely to have a pending monitor request when an overflow occurs. For example, in the figure, the leftmost bit, corresponding to core 0, has a binary zero (i.e., 0) indicating that core 0 has no pending monitor request; the next-to-leftmost bit, corresponding to core 1, has a binary one (i.e., 1) indicating that core 1 has a pending monitor request; and the rightmost bit, corresponding to core N, has a binary zero (i.e., 0) indicating that core N has no pending monitor request. This is just one example of a suitable overflow structure. It should be understood that other structures may be used to convey the same or similar types of information. For example, in another embodiment, a list of core identifiers with pending monitor requests may be stored in a structure or the like.

Any of these units or components may be implemented in hardware (e.g., an integrated circuit, transistors or other circuit elements, etc.), firmware (e.g., a ROM, EPROM, flash memory, or other persistent or non-volatile memory, and microcode, micro-instructions, or other low-level instructions stored therein), software (e.g., high-level instructions stored in memory), or a combination thereof (e.g., hardware potentially combined with one or more of firmware and/or software).

The components, features, and details described for any of FIGS. 1, 3, 4, and 6 may also optionally be used in any of the other figures described herein. It should also be understood that the components, features, and details described herein for any of the apparatus may also optionally be used in any of the methods described herein, which in embodiments may be performed by and/or with such apparatus.

Exemplary core architectures, processors and computer architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general purpose sequential core intended for general-purpose computing; 2) a high performance general purpose non-sequential core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose sequential cores intended for general-purpose computing and/or one or more general purpose non-sequential cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a chip separate from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

Exemplary core architectures

Block diagram of sequential and non-sequential cores

FIG. 7A is a block diagram illustrating both an exemplary sequential pipeline and an exemplary register renaming, non-sequential issue/execution pipeline according to embodiments of the present invention. FIG. 7B is a block diagram illustrating both an exemplary embodiment of a sequential architecture core and an exemplary register renaming, non-sequential issue/execution architecture core to be included in a processor according to embodiments of the present invention. The solid lined boxes in FIGS. 7A-B illustrate the sequential pipeline and sequential core, while the optional addition of the dashed lined boxes illustrates the register renaming, non-sequential issue/execution pipeline and core. Given that the sequential aspect is a subset of the non-sequential aspect, the non-sequential aspect will be described.

In FIG. 7A, a processor pipeline 700 includes a fetch stage 702, a length decode stage 704, a decode stage 706, an allocation stage 708, a renaming stage 710, a scheduling (also known as dispatch or issue) stage 712, a register read/memory read stage 714, an execute stage 716, a write back/memory write stage 718, an exception handling stage 722, and a commit stage 724.
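For reference only, the stage ordering of FIG. 7A can be written down as a simple enumeration; the enumeration itself is an illustrative convenience and not part of the described pipeline:

```cpp
// Pipeline stages of FIG. 7A in program order; an illustrative enumeration only.
enum class PipelineStage {
    Fetch,                   // 702
    LengthDecode,            // 704
    Decode,                  // 706
    Allocation,              // 708
    Renaming,                // 710
    Schedule,                // 712 (also known as dispatch or issue)
    RegisterReadMemoryRead,  // 714
    Execute,                 // 716
    WriteBackMemoryWrite,    // 718
    ExceptionHandling,       // 722
    Commit                   // 724
};
```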

FIG. 7B shows a processor core 790 including a front end unit 730 coupled to an execution engine unit 750, both of which are coupled to a memory unit 770. The core 790 may be a Reduced Instruction Set Computing (RISC) core, a Complex Instruction Set Computing (CISC) core, a Very Long Instruction Word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 790 may be a special purpose core, such as, for example, a network or communication core, a compression engine, a coprocessor core, a general purpose computing graphics processing unit (GPGPU) core, a graphics core, or the like.

The front end unit 730 includes a branch prediction unit 732 coupled to an instruction cache unit 734, the instruction cache unit 734 is coupled to an instruction translation lookaside buffer (TLB) 736, the instruction TLB 736 is coupled to an instruction fetch unit 738, and the instruction fetch unit 738 is coupled to a decode unit 740. The decode unit 740 (or decoder) may decode instructions and generate as an output one or more micro-operations, micro-instructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 740 may be implemented using a number of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 790 includes a microcode ROM or other medium that stores microcode for certain macro-instructions (e.g., in the decode unit 740 or otherwise within the front end unit 730). The decode unit 740 is coupled to a rename/allocator unit 752 in the execution engine unit 750.

The execution engine unit 750 includes the rename/allocator unit 752 coupled to a retirement unit 754 and a set of one or more scheduler unit(s) 756. The scheduler unit(s) 756 represents any number of different schedulers, including reservation stations, central instruction windows, etc. The scheduler unit(s) 756 is coupled to the physical register file(s) unit(s) 758. Each of the physical register file(s) units 758 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, or status (e.g., an instruction pointer that is the address of the next instruction to be executed). In one embodiment, the physical register file(s) unit 758 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 758 is overlapped by the retirement unit 754 to illustrate various ways in which register renaming and non-sequential execution may be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s), and retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 754 and the physical register file(s) unit(s) 758 are coupled to the execution cluster(s) 760. The execution cluster(s) 760 includes a set of one or more execution units 762 and a set of one or more memory access units 764. The execution units 762 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 756, physical register file(s) unit(s) 758, and execution cluster(s) 760 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 764). It should also be understood that where separate pipelines are used, one or more of these pipelines may be non-sequential issue/execution and the rest sequential.

The set of memory access units 764 is coupled to the memory unit 770, which includes a data TLB unit 772 coupled to a data cache unit 774 coupled to a level 2 (L2) cache unit 776. In one exemplary embodiment, the memory access units 764 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 772 in the memory unit 770. The instruction cache unit 734 is further coupled to the level 2 (L2) cache unit 776 in the memory unit 770. The L2 cache unit 776 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, non-sequential issue/execution core architecture may implement the pipeline 700 as follows: 1) the instruction fetch unit 738 performs the fetch and length decoding stages 702 and 704; 2) the decode unit 740 performs the decode stage 706; 3) the rename/allocator unit 752 performs the allocation stage 708 and the renaming stage 710; 4) the scheduler unit(s) 756 performs the schedule stage 712; 5) the physical register file(s) unit(s) 758 and the memory unit 770 perform the register read/memory read stage 714, and the execution cluster 760 performs the execute stage 716; 6) the memory unit 770 and the physical register file(s) unit(s) 758 perform the write back/memory write stage 718; 7) various units may be involved in the exception handling stage 722; and 8) the retirement unit 754 and the physical register file(s) unit(s) 758 perform the commit stage 724.

The core 790 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions), the MIPS instruction set of MIPS Technologies of Sunnyvale, CA, and the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, CA), including the instruction(s) described herein. In one embodiment, the core 790 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter, such as in the Intel® Hyperthreading technology).

While register renaming is described in the context of non-sequential execution, it should be understood that register renaming may also be used in a sequential architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 734/774 and a shared L2 cache unit 776, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the caches may be external to the core and/or the processor.

Certain exemplary sequential core architectures

FIGS. 8A-B show a block diagram of a more specific exemplary sequential core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

FIG. 8A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 802 and with its local subset of the level 2 (L2) cache 804, according to embodiments of the present invention. In one embodiment, an instruction decoder 800 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 806 allows low-latency accesses of cache memory into the scalar and vector units. In one embodiment (to simplify the design), the scalar unit 808 and the vector unit 810 use separate register sets (scalar registers 812 and vector registers 814, respectively), and data transferred between them is written to memory and then read back in from the level 1 (L1) cache 806, while alternative embodiments of the present invention may use a different approach (e.g., use a single register set, or include a communication path that allows data to be moved between the two register files without being written and read back).

The local subset of the L2 cache 804 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 804. Data read by a processor core is stored in its L2 cache subset 804 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 804 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional, allowing agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data-path is 1012-bits wide per direction.
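A highly simplified software model of this per-core subdivision of the L2 cache, in which a write installs a line in the writer's subset and flushes it from the other subsets, is sketched below; the class and method names are assumptions made purely for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Simplified model of a global L2 cache split into per-core local subsets.
// A write by one core installs the line in that core's subset and removes it
// from every other subset, mimicking the behavior described for FIG. 8A.
// All names are illustrative assumptions.
class DistributedL2 {
public:
    explicit DistributedL2(std::size_t num_cores) : subsets_(num_cores) {}

    void read(unsigned core, uint64_t line) {
        subsets_[core].insert(line);                 // data read is kept in the reader's subset
    }

    void write(unsigned core, uint64_t line) {
        for (std::size_t c = 0; c < subsets_.size(); ++c)
            if (c != core) subsets_[c].erase(line);  // flush the line from other subsets
        subsets_[core].insert(line);                 // keep it in the writer's own subset
    }

private:
    std::vector<std::unordered_set<uint64_t>> subsets_;  // one local subset per core
};
```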

FIG. 8B is an expanded view of part of the processor core of FIG. 8A according to embodiments of the present invention. FIG. 8B includes an L1 data cache 806A (part of the L1 cache 806), as well as more detail regarding the vector unit 810 and the vector registers 814. Specifically, the vector unit 810 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 828), which executes one or more of integer, single-precision floating point, and double-precision floating point instructions. The VPU supports swizzling of the register inputs with swizzle unit 820, numeric conversion with numeric conversion units 822A-B, and replication with replicate unit 824 on the memory input. The write mask registers 826 allow predicating of the resulting vector writes.

A processor with integrated memory controller and graphics

FIG. 9 is a block diagram of a processor 900 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the present invention. The solid lined boxes in FIG. 9 illustrate a processor 900 with a single core 902A, a system agent 910, and a set of one or more bus controller units 916, while the optional addition of the dashed lined boxes illustrates an alternative processor 900 with multiple cores 902A-N, a set of one or more integrated memory controller unit(s) 914 in the system agent unit 910, and special purpose logic 908.

Thus, different implementations of the processor 900 may include: 1) a CPU with the special purpose logic 908 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 902A-N being one or more general purpose cores (e.g., general purpose sequential cores, general purpose non-sequential cores, or a combination of the two); 2) a coprocessor with the cores 902A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 902A-N being a large number of general purpose sequential cores. Thus, the processor 900 may be a general purpose processor, a coprocessor, or a special purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a General Purpose Graphics Processing Unit (GPGPU), a high-throughput MIC processor (including more than 30 cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 900 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 906, and external memory (not shown) coupled to the set of integrated memory controller units 914. The set of shared cache units 906 may include one or more intermediate level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, and/or combinations thereof. In one embodiment, a ring based interconnect unit 912 interconnects the integrated graphics logic 908, the set of shared cache units 906, and the system agent unit 910/integrated memory controller unit(s) 914, although alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 906 and the cores 902A-N.

In some embodiments, at least one of the cores 902A-N is capable of multithreading. The system agent 910 includes those components coordinating and operating the cores 902A-N. The system agent unit 910 may include, for example, a power control unit (PCU) and a display unit. The PCU may be, or may include, logic and components needed for regulating the power state of the cores 902A-N and the integrated graphics logic 908. The display unit is for driving one or more externally connected displays.

The cores 902A-N may be homogeneous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 902A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Exemplary computer architecture

FIGS. 10-13 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

Referring now to FIG. 10, shown is a block diagram of a system 1000 in accordance with one embodiment of the present invention. The system 1000 may include one or more processors 1010, 1015, which are coupled to a controller hub 1020. In one embodiment, the controller hub 1020 includes a Graphics Memory Controller Hub (GMCH) 1090 and an Input/Output Hub (IOH) 1050 (which may be on separate chips); the GMCH 1090 includes memory and graphics controllers to which are coupled memory 1040 and a coprocessor 1045; the IOH 1050 couples input/output (I/O) devices 1060 to the GMCH 1090. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1040 and the coprocessor 1045 are coupled directly to the processor 1010, and the controller hub 1020 is in a single chip with the IOH 1050.

The optional nature of the additional processors 1015 is denoted in FIG. 10 with broken lines. Each processor 1010, 1015 may include one or more of the processing cores described herein and may be some version of the processor 900.

The memory 1040 may be, for example, a dynamic random access memory (DRAM), a phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1020 communicates with the processor(s) 1010, 1015 via a bus such as a Front Side Bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection.

In one embodiment, the coprocessor 1045 is a special purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1020 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1010, 1015 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1010 executes instructions that control general types of data processing tasks. Coprocessor instructions may be embedded within the instructions. The processor 1010 recognizes these coprocessor instructions as being of a type that needs to be executed by the associated coprocessor 1045. Thus, the processor 1010 issues these coprocessor instructions (or control signals representing the coprocessor instructions) to the coprocessor 1045 on the coprocessor bus or other interconnect. The coprocessor (s) 1045 accepts and executes the received coprocessor instructions.

Referring now to FIG. 11, shown is a block diagram of a first more specific exemplary system 1100 in accordance with an embodiment of the present invention. As shown in FIG. 11, multiprocessor system 1100 is a point-to-point interconnect system, and includes a first processor 1170 and a second processor 1180 coupled via a point-to-point interconnect 1150. Each of the processors 1170 and 1180 may be some version of the processor 900. In one embodiment of the invention, the processors 1170 and 1180 are respectively the processors 1010 and 1015, while the coprocessor 1138 is the coprocessor 1045. In another embodiment, the processors 1170 and 1180 are respectively the processor 1010 and the coprocessor 1045.

Processors 1170 and 1180 are shown including Integrated Memory Controller (IMC) units 1172 and 1182, respectively. The processor 1170 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 1176 and 1178; similarly, the second processor 1180 includes P-P interfaces 1186 and 1188. The processors 1170 and 1180 may exchange information via a point-to-point (P-P) interface 1150 using P-P interface circuits 1178 and 1188. As shown in FIG. 11, the IMCs 1172 and 1182 couple the processors to respective memories, namely a memory 1132 and a memory 1134, which may be portions of main memory locally attached to the respective processors.

The processors 1170 and 1180 may each exchange information with a chipset 1190 via individual P-P interfaces 1152 and 1154 using point-to-point interface circuits 1176, 1194, 1186, and 1198. The chipset 1190 may optionally exchange information with the coprocessor 1138 via a high-performance interface 1139. In one embodiment, the coprocessor 1138 is a special purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

The chipset 1190 may be coupled to a first bus 1116 via an interface 1196. In one embodiment, the first bus 1116 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 11, various I/O devices 1114 may be coupled to the first bus 1116, along with a bus bridge 1118 which couples the first bus 1116 to a second bus 1120. In one embodiment, one or more additional processor(s) 1115, such as accelerators (e.g., graphics accelerators or Digital Signal Processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 1116. In one embodiment, the second bus 1120 may be a Low Pin Count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 1120 including, for example, a keyboard and/or mouse 1122, communication devices 1127, and a storage unit 1128, such as a disk drive or other mass storage device, which may include instructions/code and data 1130. Further, an audio I/O 1124 may be coupled to the second bus 1120. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 11, a system may implement a multi-drop bus or other such architecture.

Turning now to FIG. 12, shown is a block diagram of a second more specific exemplary system 1200 in accordance with an embodiment of the present invention. Like elements in FIGS. 11 and 12 bear like reference numerals, and certain aspects of FIG. 11 have been omitted from FIG. 12 in order to avoid obscuring other aspects of FIG. 12.

FIG. 12 illustrates that the processors 1170, 1180 may include integrated memory and I/O control logic ("CL") 1172 and 1182, respectively. Thus, the CL 1172, 1182 include integrated memory controller units and include I/O control logic. FIG. 12 illustrates that not only are the memories 1132, 1134 coupled to the CL 1172, 1182, but also that I/O devices 1214 are coupled to the control logic 1172, 1182. Legacy I/O devices 1215 are coupled to the chipset 1190.

Referring now to FIG. 13, shown is a block diagram of an SoC 1300 in accordance with one embodiment of the present invention. Similar elements in FIG. 9 bear the same reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 13, an interconnect unit(s) 1302 is coupled to: an application processor 1310 including a set of one or more cores 202A-N and shared cache unit(s) 906; a system agent unit 910; bus controller unit(s) 916; integrated memory controller unit(s) 914; a set of one or more coprocessors 1320 that may include integrated graphics logic, an image processor, an audio processor, and a video processor; a Static Random Access Memory (SRAM) unit 1330; a Direct Memory Access (DMA) unit 1332; and a display unit 1340 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1320 include a special purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as code 1130 shown in FIG. 11, may be applied to input instructions for performing the functions described herein and for generating output information. The output information may be applied to one or more output devices in a known manner. For purposes of the present application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms disclosed herein are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memories (PCMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the present invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, devices, processors and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (including binary translation, code morphing, etc.)

In some cases, an instruction translator may be used to translate instructions from a source instruction set to a target instruction set. For example, the instruction translator may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction translator may be implemented in software, hardware, firmware, or a combination thereof. The instruction translator may be on processor, off processor, or part on and part off processor.

FIG. 14 is a block diagram contrasting the use of a software instruction translator to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the present invention. In the illustrated embodiment, the instruction translator is a software instruction translator, although alternatively the instruction translator may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 14 shows that a program in a high-level language 1402 may be compiled using an x86 compiler 1404 to generate x86 binary code 1406 that may be natively executed by a processor 1416 having at least one x86 instruction set core. The processor 1416 having at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor having at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software intended to run on an Intel processor having at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor having at least one x86 instruction set core. The x86 compiler 1404 represents a compiler that is operable to generate x86 binary code 1406 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1416 having at least one x86 instruction set core. Similarly, FIG. 14 shows that the program in the high-level language 1402 may be compiled using an alternative instruction set compiler 1408 to generate alternative instruction set binary code 1410 that may be natively executed by a processor 1414 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction translator 1412 is used to convert the x86 binary code 1406 into code that may be natively executed by the processor without an x86 instruction set core 1414. This converted code is not likely to be the same as the alternative instruction set binary code 1410, because an instruction translator capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction translator 1412 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1406.

In the description and the claims, the terms "coupled" and/or "connected", along with their derivatives, may have been used. It should be understood that these terms are not intended as synonyms for each other. Rather, in embodiments, "connected" may be used to indicate that two or more elements are in direct physical and/or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical and/or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. For example, a core may be coupled with a cache portion through one or more intervening components. In the figures, arrows are used to show connections and couplings.

In the description and claims, the terms "logic", "unit", "module", or "component" may have been used. It should be understood that these may include hardware, firmware, software, or a combination thereof. Examples of these include integrated circuitry, application specific integrated circuits, analog circuits, digital circuits, programmable logic devices, memory devices including instructions, and the like. In some embodiments, these may potentially include transistors and/or gates and/or other circuitry components.

In the foregoing description, for purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of embodiments of the invention. However, other embodiments may be practiced without some of these specific details. The scope of the invention is to be determined only by the claims, not by the specific examples provided above. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form or without detail in order to avoid obscuring the understanding of the description. Where multiple components have been shown and described, in some cases they may instead be integrated together as a single component. Where a single component has been shown and described, in some cases it may be separated into two or more components.

Various operations and methods have been described. Although some of the methods have been described in a relatively basic form in the flowcharts, operations may optionally be added to methods and / or eliminated in methods. In addition, although the flowcharts illustrate specific sequences of operations in accordance with embodiments, the specific order is exemplary. Alternate embodiments may optionally perform operations in a different order, combine certain operations, duplicate certain operations, and so on.

Certain operations may be performed by hardware components, or may be embodied in machine-executable or circuit-executable instructions, which may be used to cause and/or result in a machine, circuit, or hardware component (e.g., a processor, portion of a processor, circuitry, etc.) programmed with the instructions performing the operations. The operations may also optionally be performed by a combination of hardware and software.

Some embodiments include an article of manufacture (e.g., a computer program product) that includes a non-transitory machine-readable storage medium. Such a non-transitory machine-readable storage medium does not include transitory propagated signals. Such a non-transitory machine-readable storage medium may include a mechanism that stores information in a form readable by a machine. The machine-readable storage medium may store an instruction or sequence of instructions that, if and/or when executed by a machine, is operable to cause the machine to perform and/or result in the machine performing one or more of the operations, methods, or techniques described herein. Examples of suitable machines include, but are not limited to, processors and computer systems or other electronic devices having such processors. In various embodiments, the non-transitory machine-readable storage medium may include a floppy diskette, an optical storage medium, an optical disk, a CD-ROM, a magnetic disk, a magneto-optical disk, a read only memory (ROM), a programmable ROM (PROM), an erasable-and-programmable ROM (EPROM), an electrically-erasable-and-programmable ROM (EEPROM), a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a flash memory, a non-volatile memory, a non-volatile data storage device, or the like.

Reference throughout this specification to "one embodiment", "an embodiment", "one or more embodiments", or "some embodiments", for example, indicates that a particular feature may be included in the practice of the invention but is not necessarily required to be. Similarly, in the description various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Illustrative Examples

The following examples relate to additional embodiments. The details in these examples may be used in any one or more embodiments.

Example 1 is a processor that includes a cache-side address monitor unit corresponding to a first cache portion of a distributed cache and having a total number of cache-side address monitor storage locations that is less than a total number of logical processors of the processor. Each cache-side address monitor storage location is to store an address to be monitored. The processor also includes a core-side address monitor unit corresponding to a first core and having a same number of core-side address monitor storage locations as a number of one or more logical processors of the first core. Each core-side address monitor storage location is to store an address to be monitored and a monitor state for a different corresponding logical processor of the first core. The processor also includes a cache-side address monitor storage overflow unit corresponding to the first cache portion to enforce an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an additional address to be monitored.
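To make the asymmetry of Example 1 concrete, the two storage structures could be modeled as follows; the field names, the enumerated monitor states, and the example sizes in the comments are assumptions made for illustration rather than part of the example:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-logical-processor monitor state kept on the core side (names assumed).
enum class MonitorState { Idle, Speculative, WaitToTrigger, ReadyToTrigger };

struct CoreSideEntry {
    uint64_t     address = 0;
    MonitorState state   = MonitorState::Idle;
};

// One core-side monitor per core: exactly one entry per hardware thread of that core.
template <std::size_t ThreadsPerCore>
using CoreSideMonitor = std::array<CoreSideEntry, ThreadsPerCore>;

// One cache-side monitor per cache portion: entries hold an address plus a mask
// of the cores interested in it, and there are deliberately fewer entries than
// the processor has logical processors.
struct CacheSideEntry {
    bool     valid     = false;
    uint64_t address   = 0;
    uint64_t core_mask = 0;       // which cores asked to monitor this address
};

struct CacheSideMonitor {
    explicit CacheSideMonitor(std::size_t entries) : table(entries) {}
    std::vector<CacheSideEntry> table;   // e.g., 24 entries for a 60-thread processor (assumed sizes)
};
```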

Example 2 includes the processor of any preceding example and optionally further includes a core-side trigger unit corresponding to the first core and coupled with the core-side address monitor unit. The core-side trigger unit is to trigger the logical processor of the first core when the corresponding core-side address monitor storage location has a ready-to-trigger monitor state and a trigger event is detected.

Example 3 includes the processor of any preceding example and optionally further includes a cache-side address monitor storage location reuse unit, coupled with the cache-side address monitor unit, to record monitor requests from different logical processors for a same monitor address in a common cache-side address monitor storage location.

Example 4 includes the processor of Example 3, in which the common cache-side address monitor storage location includes a structure to record the different logical processors that have provided the monitor requests for the same monitor address.

Example 5 includes the processor of any preceding example, in which the processor has more than 40 hardware threads, and the total number of cache-side address monitor storage locations of the cache-side address monitor unit corresponding to the first cache portion is at least 20 cache-side address monitor storage locations but less than the total number of the more than 40 hardware threads.

Example 6 includes the processor of any of the preceding examples, in which the total number of cache-side address monitor storage locations of the cache-side address monitor unit is less than the total number of logical processors of the processor, but is not less than one tenth of that total number, in order to keep the likelihood of overflow of the cache-side address monitor storage locations low.

Example 7 includes the processor of any preceding example, in which, in response to an instruction indicating a first address to be monitored, the cache-side address monitor unit is to store the first address in a cache-side address monitor storage location, and the core-side address monitor unit is to store the first address in a core-side address monitor storage location.

Example 8 includes a processor of any of the preceding examples, and the logical processors are hardware threads.

Example 9 includes the processor of any of the preceding examples, in which the cache-side address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes forcing read transactions to use a shared state.

Example 10 includes the processor of any preceding example, in which the cache-side address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes sending invalidation requests to all cores likely to have pending monitor requests.

Example 11 includes the processor of Example 10, in which the cache-side address monitor storage overflow unit is to check an overflow structure to determine which cores are likely to have pending monitor requests.

Example 12 is a system to process instructions that includes an interconnect and a processor coupled with the interconnect. The processor includes a first address monitor unit of a cache portion control unit that corresponds to a first cache portion of a distributed cache and has a total number of address monitor storage locations that is less than a total number of hardware threads of the processor. Each address monitor storage location is to store an address to be monitored. The processor also includes a second address monitor unit of a core interface unit that corresponds to a first core and has a same number of address monitor storage locations as a number of one or more hardware threads of the first core. Each address monitor storage location of the second address monitor unit is to store an address to be monitored and a monitor state for a different corresponding hardware thread of the first core. The processor also includes an address monitor storage overflow unit of the cache portion control unit to enforce an address monitor storage overflow policy when all address monitor storage locations of the first address monitor unit are in use and none is available to store an address for a monitor request. The system also includes a dynamic random access memory coupled with the interconnect, a wireless communication device coupled with the interconnect, and an image capture device coupled with the interconnect.

Example 13 includes the system of Example 12, in which the address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes forcing read transactions to use a shared state, and sending invalidation requests to all cores that may have pending monitor requests.

Example 14 includes the system of any of Examples 12-13, in which the processor has more than 40 hardware threads, and the total number of address monitor storage locations of the first address monitor unit is at least 20 but less than the total number of the more than 40 hardware threads of the processor.

Example 15 includes the system of any of Examples 12-14, in which the processor further includes an address monitor storage location reuse unit of the cache portion control unit to record monitor requests from different hardware threads for a same monitor address in a common address monitor storage location.

Example 16 is a method in a processor that includes receiving, from a first logical processor of a first core of a multi-core processor, a first instruction that indicates an address and indicates that writes to the address are to be monitored. In response to the first instruction, the method includes storing the address indicated by the first instruction in a first core-side address monitor storage location of a plurality of core-side address monitor storage locations that correspond to the first core. The number of the plurality of core-side address monitor storage locations is equal to the number of logical processors of the first core. The method also includes storing the address indicated by the first instruction in a first cache-side address monitor storage location of a plurality of cache-side address monitor storage locations that correspond to a first cache portion of a distributed cache. The total number of the cache-side address monitor storage locations is less than the total number of logical processors of the multi-core processor. The method further includes changing a monitor state to a speculative state.
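An illustrative software sketch of the sequence in Example 16 is given below; the structures and the handling function are hypothetical stand-ins for the hardware behavior and are not the claimed implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the method of Example 16; all structure and function
// names are assumptions made for illustration, not the claimed implementation.
enum class MonitorState { Idle, Speculative, WaitToTrigger };

struct CoreSideSlot  { uint64_t address; MonitorState state; };
struct CacheSideSlot { bool valid; uint64_t address; uint64_t core_mask; };

// Handle a MONITOR-style instruction from one logical processor of one core.
MonitorState handle_monitor(std::vector<CoreSideSlot>& core_slots,    // one per thread of the core
                            std::vector<CacheSideSlot>& cache_slots,  // fewer than total threads
                            unsigned core, unsigned logical_processor, uint64_t address) {
    // 1) Record the address in the logical processor's core-side slot.
    core_slots[logical_processor] = {address, MonitorState::Speculative};

    // 2) Record the address in a cache-side slot of the cache portion that owns it.
    for (auto& s : cache_slots) {
        if (s.valid && s.address == address) { s.core_mask |= (1ull << core); return MonitorState::Speculative; }
    }
    for (auto& s : cache_slots) {
        if (!s.valid) { s = {true, address, 1ull << core}; return MonitorState::Speculative; }
    }
    // 3) No slot available: an overflow policy such as that of Examples 19-20 would apply here.
    return MonitorState::Speculative;
}
```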

Example 17 includes the method of Example 16, further including receiving, from a second logical processor of a second core, a second instruction that also indicates the address and indicates that writes to the address are to be monitored, and recording a monitor request for the address for the second core in the first cache-side address monitor storage location.

Example 18 includes the method of Example 17, in which recording the monitor request for the address for the second core in the first cache-side address monitor storage location includes changing a bit in a core mask that has a different bit corresponding to each core of the multi-core processor.

Example 19 includes the method of any preceding example, further including receiving a second instruction that indicates a second address and indicates that writes from the first logical processor to the second address are to be monitored, determining that no cache-side address monitor storage location is available among the plurality of cache-side address monitor storage locations corresponding to the first cache portion, and optionally determining to enter a cache-side address monitor storage location overflow mode.

Example 20 includes the method of Example 19, further including, during the cache-side address monitor storage location overflow mode, forcing all read transactions corresponding to the first cache portion to use a shared cache coherency state, and sending invalidation requests corresponding to the first cache portion to all cores of the multi-core processor that may have one or more pending monitor requests.

Example 21 includes the method of any preceding example, further including receiving, from the first logical processor, a second instruction indicating the address, and optionally, in response to the second instruction, changing the monitor state to a wait-to-trigger state.

Example 22 includes a processor or other device that performs any of the methods of Examples 16-21.

Example 23 includes a processor or device including means for performing the method of any of Examples 16-21.

Example 24 includes integrated circuitry and/or logic and/or units and/or components and/or modules and/or means, or any combination thereof, to perform any of the methods of Examples 16-21.

Example 25 is an optionally non-transitory machine-readable medium that optionally stores or otherwise provides one or more instructions that, if and/or when executed by a machine, are operable to cause the machine to perform the method of any of Examples 16-21.

Example 26 includes a computer system including an interconnect, a processor coupled with the interconnect, and at least one of a DRAM, a graphics chip, a wireless communication chip, a phase change memory, and a video camera coupled with the interconnect, the processor and/or the computer system to perform the method of any of Examples 16-21.

Example 27 includes a processor or other device that substantially performs one or more operations or any method as described herein.

Example 28 includes a processor or other device that includes means for substantially performing one or more operations or any method as described herein.

Example 29 includes a processor or other device that substantially performs the instructions as described herein.

Example 30 includes a processor or other device that includes means for substantially performing the instructions as described herein.

Claims (25)

A processor comprising:
a cache-side address monitor that includes at least some circuitry, corresponds to and is coupled with a first cache portion of a distributed cache, and has a total number of cache-side address monitor storage locations that is less than a total number of logical processors of the processor, wherein the distributed cache includes a plurality of cache portions that during operation are each mapped to non-overlapping ranges of addresses, wherein each cache-side address monitor storage location is to store an address for which the cache-side address monitor is to monitor for writes, and wherein the cache-side address monitor storage locations are not part of the distributed cache;
a core-side address monitor that includes at least some circuitry, corresponds to and is coupled with a first core, and has a same number of core-side address monitor storage locations as a number of one or more logical processors of the first core, wherein each core-side address monitor storage location is to store an address for which the core-side address monitor is to monitor for writes and a monitor state for a different corresponding logical processor of the first core;
a cache-side address monitor storage overflow unit that includes at least some circuitry and is coupled with the cache-side address monitor, to enforce an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an additional address to be monitored; and
a core-side trigger unit that includes at least some circuitry, corresponds to and is coupled with the first core, and is coupled with the core-side address monitor,
wherein the core-side trigger unit is to trigger a logical processor of the first core when the corresponding core-side address monitor storage location has a ready-to-trigger monitor state and a trigger event is detected.
delete
The processor according to claim 1,
further comprising a cache-side address monitor storage location reuse unit that includes at least some circuitry and is coupled with the cache-side address monitor, to record monitor requests from different logical processors for a same monitor address in a common cache-side address monitor storage location.
The processor according to claim 3,
wherein the common cache-side address monitor storage location comprises a structure to record the different logical processors that have provided the monitor requests for the same monitor address.
The processor according to claim 1,
wherein the processor has more than 40 hardware threads, and the total number of cache-side address monitor storage locations of the cache-side address monitor corresponding to the first cache portion is at least 20 cache-side address monitor storage locations but less than the total number of the more than 40 hardware threads.
delete
The processor according to claim 1,
wherein, in response to an instruction indicating a first address to be monitored, the cache-side address monitor is to store the first address in a cache-side address monitor storage location, and the core-side address monitor is to store the first address in a core-side address monitor storage location.
The processor according to claim 1,
wherein the one or more logical processors of the first core comprise hardware threads.
The processor according to claim 1,
wherein the cache-side address monitor storage overflow unit is to force read transactions to use a shared state.
The processor according to claim 1,
wherein the cache-side address monitor storage overflow unit is to send invalidation requests only to a subset of cores for which core identifiers are stored.
The processor according to claim 10,
wherein the cache-side address monitor storage overflow unit is to check an overflow structure to determine the subset of the cores.
A system for processing instructions, comprising:
an interconnect;
a processor coupled with the interconnect, the processor comprising:
a cache portion control unit including a first address monitor that includes at least some circuitry, corresponds to and is coupled with a first cache portion of a distributed cache, and has a total number of address monitor storage locations that is less than a total number of hardware threads of the processor, wherein the distributed cache includes a plurality of cache portions that during operation are each mapped to non-overlapping ranges of addresses, wherein each address monitor storage location is to store an address for which the cache portion control unit is to monitor for writes, and wherein the address monitor storage locations of the first address monitor are different from the distributed cache;
a core interface unit including a second address monitor that includes at least some circuitry, corresponds to and is coupled with a first core, and has a number of address monitor storage locations equal to a number of one or more hardware threads of the first core, wherein each address monitor storage location of the second address monitor is to store an address for which the core interface unit is to monitor for writes and a monitor state for a different corresponding hardware thread of the first core;
an address monitor storage overflow unit of the cache portion control unit that includes at least some circuitry and is coupled with the first address monitor, to enforce an address monitor storage overflow policy when all of the address monitor storage locations of the first address monitor are in use and none is available to store an address for a monitor request; and
a core-side trigger unit that includes at least some circuitry and corresponds to and is coupled with the first core,
wherein the core-side trigger unit is to trigger a hardware thread of the first core when the corresponding address monitor storage location has a ready-to-trigger monitor state and a trigger event is detected;
a dynamic random access memory coupled with the interconnect;
a wireless communication device coupled with the interconnect; and
an image capture device coupled with the interconnect.
The system according to claim 12,
wherein the address monitor storage overflow unit is to enforce the address monitor storage overflow policy, and wherein the address monitor storage overflow policy comprises:
forcing read transactions to use a shared state; and
sending invalidation requests only to a subset of cores for which core identifiers are stored.
The system according to claim 12,
wherein the processor has more than 40 hardware threads, and wherein the total number of address monitor storage locations of the first address monitor is at least 20 but less than the total number of the more than 40 hardware threads of the processor.
15. The system of claim 12,
Wherein the processor further comprises an address monitor storage location reuse unit of the cache portion control unit, including at least some circuitry, to record monitor requests from different hardware threads for a same monitor address in a common address monitor storage location.
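For illustration only, reusing the structures from the sketch after claim 12: a hypothetical routine showing the reuse behavior of claim 15, where a monitor request for an address that is already being watched does not consume a new cache-side location but simply adds the requesting core to the existing entry's core mask.

```c
/* Returns the entry now holding the request, or a null pointer if no
 * location is free (the overflow policy would then take over). */
static struct cacheside_entry *
record_monitor_request(struct cache_portion_monitor *m,
                       uint64_t addr, int core_id)
{
    struct cacheside_entry *free_slot = 0;
    const unsigned n = sizeof m->entry / sizeof m->entry[0];

    for (unsigned i = 0; i < n; i++) {
        struct cacheside_entry *e = &m->entry[i];
        if (e->valid && e->monitored_addr == addr) {
            e->core_mask |= 1ULL << core_id;   /* reuse the common entry */
            return e;
        }
        if (!e->valid && !free_slot)
            free_slot = e;                     /* remember a free slot   */
    }
    if (!free_slot)
        return 0;                              /* overflow case          */

    free_slot->valid          = true;
    free_slot->monitored_addr = addr;
    free_slot->core_mask      = 1ULL << core_id;
    return free_slot;
}
```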
16. A method in a processor,
Receiving a first instruction from a first logical processor of a first core of the multi-core processor, the first instruction indicating an address and indicating to monitor for writes to the address; And
In response to the first instruction:
Storing the address indicated by the first instruction in a first core-side address monitor storage location of a plurality of core-side address monitor storage locations corresponding to the first core, the number of the plurality of core-side address monitor storage locations being equal to the number of logical processors of the first core;
Storing the address indicated by the first instruction in a first cache-side address monitor storage location of a plurality of cache-side address monitor storage locations corresponding to a first cache portion of a distributed cache, the distributed cache comprising a plurality of cache portions each mapped to a non-overlapping range of addresses, wherein the plurality of cache-side address monitor storage locations are not part of the distributed cache, and wherein the total number of cache-side address monitor storage locations is less than the total number of logical processors of the multi-core processor;
Causing the processor to activate monitoring for writes to the address; And
Changing the monitor state to a speculative state;
The method further comprising:
Detecting a write to the address; And
Sending a wake-up signal from the core-side trigger unit to the first logical processor.
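For illustration only, continuing the sketches above with hypothetical function names: the flow of claim 16, where the first instruction records the address at both the core side and the cache side and moves the monitor to a speculative state, and a detected write later triggers the wake-up of the logical processor.

```c
/* Executed when the first (monitor-arming) instruction from a logical
 * processor is handled. */
static void on_monitor_instruction(struct core_interface_monitor *core_mon,
                                   struct cache_portion_monitor *cache_mon,
                                   int thread_id, int core_id, uint64_t addr)
{
    /* Core side: the per-thread location always exists. */
    core_mon->entry[thread_id].valid          = true;
    core_mon->entry[thread_id].monitored_addr = addr;
    core_mon->entry[thread_id].state          = MONITOR_SPECULATIVE;

    /* Cache side: record (or join) the entry in the cache portion whose
     * address range covers this address. */
    record_monitor_request(cache_mon, addr, core_id);
}

/* Executed when the cache portion reports a write to the monitored line. */
static void on_write_detected(struct core_interface_monitor *core_mon,
                              int thread_id)
{
    core_mon->entry[thread_id].state = MONITOR_TRIGGERED;
    /* The core-side trigger unit then sends the wake-up signal to the
     * first logical processor. */
}
```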
17. The method of claim 16,
Further comprising: receiving a second instruction from a second logical processor of a second core, the second instruction also indicating the address and indicating to monitor for writes to the address; And
Writing a monitor request for the address for the second core to the first cache-side address monitor storage location.
18. The method of claim 17,
Wherein writing the monitor request for the address for the second core to the first cache-side address monitor storage location comprises setting a bit in a core mask having a different bit corresponding to each core of the multi-core processor.
19. The method of claim 16,
Further comprising: receiving a second instruction from the first logical processor, the second instruction indicating a second address and indicating to monitor for writes to the second address;
Determining that there are no cache-side address monitor storage locations available among the plurality of cache-side address monitor storage locations corresponding to the first cache portion; And
Entering a cache-side address monitor storage location overflow mode.
20. The method of claim 19,
Further comprising, during the cache-side address monitor storage location overflow mode:
Enforcing all read transactions corresponding to the first cache portion to use a shared cache coherency state; And
Sending invalidation requests corresponding to the first cache portion to only a subset of cores of the multi-core processor having one or more pending monitor requests.
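For illustration only, again with hypothetical names and building on the earlier sketches: how the overflow mode of claims 19 and 20 might be entered when the cache portion has no free cache-side location for a new monitor request. Once active, the policy sketched after claim 11 (shared-state reads and invalidations limited to recorded cores) applies.

```c
/* Handle a new monitor request at the cache portion; fall back to the
 * overflow mode when every cache-side location is already in use. */
static void handle_monitor_request(struct cache_portion_monitor *cache_mon,
                                   struct overflow_struct *ovf,
                                   uint64_t addr, int core_id)
{
    if (record_monitor_request(cache_mon, addr, core_id))
        return;                          /* a storage location was available */

    /* No location available: enter overflow mode and remember only which
     * core asked, so later invalidations can be limited to that subset. */
    ovf->active = true;
    ovf->cores_with_monitors |= 1ULL << core_id;
}
```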
21. The method of claim 16,
Receiving a second instruction from the first logical processor indicating the address; And
In response to the second instruction, changing the monitor state to a trigger wait state.
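For illustration only: claims 16 and 21 together mirror the familiar two-instruction monitor/wait pairing, so a hedged usage sketch using the Intel MONITOR/MWAIT intrinsics from <pmmintrin.h> follows. On most processors these instructions execute only at privilege level 0, so treat this as kernel- or driver-style illustration rather than portable user code, and compile with SSE3 enabled (e.g. gcc -msse3).

```c
#include <pmmintrin.h>
#include <stdint.h>

volatile uint64_t wakeup_flag;   /* written by another hardware thread */

void wait_for_store(void)
{
    while (wakeup_flag == 0) {
        /* First instruction (claim 16): arm the address monitor on the
         * cache line holding the flag; the monitor enters its
         * speculative state. */
        _mm_monitor((const void *)(uintptr_t)&wakeup_flag, 0, 0);

        /* Re-check so a store between the check and MONITOR is not lost. */
        if (wakeup_flag != 0)
            break;

        /* Second instruction (claim 21): move the monitor to its
         * trigger-wait state and idle until a write (or other wake
         * event) triggers this logical processor. */
        _mm_mwait(0, 0);
    }
}
```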
22. The method of claim 16,
Wherein the number of the plurality of core-side address monitor storage locations is equal to the number of hardware threads of the first core.
delete
A processor,
integrated circuit;
A cache portion control unit that is integrated on the integrated circuit, corresponds to and is coupled to a first cache portion of a distributed cache, and has a total number of cache-side address monitor storage locations less than the total number of logical processors of the processor, wherein each cache-side address monitor storage location stores an address for which the cache portion control unit monitors for writes, the cache-side address monitor storage locations being different from the distributed cache;
A core interface unit that is integrated on the integrated circuit, corresponds to and is coupled to a first core, and has the same number of core-side address monitor storage locations as the number of one or more logical processors of the first core, wherein each core-side address monitor storage location stores an address for which the core interface unit monitors for writes and a monitor state for a different corresponding logical processor of the first core; And
A core-side trigger unit that includes at least some of the circuitry and is associated with and connected to the first core,
Wherein the core-side trigger unit triggers a logical processor of the first core when a trigger event is detected while the monitor state for the corresponding core-side address monitor storage location is ready to trigger.
delete
KR1020167005327A 2014-10-03 2014-10-03 Scalably mechanism to implement an instruction that monitors for writes to an address KR101979697B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2014/059130 WO2015048826A1 (en) 2013-09-27 2014-10-03 Scalably mechanism to implement an instruction that monitors for writes to an address

Publications (2)

Publication Number Publication Date
KR20160041950A KR20160041950A (en) 2016-04-18
KR101979697B1 true KR101979697B1 (en) 2019-05-17

Family

ID=56973722

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020167005327A KR101979697B1 (en) 2014-10-03 2014-10-03 Scalably mechanism to implement an instruction that monitors for writes to an address

Country Status (3)

Country Link
JP (1) JP6227151B2 (en)
KR (1) KR101979697B1 (en)
CN (1) CN105683922B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10289516B2 (en) * 2016-12-29 2019-05-14 Intel Corporation NMONITOR instruction for monitoring a plurality of addresses
US10860487B2 (en) * 2019-04-17 2020-12-08 Chengdu Haiguang Integrated Circuit Design Co. Ltd. Multi-core processing device and method of transferring data between cores thereof
CN111857591A (en) 2020-07-20 2020-10-30 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for executing instructions

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282928A1 (en) * 2006-06-06 2007-12-06 Guofang Jiao Processor core stack extension
US20080005504A1 (en) * 2006-06-30 2008-01-03 Jesse Barnes Global overflow method for virtualized transactional memory
US20090172284A1 (en) * 2007-12-28 2009-07-02 Zeev Offen Method and apparatus for monitor and mwait in a distributed cache architecture

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7363474B2 (en) * 2001-12-31 2008-04-22 Intel Corporation Method and apparatus for suspending execution of a thread until a specified memory access occurs
US7213093B2 (en) * 2003-06-27 2007-05-01 Intel Corporation Queued locks using monitor-memory wait

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070282928A1 (en) * 2006-06-06 2007-12-06 Guofang Jiao Processor core stack extension
US20080005504A1 (en) * 2006-06-30 2008-01-03 Jesse Barnes Global overflow method for virtualized transactional memory
US20090172284A1 (en) * 2007-12-28 2009-07-02 Zeev Offen Method and apparatus for monitor and mwait in a distributed cache architecture

Also Published As

Publication number Publication date
CN105683922A (en) 2016-06-15
CN105683922B (en) 2018-12-11
JP6227151B2 (en) 2017-11-08
JP2016532233A (en) 2016-10-13
KR20160041950A (en) 2016-04-18

Similar Documents

Publication Publication Date Title
US10705961B2 (en) Scalably mechanism to implement an instruction that monitors for writes to an address
US20180225211A1 (en) Processors having virtually clustered cores and cache slices
US9740617B2 (en) Hardware apparatuses and methods to control cache line coherence
US10248568B2 (en) Efficient data transfer between a processor core and an accelerator
US9934146B2 (en) Hardware apparatuses and methods to control cache line coherency
US10409727B2 (en) System, apparatus and method for selective enabling of locality-based instruction handling
US9361233B2 (en) Method and apparatus for shared line unified cache
US20170185515A1 (en) Cpu remote snoop filtering mechanism for field programmable gate array
US9690706B2 (en) Changing cache ownership in clustered multiprocessor
US20170286118A1 (en) Processors, methods, systems, and instructions to fetch data to indicated cache level with guaranteed completion
US10102124B2 (en) High bandwidth full-block write commands
US9898298B2 (en) Context save and restore
US9201792B2 (en) Short circuit of probes in a chain
US9146871B2 (en) Retrieval of previously accessed data in a multi-core processor
US10705962B2 (en) Supporting adaptive shared cache management
US20170286301A1 (en) Method, system, and apparatus for a coherency task list to minimize cache snooping between cpu and fpga
US10402336B2 (en) System, apparatus and method for overriding of non-locality-based instruction handling
KR101979697B1 (en) Scalably mechanism to implement an instruction that monitors for writes to an address
US9436605B2 (en) Cache coherency apparatus and method minimizing memory writeback operations
US9037804B2 (en) Efficient support of sparse data structure access
US20200174929A1 (en) System, Apparatus And Method For Dynamic Automatic Sub-Cacheline Granularity Memory Access Control
WO2018001528A1 (en) Apparatus and methods to manage memory side cache eviction

Legal Events

Date Code Title Description
A201 Request for examination
AMND Amendment
E902 Notification of reason for refusal
AMND Amendment
E90F Notification of reason for final refusal
E601 Decision to refuse application
AMND Amendment
X701 Decision to grant (after re-examination)
GRNT Written decision to grant