KR101979697B1 - Scalably mechanism to implement an instruction that monitors for writes to an address - Google Patents
- Publication number
- KR101979697B1
- Authority
- KR
- South Korea
- Prior art keywords
- cache
- core
- address
- monitor
- processor
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/084—Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0831—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
- G06F12/0833—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means in combination with broadcast means (e.g. for invalidation or updating)
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0844—Multiple simultaneous or quasi-simultaneous cache accessing
- G06F12/0846—Cache with multiple tag or data arrays being simultaneously accessible
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The processor includes a cache-side address monitor unit corresponding to a first cache portion of the distributed cache and having a total number of cache-side address monitor storage locations that is less than the total number of logical processors of the processor. Each cache-side address monitor storage location is to store an address to be monitored. A core-side address monitor unit corresponds to a first core and has a number of core-side address monitor storage locations equal to the number of logical processors of the first core. Each core-side address monitor storage location is to store an address and a monitor state for a different corresponding logical processor of the first core. A cache-side address monitor storage overflow unit corresponds to the first cache portion and enforces an address monitor storage overflow policy when an unused cache-side address monitor storage location is not available to store an address to be monitored.
Description
The embodiments described herein relate to processors. In particular, the embodiments described herein generally relate to processors operable to perform instructions that monitor for writes to an address.
Advances in semiconductor processing and logic design have allowed an increase in the amount of logic that may be included in processors and other integrated circuit devices. As a result, many processors now have multiple cores monolithically integrated on a single integrated circuit or die. The multiple cores generally allow multiple software threads or other workloads to be executed concurrently, which generally helps to increase execution throughput.
One challenge in these multiple-core processors is that greater demands are often placed on the caches used to cache data and/or instructions from memory. For one thing, there is an ever-increasing demand for higher interconnect bandwidth to access data in such caches. One technique to help increase interconnect bandwidth for caches involves using a distributed cache. The distributed cache may include multiple physically separate or distributed cache slices or other cache portions. Such a distributed cache may allow parallel access to the different distributed portions of the cache through a shared interconnect.
Another challenge in these multiple-core processors is providing thread synchronization with respect to shared memory. Operating systems commonly implement idle loops to handle thread synchronization with respect to shared memory. For example, there may be several busy loops that use a set of memory locations. A first thread may wait in a loop and poll a corresponding memory location. For example, the memory location may represent a work queue of the first thread, and the first thread may poll the work queue to determine whether there is available work to perform. In a shared memory configuration, exits from the busy loop often occur due to a state change associated with the memory location. Such state changes are commonly triggered by writes to the memory location by other components (e.g., other threads or cores). For example, another thread or core may write to the work queue at the memory location to provide work to be performed by the first thread.
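The busy-wait pattern described above can be sketched in a few lines. This is an illustrative simulation only; the worker, work queue, and job names are hypothetical and not from the patent.

```python
import threading

# Hypothetical sketch of the busy-wait pattern: a worker thread polls a
# shared "work queue" location until another thread writes to it.
work_queue = []
queue_lock = threading.Lock()
results = []

def worker():
    # Busy loop: repeatedly poll the shared location for a state change.
    while True:
        with queue_lock:
            if work_queue:              # state change: another thread wrote a job
                job = work_queue.pop(0)
                results.append(job())
                return

def producer():
    with queue_lock:
        work_queue.append(lambda: 42)   # the "write" that breaks the busy loop

t = threading.Thread(target=worker)
t.start()
producer()
t.join()
print(results)  # [42]
```

The point of MONITOR/MWAIT, introduced below, is to replace exactly this kind of power-hungry polling loop with a hardware-assisted wait.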
Certain processors (e.g., those available from Intel Corporation of Santa Clara, Calif.) may use MONITOR and MWAIT instructions to achieve thread synchronization with respect to shared memory. A hardware thread or other logical processor may use the MONITOR instruction to set up a linear address range to be monitored by a monitor unit and to arm or activate the monitor unit. The address may be provided through a general-purpose register. The address range is generally of the write-back caching type. The monitor unit will monitor and detect stores/writes to addresses within the address range, which will trigger the monitor unit.
The MWAIT instruction may follow the MONITOR instruction in program order and serve as a hint to allow the hardware thread or other logical processor to halt instruction execution and enter an implementation-dependent state. For example, the logical processor may enter a reduced power consumption state. The logical processor may remain in that state until detection of one of a set of qualifying events associated with the preceding MONITOR instruction. A write/store to an address in the address range armed by the preceding MONITOR instruction is one such qualifying event. In such cases, the logical processor may exit the state and resume execution with the instruction following the MWAIT instruction in program order.
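The arm-then-wait protocol described above can be modeled in software. The sketch below is a loose analogy using a condition variable, assuming a single armed address per logical processor; the class and method names are hypothetical, not the actual hardware interface.

```python
import threading

class MonitorUnit:
    """Toy model of a per-logical-processor monitor unit (illustrative only)."""
    def __init__(self):
        self._cv = threading.Condition()
        self._armed_addr = None
        self._triggered = False

    def monitor(self, addr):
        # MONITOR analog: arm the unit with an address to watch.
        with self._cv:
            self._armed_addr = addr
            self._triggered = False

    def mwait(self, timeout=5.0):
        # MWAIT analog: halt until a qualifying write triggers the monitor.
        with self._cv:
            self._cv.wait_for(lambda: self._triggered, timeout=timeout)
            return self._triggered

    def observe_write(self, addr):
        # Called on any store; a write to the armed address triggers wakeup.
        with self._cv:
            if addr == self._armed_addr:
                self._triggered = True
                self._cv.notify_all()

mu = MonitorUnit()
mu.monitor(0x1000)
waiter = threading.Thread(target=lambda: print("woke:", mu.mwait()))
waiter.start()
mu.observe_write(0x1000)   # write to the monitored address wakes the waiter
waiter.join()
```

Note the software analogy blocks the thread, whereas the real instructions merely hint that the logical processor may enter a low-power state.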
The invention may best be understood by reference to the following description and the accompanying drawings that are used to illustrate embodiments. In these drawings:
FIG. 1 is a block diagram of one embodiment of a processor.
FIG. 2 is a block diagram of one embodiment of a cache agent.
FIG. 3 is a diagram illustrating states of an embodiment of a monitor finite state machine.
FIG. 4 is a block diagram of one embodiment of overflow avoidance logic that can be operated to reuse a single cache-side address monitor storage location for multiple hardware threads and/or cores when monitor requests indicate the same address.
FIG. 5 is a block flow diagram of one embodiment of a method of identifying stale/outdated cache-side address monitor storage locations and entering an overflow mode when no such stale storage locations are found.
FIG. 6 is a block diagram of one embodiment of an overflow structure.
FIG. 7A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline in accordance with embodiments of the invention.
FIG. 7B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core to be included in a processor according to embodiments of the invention and an exemplary register renaming, out-of-order issue/execution architecture core.
FIG. 8A is a block diagram of a single processor core, together with its local subset of a level 2 (L2) cache and its connection to an on-die interconnect network, in accordance with embodiments of the invention.
FIG. 8B is an enlarged view of a portion of the processor core of FIG. 8A in accordance with embodiments of the invention.
FIG. 9 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics, in accordance with embodiments of the invention.
FIG. 10 is a block diagram of a system in accordance with an embodiment of the invention.
FIG. 11 is a block diagram of a first, more specific exemplary system in accordance with an embodiment of the invention.
FIG. 12 is a block diagram of a second, more specific exemplary system in accordance with an embodiment of the invention.
FIG. 13 is a block diagram of an SoC in accordance with an embodiment of the invention.
FIG. 14 is a block diagram contrasting use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set in accordance with embodiments of the invention.
Methods, apparatus, and systems for scalably implementing instructions to monitor writes to addresses are disclosed herein. In the following description, a number of specific details are presented (e.g., specific instructions, command functions, processor configurations, microarchitecture details, a series of operations, etc.). However, embodiments may be practiced without these specific details. In other instances, well known circuits, structures, and techniques have not been shown in detail in order to avoid obscuring the understanding of this description.
FIG. 1 is a block diagram of one embodiment of a processor.
Such a processor is a multi-core processor having multiple processor cores (102). In the illustrated exemplary embodiment, the processor has eight cores, including core 0 (102-0) through core 7 (102-7) (collectively, cores 102). In other embodiments, the processor may include any other desired number of cores, for example, from two to hundreds, often from two to dozens (e.g., from about 5 to about 100). Each of the cores may have a single hardware thread or multiple hardware threads, or some cores may have a single hardware thread while other cores have multiple hardware threads. In one exemplary embodiment, each of the cores may have at least two hardware threads, although the scope of the invention is not so limited.
The term core refers to logic located on an integrated circuit that is capable of maintaining an independent architectural state (e.g., an execution state), in which the independently maintained architectural state is associated with dedicated execution resources. In contrast, the term hardware thread refers to logic located on an integrated circuit that is capable of maintaining an independent architectural state, in which the independently maintained architectural state shares access to the execution resources it uses. The line between a core and a hardware thread is less distinct when certain resources are shared by the architectural states while others are dedicated to a given architectural state. Nevertheless, cores and hardware threads are often viewed by the operating system as individual processing elements or logical processors. The operating system can generally schedule operations on each of the cores, hardware threads, or other logical processors or processing elements individually. In other words, a processing element or logical processor, in one embodiment, may represent any on-die processor logic capable of being independently associated with code, such as a software thread, operating system, application, or other code, whether the execution resources are dedicated, shared, or some combination thereof. In addition to hardware threads and cores, other examples of logical processors or processing elements include, but are not limited to, thread units, thread slots, processing units, contexts, and/or any other logic that is capable of holding state and being independently associated with code.
The cores 102 are connected together by one or more on-die interconnects 112. Such an interconnect may be used to transfer messages and data between cores. It will be appreciated that many different types of interconnects are suitable. In one embodiment, a ring interconnect may be used. In alternate embodiments, a mesh, a torus, a crossbar, a hypercube, another interconnect structure, or a hybrid or combination of such interconnects may be used.
Each core may include local instruction and/or data storage, such as, for example, one or more lower levels of cache (not shown). For example, each core may include a corresponding lowest-level or level 1 (L1) cache closest to the core and, optionally, a next-closest mid-level or level 2 (L2) cache. The one or more lower levels of cache are referred to as lower level because they are physically and/or logically closer to their corresponding cores than the higher-level cache(s). Each of the one or more levels of cache may cache data and/or instructions.
The cores 102 may also share a distributed cache. The distributed cache may include multiple physically separate or distributed cache slices or other cache portions (e.g., cache portion 108-0, cache portion 108-1, and so on).
In some embodiments, each cache portion may correspond to, or map to, a mutually exclusive or non-overlapping range of memory addresses. For example, cache portion 108-0 may have an associated first set of addresses, cache portion 108-1 may have a different, second set of associated addresses, and so on. The address ranges may be partitioned or distributed among the different cache portions of the distributed cache in various ways (e.g., using different hash functions or other algorithms). In some embodiments, the shared higher-level cache may represent a last level cache (LLC) operable to store data and/or instructions, although this is not required. In some embodiments, the distributed cache (e.g., the LLC) may be inclusive of all lower levels of cache in the cache hierarchy, or of the next-to-highest level of cache in the cache hierarchy, although this is not required. In some embodiments, the cores may initially check one or more lower level caches for data and/or instructions. If the requested data and/or instructions are not found in the one or more lower level caches, the cores may then proceed to check the shared distributed higher-level cache.
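As a concrete illustration of how addresses might partition across cache portions, the toy hash below maps every cache line to exactly one of eight slices. The patent does not specify a particular hash function; the modulo-on-line-number scheme here is purely an assumed example.

```python
# Illustrative only: one simple way a distributed cache could partition the
# address space across slices, so that each address maps to exactly one slice.
NUM_SLICES = 8
LINE_BYTES = 64

def slice_for_address(addr):
    line = addr // LINE_BYTES        # drop the offset-within-line bits
    return line % NUM_SLICES         # toy hash: low line-number bits pick a slice

# Every address maps to exactly one slice (mutually exclusive ranges), and
# consecutive lines spread across slices, enabling parallel access.
assert all(0 <= slice_for_address(a) < NUM_SLICES for a in range(0, 4096, 64))
print(sorted({slice_for_address(a) for a in range(0, 8 * 64, 64)}))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Real designs typically hash higher address bits as well, to avoid hot-spotting on strided accesses; this sketch only shows the one-address-to-one-slice property the text relies on.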
As shown, in some embodiments, a core interface (I/F) unit may couple each of the cores with the interconnect 112.
The processor includes a first cache coherency aware memory controller 110-1 to couple the processor to a first memory (not shown), and a second cache coherency aware memory controller 110-2 to couple the processor to a second memory (not shown). In some embodiments, each cache coherency aware memory controller may include home agent logic operable to perform cache coherency functions and memory controller logic operable to interact with the memory. For simplicity, in the present description such home agent and memory controller functionality is referred to as a cache coherency aware memory controller. Other embodiments may include fewer or more cache coherency aware memory controllers. Moreover, while in the illustrated embodiment the cache coherency aware memory controllers are on-die or on-processor, in other embodiments they may instead be off-die or off-processor (e.g., as one or more chipset components).
It should be understood that a processor may also include other components not required to understand various embodiments herein. For example, the processor may optionally include one or more of an interface to an input and / or output device, a system interface, a socket-to-socket interconnect, and the like.
As noted above, certain processors (such as those available from Intel Corporation) may use the MONITOR and MWAIT instructions to achieve thread synchronization with respect to shared memory. A hardware thread or other logical processor may use the MONITOR instruction to set up a linear address range to be monitored by the monitor unit and to arm or activate the monitor unit. The address may be provided through a general-purpose register (e.g., EAX). The address range is generally of the write-back caching type. The monitor unit will monitor and detect stores/writes to addresses within the address range, which will trigger the monitor unit. Other general-purpose registers (e.g., ECX and EDX) may be used to communicate other information to the monitor unit. The MWAIT instruction may follow the MONITOR instruction in program order and serve as a hint to allow the hardware thread or other logical processor to halt instruction execution and enter an implementation-dependent state. For example, the logical processor may enter a sleep state, a power-saving C-state, or another low-power state. The logical processor may remain in that state until detection of one of a set of qualifying events associated with the preceding MONITOR instruction. A write/store to an address in the address range armed by the preceding MONITOR instruction is one such qualifying event. In such cases, the logical processor may exit the state and resume execution with the instruction following the MWAIT instruction in program order. General-purpose registers (e.g., EAX and ECX) may be used to communicate other information (e.g., information about the state being entered) to the monitor unit.
FIG. 2 is a block diagram of one embodiment of a cache agent.
The
In the illustrated embodiment, the cache-side address monitor unit 226 and the cache-side storage overflow unit 236 are both implemented in the cache agent.
The cache-side address monitor unit 226 corresponds to the first cache portion of the distributed cache.
In some embodiments, the total number of cache-side address monitor storage locations in the cache-side address monitor unit corresponding to the cache portion may be less than the total number of hardware threads (or other logical processors) of the processor. In some embodiments, each hardware thread (or other logical processor) may be operable to use a monitor instruction (e.g., a MONITOR instruction) to monitor a single address or a single range of addresses. In some cases, after using such a monitor instruction, the hardware thread may go to sleep or may be placed in another power saving state. One possible approach would be to provide enough cache-side address monitor storage locations 228 to store an address to be monitored for each hardware thread (or other logical processor). However, when a distributed cache is used, each address may hash or otherwise map to only a single corresponding cache slice or other cache portion. For example, a hash of the address may select, according to a particular hash function, the single cache slice corresponding to that address. Thus, when such a distributed cache is used, there is a possibility, although generally a very small one, that all of the addresses to be monitored for all of the hardware threads (or other logical processors) will hash, or otherwise map, to the same single cache slice.
To allow for this possibility, one possible approach is to provide, for each cache portion, a number of cache-side address monitor storage locations 228 equal to the total number of hardware threads of the processor and/or socket. For example, in an eight-core processor in which each core has two hardware threads, a total of 16 cache-side address monitor storage locations (i.e., the number of cores times the number of threads per core) could be provided for each cache portion. For example, a hardware-implemented table having a number of entries equal to the total number of hardware threads may be included. In some cases, each storage location may have a fixed mapping or assignment to a corresponding hardware thread. This allows an address to be stored for every hardware thread being monitored, and accommodates the possibility that all of these addresses map to the same cache portion and thus need to be stored locally for that cache portion. This approach is designed for an essentially worst-case scenario that is generally quite unlikely, but that nevertheless cannot be ignored, since correct operation must be guaranteed if it ever occurs.
One disadvantage of this approach is that it tends to scale poorly as the number of hardware threads (or other logical processors) and/or the number of cache portions increases. Increasing the number of hardware threads increases the number of storage locations needed for each cache portion. Moreover, increasing the number of cache portions adds a further set of these storage locations for each additional cache portion. Processors may have, for example, more than 32 threads, 36 threads, 40 threads, 56 threads, 128 threads, or 256 threads. It is easy to see that the amount of storage can become significant when such large numbers of threads are used. Such significant amounts of storage tend to increase the manufacturing cost of the processor, the amount of on-die area needed to provide the storage, and/or the power consumed by the storage.
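The scaling problem can be made concrete with simple arithmetic. The function below computes the worst-case number of storage locations under the per-thread-per-slice approach; the specific core, thread, and slice counts are illustrative.

```python
# Back-of-the-envelope illustration of why per-thread storage at every cache
# slice scales poorly. Counts below are illustrative, not from the patent.
def worst_case_entries(num_cores, threads_per_core, num_slices):
    total_threads = num_cores * threads_per_core
    # one location per hardware thread, replicated at every slice
    return total_threads * num_slices

# The 8-core, 2-thread example from the text: 16 locations per slice.
assert worst_case_entries(8, 2, 1) == 16

# Scaling up: 128 hardware threads replicated across 32 slices.
print(worst_case_entries(64, 2, 32))   # 4096
```

The product grows with both thread count and slice count, which is exactly the multiplicative cost the alternative (fewer-than-threads) approach below avoids.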
As an alternative approach, in some embodiments, the total number of cache-side address monitor storage locations 228 in the cache-side address monitor unit 226 corresponding to the first cache portion may be less than the total number of hardware threads (or other logical processors) of the processor.
Referring again to FIG. 2, the cache agent includes a core-side address monitor unit 220, which corresponds to the first core and has a number of core-side address monitor storage locations equal to the number of hardware threads of the first core.
In some embodiments, the cache-side address monitor unit 226 and the core-side address monitor unit 220 may cooperate to monitor for writes to one or more addresses (e.g., addresses in an address range indicated by a MONITOR instruction). To further illustrate certain concepts, consider an example of how the monitor mechanism may carry out the MONITOR and MWAIT instructions. The first hardware thread 218-1 may perform a MONITOR instruction. The MONITOR instruction may indicate an address to be monitored for writes. The first hardware thread issues a corresponding monitor request for the indicated monitor address. This monitor request may cause the first core-side address monitor unit 220 to store the indicated monitor address 222-1 in the first core-side address monitor storage location 221-1. The monitor state 224-1 may be set to a speculative or monitor-loaded state. The monitor request may be routed on the interconnect to the cache portion corresponding to the indicated monitor address (e.g., the cache portion to which the address hashes or otherwise maps).
The first thread 218-1 may subsequently perform an MWAIT instruction that may also indicate the monitor address. The first hardware thread issues a corresponding MWAIT signal for the indicated monitor address. In response to this MWAIT signal, the core-side address monitor unit 220 may set the monitor state 224-1 to a trigger-ready state (e.g., a wait-to-trigger state). The first hardware thread may optionally be placed in a different state, such as, for example, a sleep or other low-power state. Typically, if the thread must go to sleep, the first thread may store its state in a context and then go to sleep.
Subsequently, when there is an intent to write to the indicated monitor address (e.g., a read-for-ownership request, a snoop invalidation indicating the monitor address, a state transition in which the address changes from a shared state to an exclusive state, etc.), the cache-side address monitor unit can detect this intent to write to the address. The address may match one of the addresses in one of its storage locations. The one or more cores corresponding to the storage location may be determined, for example, from a core identifier or core mask stored in the cache-side address monitor storage location. The cache-side address monitor unit may clear the cache-side address monitor storage location used to store the indicated monitor address. It may then signal the corresponding core(s), for example, by sending a snoop invalidation to the corresponding core(s). In this way, the cache-side address monitor unit can act as a kind of filter, selectively notifying only the one or more cores known to be monitoring the address of the intent to write (e.g., via a read for ownership or a snoop invalidation). These notifications may serve as "hints" provided selectively to the subset of cores monitoring that address. Advantageously, this may help avoid notifying cores that are not monitoring the address, which may help to avoid false wakeups and/or reduce traffic on the interconnect.
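The filtering behavior described above can be sketched as a small table mapping each monitored address to a bitmask of the cores monitoring it. This is an illustrative model, not the hardware design; the class name and 64-core bound are assumptions.

```python
# Hedged sketch of the cache-side "filter": store an address with a mask of
# the cores monitoring it, and notify only those cores on an intent to write.
class CacheSideMonitorTable:
    def __init__(self):
        self.entries = {}                 # monitored address -> core bitmask

    def add_monitor(self, addr, core_id):
        self.entries[addr] = self.entries.get(addr, 0) | (1 << core_id)

    def intent_to_write(self, addr):
        # Return the cores to snoop/notify; clear the entry once consumed.
        mask = self.entries.pop(addr, 0)
        return [c for c in range(64) if mask & (1 << c)]

table = CacheSideMonitorTable()
table.add_monitor(0x2000, core_id=3)
table.add_monitor(0x2000, core_id=5)
print(table.intent_to_write(0x2000))   # [3, 5] -- only the monitoring cores
print(table.intent_to_write(0x3000))   # []    -- unmonitored address: no wakeups
```

The empty result for the unmonitored address is the filtering win: no snoops, no false wakeups, less interconnect traffic.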
The core-side address monitor unit 220 at the signaled core(s) can receive the signal and compare the address indicated in the signal (e.g., in the snoop invalidation) with the monitor addresses in its core-side address monitor storage locations. It can determine whether the address of the signal matches the monitor address 222-1 in the first core-side monitor address storage location 221-1 corresponding to the first hardware thread 218-1. In this way, the core-side address monitor unit can learn that the first hardware thread corresponds to the monitored address. The core-side address monitor unit may then send a signal to the corresponding first hardware thread (e.g., to trigger it to wake).
In some embodiments, it is possible for the cache-side address monitor storage locations to overflow. For example, a new monitor request may be received at the cache-side address monitor unit while all of the cache-side address monitor storage locations are currently in use, so that no empty/available cache-side address monitor storage location exists. As shown, in some embodiments, the cache-side address monitor unit may be coupled with a cache-side address monitor storage overflow unit 236 corresponding to the cache portion. In some embodiments, the cache-side address monitor storage overflow unit may be operable to enforce or implement a storage overflow policy when there are no empty/available/unused cache-side address monitor storage locations capable of storing the address of the new monitor request.
As noted, in some embodiments, the core-side address monitor unit may have the same number of core-side address monitor storage locations as the number of hardware threads in its corresponding core. Similarly, in some embodiments, the core-side address monitor units of the other cores may have the same number of core-side address monitor storage locations as the number of hardware threads in their corresponding cores. Collectively, these core-side address monitor storage locations may represent a number of core-side address monitor storage locations equal to the total number of hardware threads (or other logical processors) of the processor. Advantageously, even when there is an overflow of the cache-side address monitor storage locations, the core-side address monitor units collectively provide sufficient core-side address monitor storage locations for all of the hardware threads (or other logical processors) of the processor.
FIG. 3 is a diagram illustrating states of one embodiment of a monitor finite state machine (FSM) 347 suitable for implementing the MONITOR and MWAIT instructions. Upon receiving a monitor request for an address from an executing thread, the monitor FSM may transition from an idle state to a speculative or monitor-loaded state.
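A minimal software model of such a monitor FSM might look as follows, assuming four states (idle, monitor-loaded, wait-to-trigger, triggered) and transitions driven by a monitor request, an MWAIT, and an observed write. The state names and class structure are illustrative, not taken from the patent figures.

```python
from enum import Enum, auto

class MonState(Enum):
    IDLE = auto()
    MONITOR_LOADED = auto()   # armed by MONITOR (the speculative state in the text)
    WAIT_TO_TRIGGER = auto()  # armed and MWAIT issued
    TRIGGERED = auto()

class MonitorFSM:
    def __init__(self):
        self.state = MonState.IDLE
        self.addr = None

    def monitor_request(self, addr):
        self.addr = addr
        self.state = MonState.MONITOR_LOADED

    def mwait(self):
        if self.state is MonState.MONITOR_LOADED:
            self.state = MonState.WAIT_TO_TRIGGER

    def write_observed(self, addr):
        # Qualifying event: a write to the armed address while waiting.
        if addr == self.addr and self.state is MonState.WAIT_TO_TRIGGER:
            self.state = MonState.TRIGGERED

fsm = MonitorFSM()
fsm.monitor_request(0x40)
fsm.mwait()
fsm.write_observed(0x40)
print(fsm.state)   # MonState.TRIGGERED
```

Note that a write observed before MWAIT leaves the FSM in the monitor-loaded state in this sketch; a real implementation must also handle that race so the subsequent MWAIT returns promptly.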
FIG. 4 is a block diagram of one embodiment of overflow avoidance logic 460 that may be operable to reuse a single cache-side address monitor storage location 428 for multiple hardware threads and/or cores when monitor requests indicate the same address. This logic includes a cache-side address monitor storage location reuse unit 464 coupled with a cache-side address monitor storage location 428. The reuse unit may receive monitor requests 462 indicating the same address from different hardware threads and/or cores. One possible approach would be to store different copies of this same address in different cache-side address monitor storage locations (e.g., in different entries of a table implemented in hardware). However, this may consume several, or in some cases many, cache-side address monitor storage locations.
As an alternative approach, in some embodiments a single cache-side address monitor storage location 428 may be used to store the single address for the multiple hardware threads and/or cores whose monitor requests indicate that address.
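The reuse idea can be sketched as follows: a monitor request for an address already present merges into the existing location instead of consuming a new one. The structure below is an assumed illustration, not the patented hardware.

```python
# Illustrative sketch of storage-location reuse: when a second monitor request
# names an address already stored, merge into the existing location rather
# than consuming another one.
class ReusingMonitorStorage:
    def __init__(self, capacity):
        self.capacity = capacity
        self.locations = {}              # addr -> set of requesting cores

    def monitor_request(self, addr, core_id):
        if addr in self.locations:       # reuse: same address, add the core
            self.locations[addr].add(core_id)
            return True
        if len(self.locations) < self.capacity:
            self.locations[addr] = {core_id}
            return True
        return False                     # overflow: no free location

s = ReusingMonitorStorage(capacity=1)
assert s.monitor_request(0x100, core_id=0)
assert s.monitor_request(0x100, core_id=1)      # same address: one location serves both
assert not s.monitor_request(0x200, core_id=2)  # different address: would overflow
print(len(s.locations), s.locations[0x100])     # 1 {0, 1}
```

With even a one-entry capacity, any number of threads monitoring the same address fit; only distinct addresses consume locations.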
As noted above, in some cases it is possible to overflow the limited number of cache-side address monitor storage locations. In some embodiments, an overflow mode or set of policies may be provided to allow the monitor mechanism to operate correctly in the event of an overflow.
FIG. 5 is a block flow diagram of one embodiment of a method 570 of handling an overflow by identifying stale/outdated cache-side address monitor storage locations and entering an overflow mode when no such stale storage locations are found. In some embodiments, the operations and/or method of FIG. 5 may be performed by and/or within the processor of FIG. 1 and/or the cache agent of FIG. 2. The components, features, and specific optional details described herein for the processor of FIG. 1 and the cache agent of FIG. 2 also optionally apply to the operations and/or method of FIG. 5. Alternatively, the operations and/or method of FIG. 5 may be performed by and/or within a similar or different processor and/or cache agent. Moreover, the processor of FIG. 1 and/or the cache agent of FIG. 2 may perform operations and/or methods the same as, similar to, or different from those of FIG. 5.
The method includes determining, at
The method optionally includes, at
Referring again to FIG. 5, if it is determined at
Alternatively, if it is determined at
As an overflow policy, at block 575, the method may include forcing all read transactions to use the shared cache coherency state. Conceptually, this can be viewed as handling all read transactions as monitor requests. Upon entering the overflow mode, the cache-side address monitor unit is no longer able to track monitor requests/addresses in dedicated storage. Accordingly, no core may be allowed to have an exclusive copy of a cache line. For example, any read operation received by the cache-side address monitor unit may be handled with a shared-state response. Forcing these read transactions to use the shared state helps ensure that an intent to write to the corresponding address will cause a snoop or broadcast to be provided to all cores that may have cached the address.
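This shared-state overflow policy can be sketched in two small functions: one forcing every read response to the shared ("S") state while in overflow mode, and one conservatively snooping all cores on an intent to write. The state letters and eight-core count are illustrative assumptions.

```python
# Sketch of the overflow policy: while in overflow mode, every read is
# answered with the Shared coherency state, so any later intent to write
# must broadcast invalidations that reach all potentially waiting cores.
NUM_CORES = 8

def respond_to_read(in_overflow, would_be_state):
    # Normally a read may be granted Exclusive ("E"); in overflow, force Shared.
    return "S" if in_overflow else would_be_state

def cores_to_snoop(in_overflow, tracked_cores):
    # With no dedicated tracking available, conservatively snoop everyone.
    return list(range(NUM_CORES)) if in_overflow else tracked_cores

print(respond_to_read(True, "E"))   # S
print(cores_to_snoop(True, [2]))    # [0, 1, 2, 3, 4, 5, 6, 7]
print(cores_to_snoop(False, [2]))   # [2]
```

The cost of the policy is visible in the last two lines: overflow mode trades the precise notification of the normal path for a correct but noisy broadcast.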
As another overflow policy, the method may include sending invalidation requests to all cores that may have pending monitor requests. Because the cache-side address monitor unit can no longer determine from its dedicated storage which specific cores are monitoring a given address, any transaction that indicates an intent to write may be broadcast so that no pending monitor request is missed.
It is worth noting that it is not strictly required to notify all cores of the processor, but only those cores that may have pending monitor requests. In some embodiments, a structure may optionally be used to keep track of all cores that may have pending monitor requests when an overflow occurs. One example of such a structure is an optional overflow structure, which can indicate which cores may have pending monitor requests when an overflow occurs. In one example, the overflow structure may have the same number of bits as the total number of cores in the processor, with each bit in fixed correspondence to a different core. According to one possible convention, each bit may have a first value (e.g., be set to binary one) to indicate that the corresponding core may have a pending monitor request when an overflow occurs, or a second value (e.g., be cleared to binary zero) to indicate that the corresponding core cannot have a pending monitor request when an overflow occurs.
In one embodiment, the overflow structure by itself may reflect all of the cores that may have pending monitor requests when an overflow occurs. For example, when an overflow occurs, the overflow structure may be modified to reflect all cores corresponding to any one or more addresses currently stored in the cache-side address monitor storage locations. In another embodiment, the overflow structure in combination with the cache-side address monitor storage locations may reflect all of the cores that may have pending monitor requests when an overflow occurs. For example, while an overflow is in effect, each time a cache-side address monitor storage location is overwritten or consumed by a newly received monitor request, the cores associated with the address being overwritten may be reflected in the overflow structure. That is, the overflow structure may be updated each time a storage location is overwritten, to capture information about cores that may have pending monitor requests. In these embodiments, information about which cores may have pending monitor requests when an overflow occurs is partitioned between the cache-side address monitor storage locations and the overflow structure.
In embodiments in which such an overflow structure or related structure is used, it is not required to send a received invalidation request to all cores, but only to the cores indicated by the overflow structure and/or by the cache-side address monitor storage locations. Cores that are indicated in neither the overflow structure nor the storage locations cannot have any pending monitor requests when the overflow occurs, so no invalidation requests need to be sent to them. However, the use of such an overflow structure is optional and not required.
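The one-bit-per-core overflow structure of the preceding paragraphs, combined with targeted invalidation, might look like the following sketch. The bit-vector convention is the one described above; the class and method names are illustrative assumptions:

```python
class OverflowVector:
    """One bit per core: a set bit means the corresponding core may
    have a pending monitor request when an overflow occurs."""

    def __init__(self, num_cores):
        self.bits = 0
        self.num_cores = num_cores

    def mark(self, core):
        # Called, e.g., when a cache-side storage location holding this
        # core's monitor request is overwritten during the overflow.
        self.bits |= 1 << core

    def cores_to_invalidate(self):
        # Only these cores need to receive an invalidation request;
        # unmarked cores cannot have a pending monitor request.
        return [c for c in range(self.num_cores) if self.bits & (1 << c)]
```

An invalidation for a monitored address is then sent only to `cores_to_invalidate()` rather than broadcast to every core of the processor.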
Referring again to FIG. 5, the overflow mode may be continued by repeating the overflow policies for subsequently received transactions until the overflow condition no longer exists (e.g., until pending monitor requests have been satisfied or cleared and cache-side address monitor storage locations become available), at which point normal monitoring may resume.
This is just one exemplary embodiment. Many variations on this embodiment are contemplated. For example, the determinations and overflow policies described above may optionally be performed in different orders, combined, or omitted.
FIG. 6 is a block diagram of one embodiment of an apparatus including units suitable for implementing the address monitoring described herein.
Any of these units or components may be implemented in hardware (e.g., an integrated circuit, transistors, or other circuit elements), firmware (e.g., microcode, microinstructions, or other low-level instructions stored in ROM, EPROM, flash memory, or other persistent or nonvolatile memory), software (e.g., high-level instructions stored in memory), or a combination thereof (e.g., hardware potentially combined with one or more of firmware and/or software).
The components, features, and details described for any of FIGS. 1, 3, 4, and 6 may also optionally be used in any of the other figures. It should also be understood that the components, features, and details described herein for any of the apparatus may optionally be used in any of the methods described herein, which, in embodiments, may be performed by and/or with such apparatus.
Exemplary core architectures, processors and computer architectures
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose sequential core intended for general purpose computing; 2) a high performance general purpose non-sequential core intended for general purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose sequential cores intended for general purpose computing and/or one or more general purpose non-sequential cores intended for general purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Exemplary core architectures
Block diagram of sequential and non-sequential cores
Figure 7A is a block diagram illustrating both an exemplary sequential pipeline and an exemplary register renaming, non-sequential issue/execution pipeline, in accordance with embodiments of the present invention. Figure 7B is a block diagram illustrating both an exemplary embodiment of a sequential architecture core and an exemplary register renaming, non-sequential issue/execution architecture core to be included in a processor, in accordance with embodiments of the present invention. The solid-line boxes in FIGS. 7A-B show the sequential pipeline and sequential core, while the optional addition of the dashed boxes illustrates the register renaming, non-sequential issue/execution pipeline and core. Given that the sequential aspect is a subset of the non-sequential aspect, the non-sequential aspect will be described.
In FIG. 7A, processor pipeline 700 includes a fetch stage 702, a length decode stage 704, a decode stage 706, an allocation stage 708, a renaming stage 710, a scheduling (also known as dispatch or issue) stage 712, a register read/memory read stage 714, an execute stage 716, a write back/memory write stage 718, an exception handling stage 722, and a commit stage 724.
FIG. 7B shows a processor core 790 that includes a front-end unit 730 coupled to an execution engine unit 750, both of which are coupled to a memory unit 770. The core 790 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type.
The front-end unit 730 includes an instruction cache unit 734 coupled to an instruction fetch unit 738, which is coupled to a decode unit 740. The decode unit 740 (or decoder) may decode instructions and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or otherwise reflect, or are derived from, the original instructions.
The set of memory access units 764 is coupled to a memory unit 770 that includes a data TLB unit coupled to a data cache unit 774, which in turn is coupled to a level 2 (L2) cache unit 776.
For example, an exemplary register renaming, non-sequential issue/execution core architecture may implement the pipeline 700 as follows: 1) the instruction fetch unit 738 performs the fetch and length decode stages 702 and 704; 2) the decode unit 740 performs the decode stage 706; 3) the rename/allocator unit 752 performs the allocation stage 708 and the renaming stage 710; and the remaining stages of the pipeline (scheduling, register read/memory read, execute, write back/memory write, exception handling, and commit) are performed by corresponding units of the execution engine unit 750 and the memory unit 770.
The core 790 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added in newer versions), including the instruction(s) disclosed herein; the MIPS instruction set of MIPS Technologies of Sunnyvale, CA; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, CA). In one embodiment, the core 790 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding and simultaneous multithreading thereafter, such as in Intel® Hyper-Threading Technology).
Although register renaming is described in the context of non-sequential execution, it should be understood that register renaming may also be used in a sequential architecture. The illustrated embodiment of the processor also includes separate instruction and data cache units 734/774 and a shared L2 cache unit 776, but alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the caches may be external to the core and/or the processor.
Certain exemplary sequential core architectures
FIGS. 8A-B show a block diagram of a more specific exemplary sequential core architecture, in which the core is one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.
Figure 8A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 802 and its local subset of the level 2 (L2) cache 804, in accordance with embodiments of the present invention.
The local subset of the L2 cache 804 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 804. Data read by a processor core is stored in its L2 cache subset 804 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 804 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional, allowing agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data-path is 1012 bits wide per direction.
FIG. 8B is an expanded view of part of the processor core of FIG. 8A, in accordance with embodiments of the present invention. FIG. 8B includes an L1 data cache 806A (part of the L1 cache), as well as more detail regarding the vector unit 810 and the vector registers 814. Specifically, the vector unit 810 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 828), which executes one or more of integer, single-precision float, and double-precision float instructions. The VPU supports swizzling of the register inputs, numeric conversion, and replication on the memory input.
A processor with integrated memory controller and graphics
FIG. 9 is a block diagram of a processor 900 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, in accordance with embodiments of the present invention. The solid-line boxes in FIG. 9 illustrate a processor 900 with a single core 902A, a system agent 910, and a set of one or more bus controller units 916, while the optional addition of the dashed boxes illustrates an alternative processor 900 with multiple cores 902A-N, a set of one or more integrated memory controller unit(s) 914 in the system agent unit 910, and special purpose logic 908.
Thus, different implementations of the processor 900 may include: 1) a CPU with the special purpose logic 908 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 902A-N being one or more general purpose cores (e.g., general purpose sequential cores, general purpose non-sequential cores, or a combination of the two); 2) a coprocessor with the cores 902A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 902A-N being a large number of general purpose sequential cores. Thus, the processor 900 may be a general purpose processor, a coprocessor, or a special purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a general purpose graphics processing unit (GPGPU), a high-throughput many-core coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 900 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 906, and external memory (not shown) coupled to the set of integrated memory controller units 914. The set of shared cache units 906 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. In one embodiment, a ring-based interconnect unit 912 interconnects the integrated graphics logic 908, the set of shared cache units 906, and the system agent unit 910/integrated memory controller unit(s) 914, although alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between the one or more cache units 906 and the cores 902A-N.
In some embodiments, at least one of the cores 902A-N is capable of multithreading. The system agent 910 includes those components coordinating and operating the cores 902A-N. The system agent unit 910 may include, for example, a power control unit (PCU) and a display unit. The PCU may be, or may include, the logic and components needed to regulate the power state of the cores 902A-N and the integrated graphics logic 908. The display unit is for driving one or more externally connected displays.
The cores 902A-N may be homogeneous or heterogeneous in terms of the architectural instruction set; that is, two or more of the cores 902A-N may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of that instruction set or a different instruction set.
Exemplary computer architecture
Figures 10-13 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.
Referring now to FIG. 10, there is shown a block diagram of a system 1000 in accordance with one embodiment of the present invention. The system 1000 may include one or more processors 1010, 1015, which are coupled to a controller hub 1020.
The optional nature of the additional processors 1015 is denoted in FIG. 10 with broken lines.
In one embodiment, the coprocessor 1045 is a special purpose processor, such as, for example, a high-throughput processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.
There may be a variety of differences between the physical resources 1010 and 1015 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.
In one embodiment, the processor 1010 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within those instructions. The processor 1010 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1045. Accordingly, the processor 1010 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1045, which accepts and executes the received coprocessor instructions.
Referring now to FIG. 11, there is shown a block diagram of a first, more specific exemplary system 1100 in accordance with an embodiment of the present invention. As shown in FIG. 11, the multiprocessor system 1100 is a point-to-point interconnect system and includes a first processor 1170 and a second processor 1180 coupled via a point-to-point interconnect.
A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low power mode.
The chipset may be coupled to a first bus 1116 via an interface. In one embodiment, the first bus 1116 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in FIG. 11, a variety of I/O devices 1114 may be coupled to the first bus 1116, along with a bus bridge that couples the first bus 1116 to a second bus. Various devices may be coupled to the second bus, including, for example, a keyboard and/or mouse, communication devices, and a storage unit, such as a disk drive or other mass storage device, which may include instructions/code and data 1130 in one embodiment. Note that other architectures are possible; for example, instead of the point-to-point architecture of FIG. 11, a system may implement a multi-drop bus or other such architecture.
Turning now to FIG. 12, there is shown a block diagram of a second, more specific exemplary system 1200 in accordance with an embodiment of the present invention. Like elements in FIGS. 11 and 12 bear like reference numerals, and certain aspects of FIG. 11 have been omitted from FIG. 12 in order to avoid obscuring other aspects of FIG. 12.
FIG. 12 illustrates that the processors 1170, 1180 may include integrated memory and I/O control logic ("CL") 1172 and 1182, respectively. Thus, the CL 1172, 1182 include integrated memory controller units and include I/O control logic. FIG. 12 illustrates that not only are the memories coupled to the CL 1172, 1182, but also that I/O devices are coupled to the control logic 1172, 1182, and that legacy I/O devices are coupled to the chipset.
Referring now to FIG. 13, there is shown a block diagram of an SoC 1300 in accordance with one embodiment of the present invention. Like elements in FIG. 9 bear like reference numerals. Also, the dashed boxes are optional features on more advanced SoCs. In FIG. 13, an interconnect unit(s) 1302 is coupled to: an application processor 1310, which includes a set of one or more cores 902A-N and the shared cache unit(s) 906; a system agent unit 910; bus controller unit(s) 916; integrated memory controller unit(s) 914; a set of one or more coprocessors 1320, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit; a direct memory access (DMA) unit; and a display unit for coupling to one or more external displays.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and nonvolatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as the code 1130 illustrated in FIG. 11, may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms disclosed herein are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
At least one aspect of at least one embodiment may be implemented by representative instructions, stored on a machine-readable medium, which represent various logic within the processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memories (PCMs); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the present invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL) code, that defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation (including binary translation, code morphing, etc.)
In some cases, an instruction translator may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction translator may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction translator may be implemented in software, hardware, firmware, or a combination thereof. The instruction translator may be on-processor, off-processor, or part on-processor and part off-processor.
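A trivial illustration of the idea follows. The two "instruction sets" here are invented for the example and have nothing to do with real x86 or ARM encodings; a real binary translator decodes machine encodings, not symbolic names:

```python
# Hypothetical source-to-target mapping for illustration only.
# One source instruction may convert into one or more target instructions.
TRANSLATION_TABLE = {
    "inc": ["addi 1"],           # one source op -> one target op
    "clear": ["xor_self"],       # idiom replaced by an equivalent
    "push": ["sub_sp", "store"], # one source op -> two target ops
}

def translate(source_program):
    """Convert a sequence of source instructions into the
    corresponding sequence of target instructions."""
    target = []
    for insn in source_program:
        target.extend(TRANSLATION_TABLE[insn])
    return target
```

The translated code accomplishes the same general operation while consisting entirely of instructions from the target set, mirroring the behavior described for the instruction converter of FIG. 14.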
FIG. 14 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, in accordance with embodiments of the present invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although the instruction converter may alternatively be implemented in software, firmware, hardware, or various combinations thereof. FIG. 14 shows that a program in a high-level language 1402 may be compiled using an x86 compiler 1404 to generate x86 binary code 1406 that may be natively executed by a processor 1416 with at least one x86 instruction set core. The processor 1416 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1404 represents a compiler operable to generate x86 binary code 1406 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1416 with at least one x86 instruction set core. Similarly, FIG. 14 shows that the program in the high-level language 1402 may be compiled using an alternative instruction set compiler 1408 to generate alternative instruction set binary code 1410 that may be natively executed by a processor 1414 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, CA and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, CA). The instruction converter 1412 is used to convert the x86 binary code 1406 into code that may be natively executed by the processor 1414 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1410, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1412 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1406.
In the description and the claims, the terms "coupled" and/or "connected", along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in embodiments, "connected" may be used to indicate that two or more elements are in direct physical and/or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical and/or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but still co-operate or interact with each other. For example, a core may be coupled with a cache portion through one or more intervening components. In the figures, arrows are used to show connections and couplings.
In the description and claims, the terms "logic", "unit", "module", or "component" may be used. It should be understood that they may include hardware, firmware, software, or a combination thereof. Examples of these include integrated circuits, application specific integrated circuits, analog circuits, digital circuits, program logic devices, memory devices including instructions, and the like. In some embodiments, these may potentially include transistors and / or gates and / or other circuit components.
In the foregoing description, for purposes of explanation, numerous specific details have been set forth in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, that other embodiments may be practiced without some of these specific details. The scope of the invention is not to be determined by the specific examples provided above, but only by the claims. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form, or without detail, in order to avoid obscuring the understanding of the description. Where multiple components have been shown and described, in some cases they may instead be integrated together as a single component. Where a single component has been shown and described, in some cases it may be separated into two or more components.
Various operations and methods have been described. Although some of the methods have been described in a relatively basic form in the flowcharts, operations may optionally be added to methods and / or eliminated in methods. In addition, although the flowcharts illustrate specific sequences of operations in accordance with embodiments, the specific order is exemplary. Alternate embodiments may optionally perform operations in a different order, combine certain operations, duplicate certain operations, and so on.
Certain operations may be performed by hardware components, or may be embodied in machine-executable or circuit-executable instructions that may be used to cause, and/or result in, a machine, circuit, or hardware component (e.g., a processor, portion of a processor, circuit, etc.) programmed with the instructions performing the operations. The operations may also optionally be performed by a combination of hardware and software.
Some embodiments include an article of manufacture (e.g., a computer program product) that includes a non-transitory machine-readable storage medium. Such a non-transitory machine-readable storage medium does not consist of a transitory propagated signal. The non-transitory machine-readable storage medium may include a mechanism that stores information in a form readable by a machine, and may store or provide one or more sequences of instructions that, if and when executed by the machine, cause the machine to perform, and/or result in the machine performing, one or more of the operations, methods, or techniques disclosed herein. Examples of suitable machines include, but are not limited to, processors and computer systems or other electronic devices having such processors. In various embodiments, the non-transitory machine-readable storage medium may include a floppy diskette, an optical storage medium, an optical disk, a CD-ROM, a magnetic disk, a magneto-optical disk, a read-only memory (ROM), a programmable ROM (PROM), an erasable-and-programmable ROM (EPROM), an electrically-erasable-and-programmable ROM (EEPROM), a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a flash memory, a non-volatile memory, a non-volatile data storage device, or the like.
Reference throughout this specification to, for example, "one embodiment", "an embodiment", "one or more embodiments", or "some embodiments" indicates that a particular feature may be included in the practice of the invention, but is not necessarily required to be. Similarly, in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects may lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.
Illustrative Examples
The following examples relate to additional embodiments. The details in these examples may be used in any one or more embodiments.
Example 1 is a processor that includes a cache-side address monitor unit, corresponding to a first cache portion of a distributed cache, having a total number of cache-side address monitor storage locations that is less than the total number of logical processors of the processor. Each cache-side address monitor storage location is to store an address to be monitored. The processor also includes a core-side address monitor unit, corresponding to a first core, having a number of core-side address monitor storage locations equal to the number of one or more logical processors of the first core. Each core-side address monitor storage location is to store an address to be monitored and a monitor state for a different corresponding logical processor of the first core. The processor also includes a cache-side address monitor storage overflow unit, corresponding to the first cache portion, to enforce an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an additional address to be monitored.
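The storage asymmetry of Example 1 — a shared cache-side table smaller than the processor's total thread count versus a core-side table with exactly one location per logical processor — can be sketched as data structures. All names and sizes below are illustrative, not taken from the patent:

```python
class MonitorUnits:
    """Toy model of Example 1's storage asymmetry: a core-side table
    with one location per logical processor, and a cache-side table
    whose total size is smaller than the processor's thread count."""

    def __init__(self, threads_per_core, num_cores, cache_side_locations):
        total_threads = threads_per_core * num_cores
        # The key property of Example 1: fewer cache-side locations
        # than logical processors (hence the need for an overflow policy).
        assert cache_side_locations < total_threads
        # Core-side: one (address, monitor state) entry per logical processor.
        self.core_side = {(core, t): {"addr": None, "state": "idle"}
                          for core in range(num_cores)
                          for t in range(threads_per_core)}
        # Cache-side: a smaller, shared pool of storage locations.
        self.cache_side = [None] * cache_side_locations
```

For example, a 4-core, 2-thread-per-core processor (8 logical processors) might be modeled with only 3 cache-side locations, making the overflow policy of the last sentence necessary.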
Example 2 includes the processor of any preceding example, optionally further including a core-side trigger unit corresponding to the first core and coupled with the core-side address monitor unit. The core-side trigger unit is to trigger a logical processor of the first core when the corresponding core-side address monitor storage location has a ready-to-trigger monitor state and a trigger event is detected.
Example 3 includes the processor of any preceding example, optionally further including a cache-side address monitor storage location reuse unit, coupled with the cache-side address monitor unit, to record monitor requests from different logical processors for a same monitor address in a common cache-side address monitor storage location.
Example 4 includes the processor of Example 3, in which the common cache-side address monitor storage location includes a structure to record the different logical processors that have provided monitor requests for the same monitor address.
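A sketch of Examples 3-4: when two logical processors monitor the same address, the second request is folded into the existing cache-side location, whose per-entry structure records both requesters. The dict layout and function name are assumptions for illustration:

```python
def record_monitor_request(table, addr, logical_processor):
    """Reuse a common cache-side storage location for monitor
    requests that target the same monitor address (cf. Examples 3-4)."""
    for entry in table:
        if entry["addr"] == addr:
            # Shared location: record the additional requester.
            entry["requesters"].add(logical_processor)
            return entry
    # First request for this address: consume a new location.
    entry = {"addr": addr, "requesters": {logical_processor}}
    table.append(entry)
    return entry
```

This is what lets the total number of cache-side locations stay below the number of logical processors: requests for a common address consume only one location.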
Example 5 includes the processor of any preceding example, in which the processor has more than 40 hardware threads and the total number of cache-side address monitor storage locations of the cache-side address monitor unit corresponding to the first cache portion is at least 20 but less than the total number of the more than 40 hardware threads.
Example 6 includes the processor of any preceding example, in which the total number of cache-side address monitor storage locations of the cache-side address monitor unit is less than the total number of logical processors of the processor, while still being large enough to keep the likelihood of an overflow of the cache-side address monitor storage locations low.
Example 7 includes the processor of any preceding example, in which, in response to an instruction indicating a first address to be monitored, the cache-side address monitor unit is to store the first address in a cache-side address monitor storage location, and the core-side address monitor unit is to store the first address in a core-side address monitor storage location.
Example 8 includes a processor of any of the preceding examples, and the logical processors are hardware threads.
Example 9 includes the processor of any preceding example, in which the cache-side address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes forcing read transactions to use a shared state.
Example 10 includes the processor of any preceding example, in which the cache-side address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes sending invalidation requests to all cores that may have pending monitor requests.
Example 11 includes the processor of Example 10, in which the cache-side address monitor storage overflow unit is to consult an overflow structure to determine which cores may have pending monitor requests.
Example 12 is a system for processing instructions that includes an interconnect and a processor coupled with the interconnect. The processor includes a first address monitor unit of a cache portion control unit, corresponding to a first cache portion of a distributed cache, having a total number of address monitor storage locations that is less than the total number of hardware threads of the processor. Each address monitor storage location is to store an address to be monitored. The processor also includes a second address monitor unit of a core interface unit, corresponding to a first core, having a number of address monitor storage locations equal to the number of one or more hardware threads of the first core. Each address monitor storage location of the second address monitor unit is to store an address to be monitored and a monitor state for a different corresponding hardware thread of the first core. The processor also includes an address monitor storage overflow unit to implement an address monitor storage overflow policy when all address monitor storage locations of the first address monitor unit are in use and none is available to store an address for an additional monitor request. The system also includes a dynamic random access memory coupled with the interconnect, a wireless communication device coupled with the interconnect, and an image capture device coupled with the interconnect.
Example 13 includes the system of Example 12, in which the address monitor storage overflow unit is to enforce an address monitor storage overflow policy that includes forcing read transactions to use a shared state and sending invalidation requests to all cores that may have pending monitor requests.
Example 14 includes the system of any of Examples 12-13, in which the processor has more than 40 hardware threads and the total number of address monitor storage locations of the first address monitor unit is at least 20 but less than the total number of the more than 40 hardware threads of the processor.
Example 15 includes the system of any of Examples 12-14, in which the processor includes an address monitor storage location reuse unit of the cache portion control unit to record monitor requests from different hardware threads for the same monitor address in a common address monitor storage location.
Example 16 is a method in a processor that includes receiving a first instruction that indicates an address and indicates to monitor for writes to the address from a first logical processor of a first core of a multi-core processor. In response to the first instruction, the method includes storing the address indicated by the first instruction in a first core-side address monitor storage location of a plurality of core-side address monitor storage locations corresponding to the first core. The number of the plurality of core-side address monitor storage locations is equal to the number of logical processors of the first core. The method also includes storing the address indicated by the first instruction in a first cache-side address monitor storage location of a plurality of cache-side address monitor storage locations corresponding to a first cache portion of a distributed cache. The total number of cache-side address monitor storage locations is less than the total number of logical processors of the multi-core processor. The method further includes changing the monitor state to a speculative state.
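The arming flow of Example 16 can be sketched as follows. This is a hedged behavioral model, not the patented implementation: the `monitor()` function, the table sizes, and the slot layout are assumptions, used only to show the two stores (core-side per-thread slot, cache-side shared slot) and the transition to the speculative state.

```python
# Illustrative sketch of the arming flow in Example 16.
NUM_CORES, THREADS_PER_CORE, CACHE_SLOTS = 2, 2, 3

# Core-side storage: one slot per logical processor of each core.
core_side = [[{"address": None, "state": "idle"}
              for _ in range(THREADS_PER_CORE)]
             for _ in range(NUM_CORES)]
# Cache-side storage for one cache portion: fewer slots than total threads.
cache_side = [None] * CACHE_SLOTS

def monitor(core, thread, address):
    # 1. Record the address in the requesting thread's core-side slot and
    #    move its monitor state to "speculative".
    core_side[core][thread] = {"address": address, "state": "speculative"}
    # 2. Record the address in a free cache-side slot of the portion that
    #    owns the address.
    for i, slot in enumerate(cache_side):
        if slot is None:
            cache_side[i] = {"address": address}
            return True
    # No free cache-side slot: the overflow policy of Examples 19-20 applies.
    return False

armed = monitor(0, 1, 0x1000)
```

A later write to `0x1000` observed by the cache portion would be matched against `cache_side` and forwarded to the core, whose core-side slot identifies which logical processor to wake.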
Example 17 includes the method of Example 16, further comprising receiving a second instruction that also indicates the address and indicates to monitor for writes to the address from a second logical processor of a second core, and recording a monitor request for the address for the second core in the first cache-side address monitor storage location.
Example 18 includes the method of Example 17, in which recording the monitor request for the address for the second core in the first cache-side address monitor storage location includes changing a bit in a core mask that has a different bit corresponding to each core of the multi-core processor.
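The core mask of Example 18 is a plain per-slot bitmask. The sketch below is illustrative (the field names `address` and `core_mask` are assumptions): one shared cache-side slot records every core monitoring the same address by setting that core's bit, and the mask can later be decoded to find which cores need a wake-up when a write is detected.

```python
# Sketch of the per-slot core mask from Example 18 (field names illustrative).
slot = {"address": 0x2000, "core_mask": 0}

def record_monitor(slot, core_id):
    # Set the distinct bit corresponding to this core.
    slot["core_mask"] |= 1 << core_id

def cores_to_wake(slot):
    # Decode the mask back into the list of cores with pending monitors.
    return [c for c in range(slot["core_mask"].bit_length())
            if slot["core_mask"] >> c & 1]

record_monitor(slot, 1)   # core 1 monitors address 0x2000
record_monitor(slot, 3)   # core 3 monitors the same address, same slot
# cores_to_wake(slot) now yields [1, 3]
```

This is what lets multiple hardware threads share one cache-side storage location (Examples 15 and 17) instead of consuming a slot each.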
Example 19 includes the method of any preceding example, further comprising receiving a second instruction that indicates a second address and indicates to monitor for writes from the first logical processor to the second address, determining that no cache-side address monitor storage location is available among the plurality of cache-side address monitor storage locations corresponding to the first cache portion, and optionally entering a cache-side address monitor storage location overflow mode.
Example 20 includes the method of Example 19, further comprising, during the cache-side address monitor storage location overflow mode, forcing all read transactions corresponding to the first cache portion to use a shared cache coherency state, and sending invalidation requests corresponding to the first cache portion to all cores of the multi-core processor that may have one or more pending monitor requests.
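The overflow behavior of Examples 19-20 can be sketched as a small model. Assumptions are flagged in the comments: the class, its fields, and the one-slot sizing are illustrative. The two policy halves are (a) downgrading reads to the shared coherency state, so any subsequent write must broadcast an invalidation, and (b) sending those invalidations to every core that may hold a pending monitor.

```python
# Illustrative model of the cache-side overflow policy (Examples 19-20).
class CachePortion:
    def __init__(self, num_slots):
        self.free_slots = num_slots
        self.overflow = False
        self.maybe_monitoring = set()   # cores that may have pending monitors

    def try_monitor(self, core_id):
        self.maybe_monitoring.add(core_id)
        if self.free_slots > 0:
            self.free_slots -= 1        # normal case: dedicated slot
        else:
            self.overflow = True        # no slot free: enter overflow mode

    def read_state(self, requested_state):
        # (a) In overflow mode, even an exclusive read is forced to shared,
        # so a later write is guaranteed to generate an invalidation.
        return "shared" if self.overflow else requested_state

    def on_write(self):
        # (b) In overflow mode, invalidations go to every core that may be
        # monitoring, since per-address tracking has been lost.
        return sorted(self.maybe_monitoring) if self.overflow else []

portion = CachePortion(num_slots=1)
portion.try_monitor(0)   # fills the only slot
portion.try_monitor(2)   # overflows: policy engaged
```

The policy trades precision for correctness: no monitor is ever missed, at the cost of extra invalidation traffic while the mode is active.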
Example 21 includes the method of any preceding example, further comprising receiving a second instruction from the first logical processor indicating the address, and optionally, in response to the second instruction, changing the monitor state to a trigger wait state.
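Examples 16 and 21 together imply a small per-thread state machine: the first (MONITOR-like) instruction arms the slot in a speculative state, the second (MWAIT-like) instruction moves it to the trigger wait state, and a detected write triggers the wake-up. The sketch below uses the state names from the examples; the transition table itself, including the assumption that a write arriving before the second instruction still fires the monitor, is illustrative.

```python
# Illustrative per-thread monitor state machine (Examples 16 and 21).
TRANSITIONS = {
    ("idle",         "monitor"): "speculative",    # first instruction arms
    ("speculative",  "mwait"):   "trigger_wait",   # second instruction waits
    ("speculative",  "write"):   "triggered",      # assumed: early write fires
    ("trigger_wait", "write"):   "triggered",      # trigger event detected
}

def step(state, event):
    # Unlisted (state, event) pairs leave the state unchanged.
    return TRANSITIONS.get((state, event), state)

s = "idle"
s = step(s, "monitor")   # -> speculative
s = step(s, "mwait")     # -> trigger_wait
s = step(s, "write")     # -> triggered: the core-side trigger unit may now
                         #    wake the waiting logical processor
```

This matches the claim language in which the core-side trigger unit fires "when a trigger event is detected while the monitor state ... is ready to trigger."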
Example 22 includes a processor or other device that performs any of the methods of Examples 16-21.
Example 23 includes a processor or device including means for performing the method of any of Examples 16-21.
Example 24 includes integrated circuits and/or logic and/or units and/or components and/or modules, or any combination thereof, to perform any of the methods of Examples 16-21.
Example 25 is an optionally non-transitory machine-readable medium that optionally stores or otherwise provides one or more instructions that, if and/or when executed by a machine, are operable to cause the machine to perform the method of any of Examples 16-21.
Example 26 includes a computer system including an interconnect, a processor coupled to the interconnect, and at least one of a DRAM, a graphics chip, a wireless communication chip, a phase change memory, and a video camera coupled to the interconnect, in which the processor and/or the computer system is to perform the method of any of Examples 16-21.
Example 27 includes a processor or other device to perform one or more operations or any method substantially as described herein.
Example 28 includes a processor or other device including means for performing one or more operations or any method substantially as described herein.
Example 29 includes a processor or other device to perform the instructions substantially as described herein.
Example 30 includes a processor or other device including means for performing the instructions substantially as described herein.
Claims (25)
A cache-side address monitor that includes at least some circuitry, corresponds to and is coupled with a first cache portion of a distributed cache, and has a total number of cache-side address monitor storage locations less than a total number of logical processors of the processor, wherein the distributed cache, during operation, includes a plurality of cache portions each mapped to non-overlapping ranges of addresses, wherein each cache-side address monitor storage location is to store an address for which the cache-side address monitor is to monitor for writes, and wherein the cache-side address monitor storage locations are not part of the distributed cache;
A core-side address monitor that includes at least some circuitry, corresponds to and is coupled with a first core, and has the same number of core-side address monitor storage locations as the number of one or more logical processors of the first core, wherein each core-side address monitor storage location is to store an address for which the core-side address monitor is to monitor for writes and a monitor state for a different corresponding logical processor of the first core;
A cache-side address monitor storage overflow unit that includes at least some circuitry and is coupled with the cache-side address monitor, to implement an address monitor storage overflow policy when no unused cache-side address monitor storage location is available to store an additional address to be monitored; And
A core-side trigger unit that includes at least some circuitry and corresponds to and is coupled with the first core and the core-side address monitor,
Wherein the core-side trigger unit is to trigger a logical processor of the first core when a trigger event is detected while the monitor state of the corresponding core-side address monitor storage location is ready to trigger.
Further comprising a cache-side address monitor storage location reuse unit that includes at least some circuitry and is coupled with the cache-side address monitor, to record monitor requests from different logical processors for the same monitor address in a common cache-side address monitor storage location.
Wherein the common cache-side address monitor storage location comprises a structure for recording the different logical processors that have provided the monitor requests for the same monitor address.
Wherein the processor has more than 40 hardware threads, and the total number of cache-side address monitor storage locations of the cache-side address monitor corresponding to the first cache portion is at least 20 but less than the total number of the more than 40 hardware threads.
Wherein, in response to an instruction indicating a first address to be monitored, the cache-side address monitor is to store the first address in a cache-side address monitor storage location, and the core-side address monitor is to store the first address in a core-side address monitor storage location.
Wherein the one or more logical processors of the first core comprise hardware threads.
Wherein the cache-side address monitor storage overflow unit is to force read transactions to use a shared state.
Wherein the cache-side address monitor storage overflow unit is to send invalidation requests only to a subset of cores whose core identifiers are stored.
Wherein the cache-side address monitor storage overflow unit is to check an overflow structure to determine the subset of the cores.
An interconnect;
A processor coupled to the interconnect, the processor comprising:
A cache portion control unit including a first address monitor that includes at least some circuitry, corresponds to and is coupled with a first cache portion of a distributed cache, and has a total number of address monitor storage locations less than the total number of hardware threads of the processor, wherein the distributed cache, during operation, includes a plurality of cache portions each mapped to non-overlapping ranges of addresses, wherein each address monitor storage location is to store an address for which the cache portion control unit is to monitor for writes, and wherein the address monitor storage locations are different from the distributed cache;
A core interface unit that includes at least some circuitry, corresponds to and is coupled with a first core, and includes a second address monitor having the same number of address monitor storage locations as the number of one or more hardware threads of the first core, wherein each address monitor storage location of the second address monitor is to store an address for which the core interface unit is to monitor for writes and a monitor state for a different corresponding hardware thread of the first core;
An address monitor storage overflow unit of the cache portion control unit that includes at least some circuitry and is coupled with the first address monitor, to implement an address monitor storage overflow policy when all of the address monitor storage locations of the first address monitor are in use and none is available to store an address for a monitor request; And
A core-side trigger unit including at least some circuitry and corresponding to and coupled to the first core,
Wherein the core-side trigger unit is to trigger a hardware thread of the first core when a trigger event is detected while the monitor state of the corresponding address monitor storage location is ready to trigger;
A dynamic random access memory coupled to the interconnect;
A wireless communication device coupled to the interconnect; And
An image capture device coupled to the interconnect.
Wherein the address monitor storage overflow policy that the address monitor storage overflow unit is to implement comprises:
Forcing read transactions to use a shared state; And
Sending invalidation requests only to a subset of cores whose core identifiers are stored.
Wherein the processor has more than 40 hardware threads, and the total number of address monitor storage locations of the first address monitor is at least 20 but less than the total number of the more than 40 hardware threads of the processor.
Wherein the processor further comprises an address monitor storage location reuse unit of the cache portion control unit, including at least some circuitry, to record monitor requests from different hardware threads for the same monitor address in a common address monitor storage location.
Receiving a first instruction indicating an address and indicating to monitor for writes from a first logical processor of the first core of the multi-core processor to the address; And
In response to the first instruction:
Storing the address indicated by the first instruction in a first core-side address monitor storage location of a plurality of core-side address monitor storage locations corresponding to the first core, the number of the plurality of core-side address monitor storage locations being equal to the number of logical processors of the first core;
Storing the address indicated by the first instruction in a first cache-side address monitor storage location of a plurality of cache-side address monitor storage locations corresponding to a first cache portion of a distributed cache that includes a plurality of cache portions each mapped to non-overlapping ranges of addresses, wherein the plurality of cache-side address monitor storage locations are not part of the distributed cache, and wherein the total number of cache-side address monitor storage locations is less than the total number of logical processors of the multi-core processor;
Causing the processor to activate monitoring for writes to the address; And
Changing the monitor state to a speculative state.
The method comprises:
Detecting a write to the address; And
Sending a wake-up signal from the core-side trigger unit to the first logical processor.
Receiving a second instruction that also indicates the address and indicates to monitor for writes to the address from a second logical processor of a second core; And
Recording a monitor request for the address for the second core in the first cache-side address monitor storage location.
Wherein recording the monitor request for the address for the second core in the first cache-side address monitor storage location comprises changing a bit in a core mask having a different bit corresponding to each core of the multi-core processor.
Receiving a second instruction indicating a second address and indicating to monitor for writes from the first logical processor to the second address;
Determining that there are no cache-side address monitor storage locations available among the plurality of cache-side address monitor storage locations corresponding to the first cache portion; And
Entering a cache-side address monitor storage location overflow mode.
During the cache-side address monitor storage location overflow mode,
Enforcing all read transactions corresponding to the first cache portion to use a shared cache coherency state; And
Sending invalidation requests corresponding to the first cache portion only to a subset of cores of the multi-core processor having one or more pending monitor requests.
Receiving a second instruction from the first logical processor indicating the address; And
In response to the second instruction, changing the monitor state to a trigger wait state.
Wherein the number of the plurality of core-side address monitor storage locations is equal to the number of hardware threads of the first core.
An integrated circuit;
A cache portion control unit integrated on the integrated circuit, corresponding to and coupled with a first cache portion of a distributed cache, and having a total number of cache-side address monitor storage locations less than the total number of logical processors of the processor, wherein each cache-side address monitor storage location is to store an address for which the cache portion control unit is to monitor for writes, and wherein the cache-side address monitor storage locations are different from the distributed cache;
A core interface unit integrated on the integrated circuit, corresponding to and coupled with a first core, and having the same number of core-side address monitor storage locations as the number of one or more logical processors of the first core, wherein each core-side address monitor storage location is to store an address for which the core interface unit is to monitor for writes and a monitor state for a different corresponding logical processor of the first core; And
A core-side trigger unit that includes at least some circuitry and corresponds to and is coupled with the first core,
Wherein the core-side trigger unit is to trigger a logical processor of the first core when a trigger event is detected while the monitor state of the corresponding core-side address monitor storage location is ready to trigger.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2014/059130 WO2015048826A1 (en) | 2013-09-27 | 2014-10-03 | Scalably mechanism to implement an instruction that monitors for writes to an address |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20160041950A KR20160041950A (en) | 2016-04-18 |
KR101979697B1 true KR101979697B1 (en) | 2019-05-17 |
Family
ID=56973722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020167005327A KR101979697B1 (en) | 2014-10-03 | 2014-10-03 | Scalably mechanism to implement an instruction that monitors for writes to an address |
Country Status (3)
Country | Link |
---|---|
JP (1) | JP6227151B2 (en) |
KR (1) | KR101979697B1 (en) |
CN (1) | CN105683922B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10289516B2 (en) * | 2016-12-29 | 2019-05-14 | Intel Corporation | NMONITOR instruction for monitoring a plurality of addresses |
US10860487B2 (en) * | 2019-04-17 | 2020-12-08 | Chengdu Haiguang Integrated Circuit Design Co. Ltd. | Multi-core processing device and method of transferring data between cores thereof |
CN111857591A (en) | 2020-07-20 | 2020-10-30 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer-readable storage medium for executing instructions |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070282928A1 (en) * | 2006-06-06 | 2007-12-06 | Guofang Jiao | Processor core stack extension |
US20080005504A1 (en) * | 2006-06-30 | 2008-01-03 | Jesse Barnes | Global overflow method for virtualized transactional memory |
US20090172284A1 (en) * | 2007-12-28 | 2009-07-02 | Zeev Offen | Method and apparatus for monitor and mwait in a distributed cache architecture |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7363474B2 (en) * | 2001-12-31 | 2008-04-22 | Intel Corporation | Method and apparatus for suspending execution of a thread until a specified memory access occurs |
US7213093B2 (en) * | 2003-06-27 | 2007-05-01 | Intel Corporation | Queued locks using monitor-memory wait |
2014
- 2014-10-03 CN CN201480047555.XA patent/CN105683922B/en active Active
- 2014-10-03 JP JP2016545961A patent/JP6227151B2/en active Active
- 2014-10-03 KR KR1020167005327A patent/KR101979697B1/en active IP Right Grant
Also Published As
Publication number | Publication date |
---|---|
CN105683922A (en) | 2016-06-15 |
CN105683922B (en) | 2018-12-11 |
JP6227151B2 (en) | 2017-11-08 |
JP2016532233A (en) | 2016-10-13 |
KR20160041950A (en) | 2016-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10705961B2 (en) | Scalably mechanism to implement an instruction that monitors for writes to an address | |
US20180225211A1 (en) | Processors having virtually clustered cores and cache slices | |
US9740617B2 (en) | Hardware apparatuses and methods to control cache line coherence | |
US10248568B2 (en) | Efficient data transfer between a processor core and an accelerator | |
US9934146B2 (en) | Hardware apparatuses and methods to control cache line coherency | |
US10409727B2 (en) | System, apparatus and method for selective enabling of locality-based instruction handling | |
US9361233B2 (en) | Method and apparatus for shared line unified cache | |
US20170185515A1 (en) | Cpu remote snoop filtering mechanism for field programmable gate array | |
US9690706B2 (en) | Changing cache ownership in clustered multiprocessor | |
US20170286118A1 (en) | Processors, methods, systems, and instructions to fetch data to indicated cache level with guaranteed completion | |
US10102124B2 (en) | High bandwidth full-block write commands | |
US9898298B2 (en) | Context save and restore | |
US9201792B2 (en) | Short circuit of probes in a chain | |
US9146871B2 (en) | Retrieval of previously accessed data in a multi-core processor | |
US10705962B2 (en) | Supporting adaptive shared cache management | |
US20170286301A1 (en) | Method, system, and apparatus for a coherency task list to minimize cache snooping between cpu and fpga | |
US10402336B2 (en) | System, apparatus and method for overriding of non-locality-based instruction handling | |
KR101979697B1 (en) | Scalably mechanism to implement an instruction that monitors for writes to an address | |
US9436605B2 (en) | Cache coherency apparatus and method minimizing memory writeback operations | |
US9037804B2 (en) | Efficient support of sparse data structure access | |
US20200174929A1 (en) | System, Apparatus And Method For Dynamic Automatic Sub-Cacheline Granularity Memory Access Control | |
WO2018001528A1 (en) | Apparatus and methods to manage memory side cache eviction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
AMND | Amendment | ||
E902 | Notification of reason for refusal | ||
AMND | Amendment | ||
E90F | Notification of reason for final refusal | ||
E601 | Decision to refuse application | ||
AMND | Amendment | ||
X701 | Decision to grant (after re-examination) | ||
GRNT | Written decision to grant |