
US20070294516A1 - Switch prefetch in a multicore computer chip - Google Patents


Info

Publication number
US20070294516A1
Authority
US
Grant status
Application
Patent type
Prior art keywords
data
processor
interval
instruction
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US11454245
Other versions
US7502913B2 (en)
Inventor
Paul R. Barham
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084 - Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893 - Caches characterised by their organisation or structure
    • G06F12/0897 - Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for programme control, e.g. control unit
    • G06F9/06 - Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/30 - Arrangements for executing machine-instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3004 - Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 - Prefetch instructions; cache control instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRICAL DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for programme control, e.g. control unit
    • G06F9/06 - Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
    • G06F9/30 - Arrangements for executing machine-instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836 - Instruction issuing, e.g. dynamic instruction scheduling, out of order instruction execution
    • G06F9/3851 - Instruction issuing, e.g. dynamic instruction scheduling, out of order instruction execution from multiple instruction streams, e.g. multistreaming

Abstract

Systems and methods for switch prefetch in multicore computer chips can allow a programmer to tailor operations of a computer program to available data. Control-flow decisions can be made by the program based on the availability of data in a cache. For example, a new instruction in a processor instruction set can receive a list comprising pairs of data addresses and code addresses. The processor can look for data items corresponding to the listed data addresses, and find the first available data item in the cache. When a cached data item is found, control is transferred to the code address supplied in the table. If no data is in the cache, then the processor can stall until the most quickly fetched data item is available.

Description

    BACKGROUND
  • [0001]
Moore's Law says that the number of transistors we can fit on a silicon wafer doubles every year or so. No exponential lasts forever, but we can reasonably expect that this trend will continue to hold over the next decade. Moore's Law means that future computers will be much more powerful and much less expensive; there will be many more of them, and they will be interconnected.
  • [0002]
    Moore's Law is continuing, as can be appreciated with reference to FIG. 1, which provides trends in transistor counts in processors capable of executing the x86 instruction set. However, another trend is about to end. Many people know only a simplified version of Moore's Law: “Processors get twice as fast (measured in clock rate) every year or two.” This simplified version has been true for the last twenty years but it is about to stop. Adding more transistors to a single-threaded processor no longer produces a faster processor. Increasing system performance must now come from multiple processor cores on a single chip. In the past, existing sequential programs ran faster on new computers because the sequential performance scaled, but that will no longer be true.
  • [0003]
Future systems will look increasingly unlike current systems. We won't have faster and faster processors in the future, just more and more. This hardware revolution is already starting, with 2-8 core computer chip designs appearing commercially. Most embedded processors already use multi-core designs. Desktop and server processors have lagged behind, due in part to the difficulty of general-purpose concurrent programming.
  • [0004]
It is likely that in the not too distant future chip manufacturers will ship massively parallel, homogenous, many-core architecture computer chips. These will appear, for example, in traditional PCs and entertainment PCs, and cheap supercomputers. Each processor die may hold tens or even hundreds of processor cores.
  • [0005]
    In a multicore system, processors may store and read data from any number of cache levels. For example, a first cache may be accessed and modified by only a single processor, while a second cache may be associated with a small group of processors, and a third cache is associated with a wider group of processors, and so on. A problem with such a configuration is that cache access becomes dramatically more expensive, in terms of processor clock cycles, as caches are farther away from the accessing processor. A search for desired data in a “level one” cache can be conducted relatively quickly, while a search of a “level two” cache requires much more time, and a “level three” search may require a relatively enormous amount of time, when compared to the time necessary for level one or level two searches. Therefore, tailoring the amount of time spent on memory access is a problem that will increasingly emerge in the computing industry.
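The escalating cost of searching successively more distant caches, as described in paragraph [0005], can be sketched with a toy cost model. The cycle counts and names below are illustrative assumptions for the example, not figures taken from this disclosure:

```c
#include <stddef.h>

#define L1_CYCLES  4u    /* assumed level-one latency   */
#define L2_CYCLES  12u   /* assumed level-two latency   */
#define L3_CYCLES  40u   /* assumed level-three latency */
#define MEM_CYCLES 300u  /* assumed main-memory latency */

/* level: 1..3 for the cache level that finally hit; any other value
 * means the lookup fell through to main memory.  Returns the total
 * cycles spent searching down to that level. */
unsigned lookup_cost(int level)
{
    unsigned cost = L1_CYCLES;
    if (level == 1) return cost;
    cost += L2_CYCLES;
    if (level == 2) return cost;
    cost += L3_CYCLES;
    if (level == 3) return cost;
    return cost + MEM_CYCLES;
}
```

With these assumed latencies a level-three hit costs roughly an order of magnitude more than a level-one hit, which is the disparity that motivates bounding the time spent on memory access.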
  • SUMMARY
  • [0006]
    In consideration of the above-identified shortcomings of the art, the present invention provides systems and methods for switch prefetch in multicore computer chips. In one exemplary embodiment, a programmer may tailor operations of a computer program to available data by making control-flow decisions based on the availability of data in a cache. A new instruction in a processor instruction set (referred to herein as a “module”) can receive a list comprising pairs of data addresses and code addresses. The module can look for the listed data, and find the first available data in the cache. When a cached data item is found, control is transferred to the code address supplied in the table. If no data is in the cache, then the processor can stall until the most quickly fetched data item is available. Other embodiments, features and advantages of the invention are described below.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0007]
    The systems and methods for switch prefetch in a multicore computer chip in accordance with the present invention are further described with reference to the accompanying drawings in which:
  • [0008]
    FIG. 1 illustrates trends in transistor counts in processors capable of executing the x86 instruction set.
  • [0009]
FIG. 2 illustrates a multicore computer chip that comprises a variety of exemplary components, such as several general-purpose controller, graphics, and digital signal processing computation powerhouses.
  • [0010]
FIG. 3 illustrates an overview of a system with an application layer, an OS layer, and a multicore computer chip.
  • [0011]
    FIG. 4 illustrates a chip 450 with a processor 440 accepting data addresses 401-406 and corresponding code addresses 401A-406A. The processor 440 looks for the data 401B-406B identified by addresses 401-406 in the various caches 410, 420, 430, and once it finds a first data, e.g. data 402B (in the claims, this is the data available in a shortest interval), the processor 440 executes code at the corresponding code address 402A.
  • [0012]
    FIG. 5 illustrates an application 550 that has some instructions 551-553 that need executing. Application 550 gives instructions 551-553 to processor 540, along with an acceptable interval 507. The processor 540 looks in caches 510, 520, 530 for the data 501-506 it needs to execute the instructions 551-553. Processor 540 will execute instructions based on the data that are discoverable during the acceptable interval.
  • [0013]
    FIG. 6 illustrates a method for fetching data for a processor in which a plurality of addresses are provided to a processor, the processor finds first available data, and executes code corresponding to the first available data.
  • [0014]
    FIG. 7 illustrates a method for fetching data for a processor in which an address is provided to a processor along with an acceptable stall interval. The processor can wait for the acceptable interval, execute any code corresponding to retrieved data, and move on to other tasks.
  • [0015]
    FIG. 8 illustrates various aspects of an exemplary computing device in which the invention may be deployed.
  • DETAILED DESCRIPTION
  • [0016]
    Certain specific details are set forth in the following description and figures to provide a thorough understanding of various embodiments of the invention. Certain well-known details often associated with computing and software technology are not set forth in the following disclosure, however, to avoid unnecessarily obscuring the various embodiments of the invention. Further, those of ordinary skill in the relevant art will understand that they can practice other embodiments of the invention without one or more of the details described below. Finally, while various methods are described with reference to steps and sequences in the following disclosure, the description as such is for providing a clear implementation of embodiments of the invention, and the steps and sequences of steps should not be taken as required to practice this invention.
  • [0017]
    In modern computer chips, level two cache misses generally take several hundred processor cycles to satisfy. Main memory systems are often composed of multiple banks and memory controllers configured to safely reorder cache fetches to make best use of underlying memory systems. Thus it can be very difficult to predict how long it will take to satisfy a cache miss. One solution to this problem is to provide prefetch instructions which allow the programmer to tell the memory system that a cache line will be needed before the processor has to stall waiting for the data. Such approaches may be used in tandem with the solutions proposed here.
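A conventional prefetch instruction of the kind described in paragraph [0017] can be issued from software via a compiler intrinsic. The sketch below uses the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance `PF_DIST` is an illustrative tuning parameter, not something specified by this disclosure:

```c
#include <stddef.h>

#define PF_DIST 8 /* elements to prefetch ahead (assumed tuning parameter) */

/* Sum an array, prefetching a cache line several elements ahead so the
 * memory system can overlap the fetch with ongoing computation. */
long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&data[i + PF_DIST], 0, 1); /* read, low locality */
        total += data[i];
    }
    return total;
}
```

Such a prefetch is a hint only: it changes when a line is fetched, not what the program computes, which is why the switch prefetch below can be layered on top of it.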
  • [0018]
A “switch prefetch” is described herein which allows more sophisticated control over memory access activity. In one embodiment, as provided above, a programmer can make control-flow decisions based on the availability of data in the cache. A processor can discover which of a plurality of data items is available in a shortest interval, and immediately execute a corresponding instruction. In another embodiment, for example, a processor stall interval can be specified. The processor will stall and wait for retrieval of desired data, but only for the duration of the stall interval. After the interval has elapsed, the processor may proceed to other tasks.
  • [0019]
FIG. 2 gives an exemplary computer chip 200 that comprises a wide variety of components. Though not limited to systems comprising chips such as chip 200, it is contemplated that aspects of the invention are particularly useful in multicore computer chips, and the invention is generally discussed in this context. Chip 200 may include, for example, several general-purpose controller, graphics, and digital signal processing computation powerhouses. This allows for maximum increase of localized clock frequencies and improved system throughput. As a consequence, the system's processes can be distributed over the available processors to minimize context switching overhead.
  • [0020]
    It will be appreciated that a multicore computer chip 200 such as that of FIG. 2 can comprise a plurality of components including but not limited to processors, memories, caches, buses, and so forth. For example, chip 200 is illustrated with shared memory 201-205, exemplary bus 207, main CPUs 210-211, a plurality of Digital Signal Processors (DSP) 220-224, Graphics Processing Units (GPU) 225-227, caches 230-234, crypto processors 240-243, watchdog processors 250-253, additional processors 261-279, routers 280-282, tracing processors 290-292, key storage 295, Operating System (OS) controller 297, and pins 299.
  • [0021]
    Components of chip 200 may be grouped into functional groups. For example, router 282, shared memory 203, a scheduler running on processor 269, cache 230, main CPU 210, crypto processor 240, watchdog processor 250, and key storage 295 may be components of a first functional group. Such a group might generally operate in tighter cooperation with other components in the group than with components outside the group. A functional group may have, for example, caches that are accessible only to the components of the group.
  • [0022]
    In general, processors such as 210 and 211 comprise an “instruction set” which exposes a plurality of functions that can be executed on behalf of applications. Because the term “instruction” is used herein to refer to instructions that an application gives to a processor, an “instruction” in a processor's instruction set will be referred to herein as a “module.”
  • [0023]
FIG. 3 illustrates an overview of a system with an application layer, an operating system (OS) layer, and a multicore computer chip. The OS 310 is executed by the chip 320 and typically maintains primary control over the activities of the chip 320. Applications 301-303 access hardware such as chip 320 via the OS 310. The OS 310 manages chip 320 in various ways that may be invisible to applications 301-303, so that much of the complexity in programming applications 301-303 is removed.
  • [0024]
A multicore computer chip such as 320 may have multiple processors 331-334, each with various levels of available cache. For example, each processor 331-334 may have a private level one cache 341-344, and a level two cache 351 or 352 that is available to a subgroup of processors, e.g. 331-332 or 333-334, respectively. Any number of further cache levels may also be accessible to processors 331-334, e.g. level three cache 360, which is illustrated as being accessible to all of processors 331-334. The interoperation of processors 331-334 and the various ways in which caches 341-344, 351-352, and 360 are accessed may be controlled by logic in the processors themselves, e.g. by one or more modules in a processor's instruction set. This may also be controlled by OS 310 and applications 301-303.
  • [0025]
    Data items may be stored in caches 341-344, 351-352, and 360. Typically, data items are identified by the addresses at which they reside in the main memory. The data logically resides at those addresses in main memory, but copies of the data may also reside in one or more caches 341-344, 351-352, and 360. Depending on the cache-coherency protocol in use, the caches may also contain modified data items which have not yet been written back to main memory.
  • [0026]
Processor instructions usually access data items of several different sizes up to the native “word-size” of the machine (e.g. 32 or 64 bits). Processors contemplated by the invention may identify the “effective address” of data items in any of the ways presently used by processor load and store instructions, or any future-developed technique.
  • [0027]
    Caches 341-344, 351-352, and 360 are typically divided into a number of fixed sized entries called cache-lines. These will frequently be larger than the word-size of the machine, e.g., 64/128 bytes. To keep track of which data items are in a cache, the cache typically remembers the address from which the data item(s) in each cache-line originally came. Each cache line usually has a ‘tag’ which records the address of the data held in that cache line.
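The tag/index/offset decomposition described in paragraph [0027] can be illustrated for one assumed cache geometry (64-byte lines, 1024 direct-mapped sets); the sizes and names here are examples only, since the disclosure fixes no particular geometry beyond noting 64/128-byte lines:

```c
#include <stdint.h>

#define LINE_BYTES 64u   /* cache-line size in bytes (assumed) */
#define NUM_SETS   1024u /* number of sets, direct-mapped (assumed) */

typedef struct {
    uint64_t tag;    /* recorded per line to identify the resident data */
    uint32_t index;  /* selects the set the line may occupy */
    uint32_t offset; /* byte position within the cache line */
} cache_addr_t;

/* Decompose a memory address the way a cache does when deciding where
 * a data item may reside and which tag to record for it. */
cache_addr_t split_address(uint64_t addr)
{
    cache_addr_t a;
    a.offset = (uint32_t)(addr % LINE_BYTES);
    a.index  = (uint32_t)((addr / LINE_BYTES) % NUM_SETS);
    a.tag    = addr / ((uint64_t)LINE_BYTES * NUM_SETS);
    return a;
}
```

A lookup then compares the tag stored in the indexed line against the tag derived from the requested address; a match means the data item is present in that cache.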
  • [0028]
    FIG. 4 illustrates a chip 450 with a processor 440 accepting data addresses 401-406 and corresponding code addresses 401A-406A. The processor 440 looks for the data items 401B-406B in the various caches 410, 420, 430, and once it finds a first data item, e.g. data item 402B (the data item available in a shortest interval), the processor 440 executes code at the corresponding code address 402A.
  • [0029]
FIG. 4 illustrates a computer chip 450 comprising at least one processor 440, said processor 440 comprising an instruction set 441 and at least one cache 410. A module 442 in said instruction set 441 accepts a plurality of data addresses 401-406 and a plurality of corresponding code addresses 401A-406A. The module 442 then finds a first available data item—here, 402B—in said at least one cache 410. The module 442 transfers control of said processor 440 to a code address—here, 402A—corresponding to said first available data item 402B.
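The behavior of module 442 can be sketched in software. A single probe pass in table order stands in for the hardware's selection of whichever data item returns first, and integer code identifiers stand in for real code addresses; all names below are illustrative, not part of the disclosure:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t data_addr; /* address of a needed data item */
    int       code_id;   /* stand-in for the paired code address */
} switch_entry_t;

/* Report whether an address is currently present in the modeled cache. */
static int is_cached(uintptr_t addr, const uintptr_t *cached, size_t n_cached)
{
    for (size_t j = 0; j < n_cached; j++)
        if (cached[j] == addr)
            return 1;
    return 0;
}

/* One probe pass over the (data address, code address) table: return
 * the code identifier paired with the first data address found in the
 * cache, or -1 when nothing is cached yet.  The hardware module would
 * stall and re-probe instead of returning -1. */
int switch_prefetch_probe(const switch_entry_t *table, size_t n,
                          const uintptr_t *cached, size_t n_cached)
{
    for (size_t i = 0; i < n; i++)
        if (is_cached(table[i].data_addr, cached, n_cached))
            return table[i].code_id;
    return -1;
}
```

Control then transfers to the handler identified by the returned code identifier, mirroring the jump to code address 402A once data item 402B is found.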
  • [0030]
It can be appreciated that computer chip 450 may comprise a plurality of processors 411-413 in addition to processor 440, and a plurality of caches 420, 430 in addition to the at least one cache 410.
  • [0031]
In another embodiment of the invention, which is also illustrated in FIG. 4, and which may be deployed independently or in conjunction with the aspects discussed above, the module 442 in said instruction set 441 accepts an acceptable interval 407 for fetching at least one of said data items 401B-406B. The module 442 returns control to said processor 440 if said at least one data item (e.g. 406B) cannot be found within said interval 407. The interval 407 may be specified by the computer program executing on processor 440, such as an operating system or an application, or, in other embodiments, may be hard-wired into the processor 440 logic itself.
  • [0032]
In FIG. 4, L2 cache 420 is illustrated with a cache line 421 in which data item 401B is located. Processor 440 may identify that data item 401B is in cache line 421 by reading cache line tag 422. Such details are familiar to those of skill in the art, and it will be appreciated that data items 401B-406B will be found in cache lines such as 421.
  • [0033]
FIG. 5 illustrates an application 550 that has some instructions 551-553 that need executing. Application 550 gives instructions 551-553 to processor 540, along with an acceptable interval 507. The acceptable interval can be passed to module 542 in instruction set 541. The processor 540 looks in caches 510, 520, 530 for the data items 501-506 it needs to execute the instructions 551-553. Processor 540 will execute instructions based on the data items that are discoverable during the acceptable interval 507.
  • [0034]
For example, consider a scenario in which instruction 551 needs data addresses 502 and 503, instruction 552 needs data address 501, and instruction 553 needs addresses 504, 505, and 506. A first acceptable interval 507 allows enough time 560 to search L1 cache 510. Processor 540 looks for addresses 501-506, and retrieves the data items at addresses 502 and 503 during the available time 560. Processor 540 then executes instruction 551, and not instructions 552 or 553.
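The selection logic of this example can be sketched as a predicate that reports whether every data item an instruction needs was retrieved within the acceptable interval; the function and its names are illustrative, with bare integers standing in for data addresses:

```c
#include <stddef.h>
#include <stdint.h>

/* An instruction is executable only if every data address it needs
 * appears among the addresses retrieved within the acceptable
 * interval; otherwise it must wait or be skipped. */
int all_retrieved(const uintptr_t *needed, size_t n_needed,
                  const uintptr_t *retrieved, size_t n_retrieved)
{
    for (size_t i = 0; i < n_needed; i++) {
        int found = 0;
        for (size_t j = 0; j < n_retrieved; j++) {
            if (retrieved[j] == needed[i]) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;
    }
    return 1;
}
```

Applied to the scenario above, the predicate holds for instruction 551 (addresses 502 and 503 were retrieved) and fails for instructions 552 and 553.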
  • [0035]
In another example, processor 540 is given an acceptable interval corresponding to an amount of time 570 sufficient to search L1 cache 510 and some or all of L2 cache 520. In such a scenario, processor 540 may go on to execute instructions 551 and 552, but not instruction 553, because instruction 553 requires data item 506, and data item 506 was not found in the acceptable interval 507 corresponding to available time 570. If the data items for instruction 551 are found first, then instruction 551 can be executed first, which may cause processor 540 to move on to other activities rather than executing instruction 552. Alternatively, instruction prioritization processes may be utilized that intelligently determine which of the instructions 551 or 552 that may possibly execute should be executed first.
  • [0036]
FIG. 6 illustrates a method for fetching data for a processor, comprising passing in a list of data addresses and corresponding instructions 601, passing in an acceptable interval 602, initiating a lookup of the listed data items 603, discovering by the processor which data item is available in a shortest interval (e.g. first address returned) 604, immediately executing at least one corresponding instruction 605, and stopping the discovery process after the acceptable interval has elapsed 606.
  • [0037]
    Steps 601 and 602 may, in one embodiment, entail the passing of a list of data addresses and code addresses, and/or an acceptable interval by a computer program such as an application or an operating system. Step 603 can entail a processor initiating a search for specified data items by, for example, issuing a command to a memory subsystem. The processor can stall while waiting for return of the specified data items. It should be noted that there are a wide variety of storage media and memory management techniques. For example, addresses may be virtual or physical memory addresses, and memory may be a cache or other memory location that is configured according to any technologies allowing for storage and retrieval of data.
  • [0038]
    Step 604 entails discovering, by a processor, which of a plurality of data items is available in a shortest interval. In one embodiment, the data item that is available in a shortest interval can be the item corresponding to the first information returned to the processor. Such a data item is available in the shortest interval by virtue of the fact that it was available faster than other data items.
  • [0039]
    The processor may immediately execute at least one instruction corresponding to at least one data item that is available in said shortest interval 605. For example, once a data item is returned to a processor, it can immediately look in the list of data addresses and corresponding instructions, and immediately execute one or more instructions corresponding to the returned data item. “Immediately executing” an instruction therefore means that the processor undertakes execution of the instruction without waiting for other data items to be returned to the processor. There may be certain necessary preliminary actions to take prior to executing an instruction, and “immediate execution” does not preclude taking such preliminary actions.
  • [0040]
If the acceptable interval has elapsed prior to finding any of the specified data items, the processor can stop waiting and move on to other tasks 606. This option may be available in some settings and not others. For example, there may be security reasons to force a processor to stall until certain instructions may be executed. If this is the case, the acceptable interval can be extended indefinitely until such instructions can be executed. Alternatively, the acceptable interval can be deactivated so that the processor temporarily functions without the acceptable interval constraint.
  • [0041]
    Some embodiments of the invention may allow for discovery of a variety of data items prior to moving to execution of corresponding instructions. In such embodiments, instructions are not executed immediately upon return of data items. Instead, the processor waits for the entire duration of a specified interval, for example, prior to moving to code execution. Instructions may next be executed on a “first available” basis or pursuant to a more intelligent prioritization scheme.
  • [0042]
One exemplary, more intelligent prioritization scheme can comprise making control flow decisions based on whether data is modified, owned exclusively, or shared with other processors, i.e., based on the state of a cache-coherency protocol. This in turn could be extended into a primitive which allows a processor to wait for the first of several memory locations to be modified by another processor, i.e., the basis of an inter-processor synchronization mechanism.
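The synchronization primitive suggested above can be sketched as a non-blocking poll that compares several memory locations against a snapshot and reports the first one another processor has modified. A hardware version would observe cache-coherency state transitions rather than polling; the names below are illustrative:

```c
#include <stddef.h>

/* Return the index of the first watched location whose current value
 * differs from the snapshot taken when the wait began, or -1 if no
 * location has been modified yet.  A caller would re-poll (or, in
 * hardware, sleep until a coherency transition) until this succeeds. */
int poll_modified(const volatile int *const *locs, const int *snapshot, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (*locs[i] != snapshot[i])
            return (int)i;
    return -1;
}
```

Looping on this poll until it returns a non-negative index yields the "wait for the first of several locations to change" behavior described in paragraph [0042].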
  • [0043]
    In another embodiment, the processor may immediately execute an instruction, and allow the memory subsystem to continue searching for information while such instruction is being executed. It may then subsequently execute other instructions corresponding to other data items in an order corresponding to duration of interval required to discover said other data items.
  • [0044]
    FIG. 7 illustrates a method for fetching data for a processor, comprising determining at least one data item that is needed by said processor to execute at least one corresponding instruction 701, determining an acceptable interval for fetching said at least one data item 702, immediately executing said at least one corresponding instruction by said processor if said at least one data item is accessible during said acceptable interval 703, and executing at least one other instruction prior to said at least one corresponding instruction if said at least one data item is not accessible during said acceptable interval 704.
  • [0045]
    The steps of determining at least one data item 701 and determining an acceptable interval 702 for fetching information may be carried out pursuant to software instructions in an application. The application may be, for example, an operating system.
  • [0046]
    Immediately executing said at least one corresponding instruction 703, once again, refers to initiating the appropriate actions needed to execute such corresponding instruction, not necessarily actually executing the instructions. In other words, the at least one corresponding instruction is executed prior to the other instructions corresponding to other data items.
  • [0047]
If said at least one data item is accessible during said acceptable interval, the corresponding instruction may be immediately executed. If not, the processor may move on to execute some other instruction 704. For example, the processor may have other work to do on behalf of the current process or some other process, and can undertake such work while a memory subsystem proceeds to attempt to locate the specified data items.
  • [0048]
    In one embodiment, said at least one corresponding instruction can comprise a plurality of corresponding instructions, said at least one data item can comprise a plurality of data items, and said plurality of corresponding instructions may be executed in an order corresponding to duration of interval required to discover said plurality of data items. Alternatively, some other intelligence may determine which instructions are executed first, and some of the instructions may not be executed at all.
  • [0049]
FIG. 8 illustrates an exemplary computing device 800 in which the various systems and methods contemplated herein may be deployed. In its most basic configuration, device 800 typically includes a processing unit 802 and memory 803. Depending on the exact configuration and type of computing device, memory 803 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination of the two. Additionally, device 800 may also have mass storage (removable 804 and/or non-removable 805) such as magnetic or optical disks or tape. Similarly, device 800 may also have input devices 807 such as a keyboard and mouse, and/or output devices 806 such as a display that presents a GUI as a graphical aid for accessing the functions of the computing device 800. Other aspects of device 800 may include communication connections 808 to other devices, computers, networks, servers, etc. using either wired or wireless media. All these devices are well known in the art and need not be discussed at length here.
  • [0050]
    The invention is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, cell phones, Personal Digital Assistants (PDA), distributed computing environments that include any of the above systems or devices, and the like.
  • [0051]
In addition to the specific implementations explicitly set forth herein, other aspects and implementations will be apparent to those skilled in the art from consideration of the specification disclosed herein. It is intended that the specification and illustrated implementations be considered as examples only, with a true scope and spirit of the invention being indicated by the following claims.

Claims (17)

  1. 1. A method for fetching data for a processor, comprising:
    discovering, by a processor, which of a plurality of data items is available in a shortest interval, wherein said data items are needed to execute a corresponding plurality of instructions; and
    immediately executing, by said processor, at least one instruction corresponding to at least one data item that is available in the shortest interval.
  2. The method of claim 1, wherein said method is performed on a computer chip comprising at least one processor and at least one cache.
  3. The method of claim 2, wherein said method is performed on a computer chip comprising a plurality of processors and a plurality of caches.
  4. The method of claim 1, wherein said discovering is conducted during an acceptable interval for fetching said data items.
  5. The method of claim 4, further comprising stopping said discovering after said acceptable interval, wherein at least one data item was not discovered during said acceptable interval.
  6. The method of claim 1, wherein the step of discovering, by a processor, is carried out pursuant to a software instruction in an application.
  7. The method of claim 6, wherein said application is an operating system.
  8. The method of claim 1, further comprising subsequently executing, by said processor, said corresponding plurality of instructions in an order corresponding to the duration of the interval required to discover said plurality of data items.
  9. A method for fetching data for a processor, comprising:
    determining at least one data item that is needed by said processor to execute at least one corresponding instruction;
    determining an acceptable interval for fetching said at least one data item;
    immediately executing said at least one corresponding instruction by said processor if said at least one data item is accessible during said acceptable interval;
    executing at least one other instruction prior to said at least one corresponding instruction if said at least one data item is not accessible during said acceptable interval.
  10. The method of claim 9, wherein said method is performed on a computer chip comprising at least one processor and at least one cache.
  11. The method of claim 9, wherein said method is performed on a computer chip comprising a plurality of processors and a plurality of caches.
  12. The method of claim 9, wherein said at least one corresponding instruction comprises a plurality of corresponding instructions, wherein said at least one data item comprises a plurality of data items, and wherein said plurality of corresponding instructions are executed in an order corresponding to the duration of the interval required to discover said plurality of data items.
  13. The method of claim 9, wherein the steps of determining at least one data item and determining an acceptable interval for fetching information are carried out pursuant to software instructions in an application.
  14. The method of claim 13, wherein said application is an operating system.
  15. A computer chip comprising at least one processor, said processor comprising:
    an instruction set;
    at least one cache;
    wherein a module in said instruction set accepts a plurality of data addresses and a plurality of corresponding code addresses, and finds a first available data item in said at least one cache, then transfers control of said processor to a code address corresponding to said first available data item.
  16. The computer chip of claim 15, wherein said computer chip comprises a plurality of processors and a plurality of caches.
  17. The computer chip of claim 15, wherein said module in said instruction set accepts an acceptable interval for fetching at least one of said data items, and returns to said processor without finding said data item if said data item cannot be found within said interval.
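As a rough illustration of the behavior recited in the claims above, the following Python sketch simulates the selection logic of the claimed "switch prefetch" primitive: given data addresses paired with corresponding code addresses, it runs first the handler whose data is already resident (claims 1 and 15), orders deferred handlers by fetch duration (claims 8 and 12), and gives up on data that cannot be found within the acceptable interval (claims 4, 9, and 17). The claims describe a hardware instruction, not this software model; every name, address, and latency below is purely hypothetical.

```python
# Hypothetical software model of the claimed "switch prefetch" selection
# logic. All names and values are illustrative, not part of the patent.

def switch_prefetch(pairs, cache, fetch_latency, max_interval=None):
    """Return (execution order of handlers, handlers skipped as unfetchable).

    pairs:         list of (data_address, handler) tuples -- the plurality of
                   data addresses and corresponding code addresses (claim 15).
    cache:         set of addresses already resident, i.e. available in the
                   shortest interval (claim 1).
    fetch_latency: dict mapping a non-resident address to the cycles needed
                   to fetch it.
    max_interval:  optional acceptable interval; data that cannot be found
                   within it is reported back without executing (claim 17).
    """
    ready, pending, skipped = [], [], []
    for addr, handler in pairs:
        if addr in cache:
            ready.append((0, addr, handler))        # execute immediately
        elif max_interval is None or fetch_latency[addr] <= max_interval:
            pending.append((fetch_latency[addr], addr, handler))
        else:
            skipped.append(handler)                 # not found within interval
    # Claims 8 and 12: deferred handlers run in order of fetch duration.
    order = [handler for _, _, handler in ready + sorted(pending)]
    return order, skipped


# Data for address 0x20 is already cached, so its handler runs first; 0x30
# cannot be fetched within the acceptable interval and is skipped.
order, skipped = switch_prefetch(
    [(0x10, "a"), (0x20, "b"), (0x30, "c")],
    cache={0x20},
    fetch_latency={0x10: 5, 0x30: 50},
    max_interval=10)
print(order, skipped)  # ['b', 'a'] ['c']
```

In hardware terms, the claimed module would perform the cache lookups in parallel and transfer control directly to the winning code address; the software model can only approximate that with a sequential scan.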
US11454245 2006-06-16 2006-06-16 Switch prefetch in a multicore computer chip Active 2026-12-23 US7502913B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11454245 US7502913B2 (en) 2006-06-16 2006-06-16 Switch prefetch in a multicore computer chip


Publications (2)

Publication Number Publication Date
US20070294516A1 (en) 2007-12-20
US7502913B2 US7502913B2 (en) 2009-03-10

Family

ID=38862874

Family Applications (1)

Application Number Title Priority Date Filing Date
US11454245 Active 2026-12-23 US7502913B2 (en) 2006-06-16 2006-06-16 Switch prefetch in a multicore computer chip

Country Status (1)

Country Link
US (1) US7502913B2 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008217623A (en) * 2007-03-07 2008-09-18 Renesas Technology Corp Data processor
US9086980B2 (en) 2012-08-01 2015-07-21 International Business Machines Corporation Data processing, method, device, and system for processing requests in a multi-core system


Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5440707A (en) 1992-04-29 1995-08-08 Sun Microsystems, Inc. Instruction and data cache with a shared TLB for split accesses and snooping in the same clock cycle
US5809275A (en) 1996-03-01 1998-09-15 Hewlett-Packard Company Store-to-load hazard resolution system and method for a processor that executes instructions out of order
US5802569A (en) 1996-04-22 1998-09-01 International Business Machines Corp. Computer system having cache prefetching amount based on CPU request types
US6128703A (en) 1997-09-05 2000-10-03 Integrated Device Technology, Inc. Method and apparatus for memory prefetch operation of volatile non-coherent data
US6446143B1 (en) 1998-11-25 2002-09-03 Compaq Information Technologies Group, L.P. Methods and apparatus for minimizing the impact of excessive instruction retrieval
EP1182559B1 (en) 2000-08-21 2009-01-21 Texas Instruments Incorporated Improved microprocessor
EP1421463B1 (en) 2001-08-29 2007-10-17 Analog Devices, Incorporated Phase locked loops fast power up methods and apparatus
US7493480B2 (en) 2002-07-18 2009-02-17 International Business Machines Corporation Method and apparatus for prefetching branch history information
US6993630B1 (en) 2002-09-26 2006-01-31 Unisys Corporation Data pre-fetch system and method for a cache memory
US7055003B2 (en) 2003-04-25 2006-05-30 International Business Machines Corporation Data cache scrub mechanism for large L2/L3 data cache structures
US7617499B2 (en) 2003-12-18 2009-11-10 International Business Machines Corporation Context switch instruction prefetching in multithreaded computer

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5627984A (en) * 1993-03-31 1997-05-06 Intel Corporation Apparatus and method for entry allocation for a buffer resource utilizing an internal two cycle pipeline
US5748937A (en) * 1993-08-26 1998-05-05 Intel Corporation Computer system that maintains processor ordering consistency by snooping an external bus for conflicts during out of order execution of memory access instructions
US5490280A (en) * 1994-03-31 1996-02-06 Intel Corporation Apparatus and method for entry allocation for a resource buffer
US5974523A (en) * 1994-08-19 1999-10-26 Intel Corporation Mechanism for efficiently overlapping multiple operand types in a microprocessor
US5689674A (en) * 1995-10-31 1997-11-18 Intel Corporation Method and apparatus for binding instructions to dispatch ports of a reservation station
US6272520B1 (en) * 1997-12-31 2001-08-07 Intel Corporation Method for detecting thread switch events
US6507862B1 (en) * 1999-05-11 2003-01-14 Sun Microsystems, Inc. Switching method in a multi-threaded processor
US20020078122A1 (en) * 1999-05-11 2002-06-20 Joy William N. Switching method in a multi-threaded processor
US20020138717A1 (en) * 1999-05-11 2002-09-26 Joy William N. Multiple-thread processor with single-thread interface shared among threads
US6341347B1 (en) * 1999-05-11 2002-01-22 Sun Microsystems, Inc. Thread switch logic in a multiple-thread processor
US6542991B1 (en) * 1999-05-11 2003-04-01 Sun Microsystems, Inc. Multiple-thread processor with single-thread interface shared among threads
US20030191927A1 (en) * 1999-05-11 2003-10-09 Sun Microsystems, Inc. Multiple-thread processor with in-pipeline, thread selectable storage
US6694347B2 (en) * 1999-05-11 2004-02-17 Sun Microsystems, Inc. Switching method in a multi-threaded processor
US20040162971A1 (en) * 1999-05-11 2004-08-19 Sun Microsystems, Inc. Switching method in a multi-threaded processor
US6801997B2 (en) * 1999-05-11 2004-10-05 Sun Microsystems, Inc. Multiple-thread processor with single-thread interface shared among threads
US20050210204A1 (en) * 2003-01-27 2005-09-22 Fujitsu Limited Memory control device, data cache control device, central processing device, storage device control method, data cache control method, and cache control method
US20050125802A1 (en) * 2003-12-05 2005-06-09 Wang Perry H. User-programmable low-overhead multithreading
US20060026594A1 (en) * 2004-07-29 2006-02-02 Fujitsu Limited Multithread processor and thread switching control method
US7310705B2 (en) * 2004-07-29 2007-12-18 Fujitsu Limited Multithread processor and thread switching control method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124805A1 (en) * 2011-11-10 2013-05-16 Advanced Micro Devices, Inc. Apparatus and method for servicing latency-sensitive memory requests
EP2791933A4 (en) * 2011-12-13 2015-08-05 Ati Technologies Ulc Mechanism for using a gpu controller for preloading caches
US9239793B2 (en) 2011-12-13 2016-01-19 Ati Technologies Ulc Mechanism for using a GPU controller for preloading caches

Also Published As

Publication number Publication date Type
US7502913B2 (en) 2009-03-10 grant

Similar Documents

Publication Publication Date Title
US7290116B1 (en) Level 2 cache index hashing to avoid hot spots
US7493452B2 (en) Method to efficiently prefetch and batch compiler-assisted software cache accesses
US20040268044A1 (en) Multiprocessor system with dynamic cache coherency regions
US6711651B1 (en) Method and apparatus for history-based movement of shared-data in coherent cache memories of a multiprocessor system using push prefetching
US20100293420A1 (en) Cache coherent support for flash in a memory hierarchy
US20060168583A1 (en) Systems and methods for TDM multithreading
US20140032845A1 (en) Systems and methods for supporting a plurality of load accesses of a cache in a single cycle
US20120017039A1 (en) Caching using virtual memory
US20090106507A1 (en) Memory System and Method for Using a Memory System with Virtual Address Translation Capabilities
US20090172292A1 (en) Accelerating software lookups by using buffered or ephemeral stores
US20100268790A1 (en) Complex Remote Update Programming Idiom Accelerator
US20070226424A1 (en) Low-cost cache coherency for accelerators
US20070266206A1 (en) Scatter-gather intelligent memory architecture for unstructured streaming data on multiprocessor systems
US20140032856A1 (en) Systems and methods for maintaining the coherency of a store coalescing cache and a load cache
US20110209155A1 (en) Speculative thread execution with hardware transactional memory
US20110167222A1 (en) Unbounded transactional memory system and method
US20110173632A1 (en) Hardware Wake-and-Go Mechanism with Look-Ahead Polling
US20070294693A1 (en) Scheduling thread execution among a plurality of processors based on evaluation of memory access data
US20090199184A1 (en) Wake-and-Go Mechanism With Software Save of Thread State
US8127080B2 (en) Wake-and-go mechanism with system address bus transaction master
US8015379B2 (en) Wake-and-go mechanism with exclusive system bus response
US20090199029A1 (en) Wake-and-Go Mechanism with Data Monitoring
US20160041913A1 (en) Systems and methods for supporting a plurality of load and store accesses of a cache
US20090199028A1 (en) Wake-and-Go Mechanism with Data Exclusivity
US20110173593A1 (en) Compiler Providing Idiom to Idiom Accelerator

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BARHAM, PAUL R.;REEL/FRAME:017994/0298

Effective date: 20060619

FPAY Fee payment

Year of fee payment: 4

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034542/0001

Effective date: 20141014

FPAY Fee payment

Year of fee payment: 8