TWI506549B - System and method for out-of-order prefetch instructions in an in-order pipeline - Google Patents
- Publication number
- TWI506549B (application TW101142862A)
- Authority
- TW
- Taiwan
- Prior art keywords
- pipeline
- data
- instruction
- sequential
- type
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1027—Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1009—Address translation using page tables, e.g. page table structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/65—Details of virtual memory and virtual address translation
- G06F2212/654—Look-ahead translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/68—Details of translation look-aside buffer [TLB]
- G06F2212/684—TLB miss handling
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Advance Control (AREA)
Description
Embodiments of the present invention relate to out-of-order prefetch instructions in the in-order pipeline of a processor architecture.
Processor performance has long improved faster than memory performance. This growing gap between processor and memory performance means that most processors today spend a great deal of time waiting for data. Modern processors typically have several levels of on-chip, and possibly off-chip, cache memory. These caches help shorten data-access time by keeping frequently accessed lines in closer, faster caches. Data prefetching is the practice of moving data from a slower level of the cache/memory hierarchy to a faster level before the software needs it. Data prefetching can be done by software; it can also be done by hardware. Both software and hardware techniques have performance limitations.
Apparatus, systems, and methods are described that provide a processor architecture with out-of-order prefetch instructions for an in-order pipeline. In one embodiment, a system is implemented that includes hardware (e.g., a data prefetch queue) together with software data prefetching. In this system, the characteristics of the overall microarchitecture, of the instruction set architecture, and of the software base inform the design, selection, and synthesis of the various data prefetch techniques and features.
An in-order pipeline executes instructions in program order, whereas an out-of-order pipeline allows most instructions, including explicit data prefetch instructions, to execute out of order. One disadvantage of an in-order pipeline is that when the resources required to execute a particular instruction are not immediately available, the pipeline (and therefore that instruction and all subsequent instructions) stalls to wait for the resources. These stalls can even be caused by explicit data prefetch instructions. The disadvantage of an out-of-order pipeline is that the machinery required for fully out-of-order execution is expensive. Embodiments of the present invention eliminate some of the stalls that can be triggered by explicit data prefetch instructions waiting for unavailable resources. The processor architecture described herein for an in-order pipeline is much less expensive than what an out-of-order pipeline requires.
Embodiments of the present invention provide the ability to set aside certain explicit data prefetch instructions that cannot execute because of unavailable resources, without stalling subsequent instructions. Subsequent instructions therefore effectively execute out of order with respect to the data prefetch. After the address of the data prefetch instruction has been read from its source register, the data prefetch leaves the main pipeline's dispatch and enters the data prefetch queue. For example, an ALU pipeline reads the address of the data prefetch instruction before sending it to the data prefetch queue. While the data prefetch waits in the data prefetch queue for the resources it needs to execute, subsequent instructions can continue to execute.
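The benefit described above can be illustrated with a small behavioral sketch. This is not the patented hardware: the cycle accounting, the `BUSY_UNTIL` resource model, and all function names are illustrative assumptions, chosen only to show how parking a prefetch in a side queue keeps later instructions from stalling behind it.

```python
# Toy cycle-accurate comparison: a blocking in-order pipeline vs. one that
# parks a stuck prefetch in a side queue (all names are illustrative).
from collections import deque

BUSY_UNTIL = 5  # cycle at which the resource the prefetch needs frees up


def run_blocking(program):
    """In-order pipeline: a prefetch that cannot execute stalls everything."""
    cycle = 0
    finish = {}
    for op in program:
        if op.startswith("lfetch") and cycle < BUSY_UNTIL:
            cycle = BUSY_UNTIL          # stall until the resource is free
        finish[op] = cycle
        cycle += 1
    return finish


def run_with_dpq(program):
    """The prefetch is parked in a queue; later instructions keep flowing."""
    cycle = 0
    finish = {}
    dpq = deque()
    for op in program:
        if op.startswith("lfetch") and cycle < BUSY_UNTIL:
            dpq.append(op)              # park it instead of stalling
        else:
            finish[op] = cycle
        cycle += 1
    while dpq:                          # drain once the resource is free
        cycle = max(cycle, BUSY_UNTIL)
        finish[dpq.popleft()] = cycle
        cycle += 1
    return finish


prog = ["lfetch r1", "add r2", "add r3"]
blocking = run_blocking(prog)
queued = run_with_dpq(prog)
```

In the blocking model the two `add` instructions inherit the prefetch's wait; with the queue they complete cycles earlier, while the prefetch itself still executes as soon as its resource is available.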
In the following description, numerous specific details are set forth in order to provide a more thorough understanding, such as logic implementations, sizes and names of signals and buses, types and interrelationships of system components, and logic partitioning/integration choices. It will be appreciated, however, that those skilled in the art can practice embodiments of the invention without these specific details. In other instances, control structures and gate-level circuits are not shown in detail to avoid obscuring embodiments of the invention. With the description herein, those of ordinary skill in the art will be able to implement appropriate logic circuits without undue experimentation.
Certain terms are used in the following description to describe features of embodiments of the invention. For example, "logic" refers to hardware and/or software configured to perform one or more functions. Examples of "hardware" include, but are not limited to, integrated circuits, finite state machines, or even combinational logic. An integrated circuit can take the form of a processor, such as a microprocessor, an application-specific integrated circuit, a digital signal processor, a microcontroller, and the like. The interconnections between chips can each be point-to-point, can each be multi-drop arrangements, or some can be point-to-point while others are multi-drop.
FIG. 1 illustrates a flow diagram of a computer-implemented method 100 for providing out-of-order prefetch instructions for an in-order pipeline in accordance with one embodiment. Method 100 is performed by processing logic, which can comprise hardware (circuitry, dedicated logic, etc.), software (such as that run on a general-purpose computer system or a dedicated machine or device), or a combination of both. In one embodiment, method 100 is performed by processing logic associated with the architectures discussed herein.
At block 102, processing logic determines, based on one or more factors (e.g., the availability of one or more issue slots of a second-type in-order pipeline, or the priority of the data prefetch instruction), whether to issue a data prefetch instruction (e.g., lfetch) to a first-type in-order pipeline (e.g., an arithmetic logic unit (ALU) pipeline or an integer pipeline) or to a second-type in-order pipeline (e.g., a memory pipeline). For example, through certain instruction-bundle encodings it is possible for software to force an lfetch down the second-type pipeline ahead of another instruction that needs the same pipeline. The lfetch can be lowest priority or highest priority; a software scheduler can make this decision. The available issue slots of the second-type in-order pipeline may be limited (e.g., two per clock cycle). At block 104, the first-type in-order pipeline receives the data prefetch instruction according to the one or more factors and the software scheduler's decision. At block 106, the first-type in-order pipeline reads the address register of the data prefetch instruction and sends the data prefetch instruction to a data prefetch queue. At block 108, the data prefetch queue issues the data prefetch instruction to the second-type in-order pipeline when at least one execution slot of that pipeline is available, or by taking priority over other instructions that want to use the second-type pipeline. Taking priority over another instruction avoids the data prefetch queue overflowing its capacity and dropping a data prefetch instruction (e.g., an lfetch). The lfetch is then issued from the data prefetch queue to the second-type pipeline while the pipeline is stalled or replaying. At block 110, the second-type in-order pipeline also receives other instructions (e.g., loads, stores) through its issue slots.
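As a hedged sketch of the decision at block 102 (not the patent's actual scheduler), the following models steering an lfetch toward an A pipeline when the assumed two M-pipeline issue slots per cycle are already claimed by loads and stores. `M_SLOTS_PER_CYCLE`, the op-name prefixes, and `schedule_cycle` are all illustrative assumptions.

```python
# Illustrative issue-steering policy for one bundle of instructions.
M_SLOTS_PER_CYCLE = 2  # assumed issue-slot budget of the memory pipelines


def schedule_cycle(bundle):
    """Assign each op in one issue bundle to an 'M' or 'A' pipeline.

    Memory ops (load/store) must use M slots; an lfetch prefers an M slot
    but falls back to an A pipeline (and thence the data prefetch queue)
    when the M slots are taken, so it never stalls the memory ops.
    """
    m_used = 0
    assignment = {}
    # Demand memory ops claim M slots first: lfetch is lowest priority here.
    for op in bundle:
        if op.startswith(("load", "store")) and m_used < M_SLOTS_PER_CYCLE:
            assignment[op] = "M"
            m_used += 1
    for op in bundle:
        if op in assignment:
            continue
        if op.startswith("lfetch"):
            if m_used < M_SLOTS_PER_CYCLE:
                assignment[op] = "M"
                m_used += 1
            else:
                assignment[op] = "A"  # read its address on A, then enter the DPQ
        else:
            assignment[op] = "A"
    return assignment
```

With a full bundle such as `["load r1", "store r2", "lfetch r3"]` the lfetch lands on an A pipeline; with a spare M slot it issues to M directly.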
In one embodiment, the first-type in-order pipeline is an arithmetic logic unit (ALU) pipeline that receives ALU instructions and data prefetch instructions, and the second-type in-order pipeline is a memory pipeline.
FIG. 2 illustrates a block diagram of a processor architecture in accordance with one embodiment. Processor architecture 200 includes an in-order pipeline 220, and an optional in-order pipeline 221, for receiving data prefetch instructions and other instructions. In-order pipelines 220 and 221 can be arithmetic logic unit (ALU) pipelines for receiving ALU instructions and data prefetch instructions (e.g., lfetch-on-A). Alternatively, at least one of pipelines 220 and 221 can be an integer pipeline for receiving integer instructions and data prefetch instructions. The individual pipelines 220 and 221 can act together to form a single, multiple-instruction-wide in-order pipeline. In other words, instruction order is maintained both across pipelines and within each pipeline.
Processor architecture 200 further includes second-type in-order pipelines 230 and 231 having issue slots that receive other instructions via multiplexers 218 and 219. A slot is an entry in a pipeline that can hold an operation. In an embodiment, the architecture includes at least one of pipelines 230 and 231. Processor architecture 200 includes a translation lookaside buffer (TLB) 240 having a number of ports for mapping virtual addresses to physical addresses. A port is an input to a large structure, such as an array, that can accept an operation. TLB 240 and TLB 241 can be located in pipelines 230 and 231, respectively. When the virtual address associated with a data prefetch instruction is not found in TLB 240 or 241 (e.g., a TLB-miss lfetch), data prefetch queue 210 receives the data prefetch instruction. A hardware page walker 250 accesses ("walks") the page tables in memory by issuing special load instructions down the memory pipeline. When the translation for a data prefetch instruction is not found in TLB 240 or 241, a hardware page walk is initiated. The hardware page walker receives hardware page walks via multiplexer 252 and includes some buffering so that it can handle multiple simultaneous hardware page walks without stalling the pipeline.
Data prefetch queue 210 issues the data prefetch instruction to at least one of the second-type in-order pipelines 230 and 231 when at least one execution slot of those pipelines is available, or by taking priority over other instructions that want to use the second-type pipeline. A data prefetch instruction can be issued if no hardware page walks are outstanding. This design does not always wait for all hardware page walks to complete before issuing a data prefetch instruction. For example, in an embodiment, only those data prefetches that were inserted into the data prefetch queue because of a TLB miss wait for all hardware page walks to complete before being issued. Hardware page walker 250 can insert the corresponding translation into the appropriate TLB for each data prefetch instruction or failed hardware page walk. If, after the hardware page walk, the translation is still not in the TLB the second time around, the data prefetch instruction is discarded. When multiple data prefetch instructions that reach the same page are not found in their respective TLBs, the multiple hardware page walks can be merged into a single page walk.
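The page-walk merging described above can be sketched as follows. This is an illustration, not the patented logic: the dict-based TLB, `PAGE_SIZE`, the walk-tracking list, and the status strings are all assumptions made for the demo.

```python
# Toy model: prefetches that miss the TLB on the same page share one walk.
PAGE_SIZE = 4096
tlb = {}           # virtual page number -> physical frame (assumed structure)
walks_issued = []  # pages the hardware page walker is currently translating


def prefetch(vaddr):
    """Return 'hit', 'queued' (new walk started), or 'merged' (walk pending)."""
    vpn = vaddr // PAGE_SIZE
    if vpn in tlb:
        return "hit"
    if vpn in walks_issued:
        return "merged"          # coalesce with the outstanding page walk
    walks_issued.append(vpn)     # at most one walk per missing page
    return "queued"


def walk_complete(vpn, pfn):
    """Hardware page walker inserts the finished translation into the TLB."""
    walks_issued.remove(vpn)
    tlb[vpn] = pfn
```

Two prefetches to addresses on the same 4 KiB page trigger only one walk; once the walk completes, later prefetches to that page hit the TLB directly.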
The second-type in-order pipelines can continue to execute while multiple hardware page walks are in progress.
The processor architecture of this design adds several data prefetch features (for example, sending lfetch instructions to the first-type pipeline, and the non-blocking lfetch described below). The resulting microarchitecture enables all of these prefetch mechanisms at minimal cost and complexity, and also makes it easy to enable additional prefetch mechanisms later.
FIG. 3 illustrates a processor architecture 300 having a data prefetch queue (DPQ) 310 in accordance with an embodiment. DPQ 310 can be a first-in, first-out (FIFO) structure that temporarily stores prefetch requests from some or all of the software and hardware prefetch sources described herein. This structure allows short bursts of prefetches to be accepted without back-pressuring the pipeline. FIG. 3 shows how the prefetch system 302, which includes DPQ 310, engine 314, MLD prefetcher 360, and multiplexers 311, 312, 318, and 319, connects to the existing pipelines 320, 321, 330, and 331, and how data prefetch queue 310 is the central hub of the prefetch system 302. Prefetches for the mid-level data cache (MLD) 370 can come from MLD prefetcher block 360. Lfetch instructions from the lfetch-on-A pipeline feature can come from one of the first-type in-order pipelines 320 and 321 (e.g., an A pipeline). Prefetches associated with the non-blocking data TLB or the first-level data cache (FLD) hardware prefetch features can come from one of the second-type in-order pipelines 330 and 331 (e.g., an M pipeline). The DPQ then inserts prefetches into either M pipeline during cycles in which the main-pipeline instruction buffer logic (IBL) 302 does not issue an instruction into that M pipeline. Sometimes, to avoid dropping an lfetch instruction, the DPQ takes priority over other M-pipeline instructions waiting to be issued from the main pipeline's instruction buffer.
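The slot-stealing behavior just described can be modeled in a few lines. As before this is a hedged sketch with assumed names, not the patent's implementation: on any cycle where the instruction buffer issues nothing into the M pipeline, the DPQ fills the otherwise-empty slot with a queued prefetch.

```python
# Toy model of DPQ slot stealing into an M pipeline.
from collections import deque


def drain_into_m_pipe(ibl_issue, dpq_entries):
    """ibl_issue: per-cycle op the IBL issues into the M pipeline (or None).

    Returns what actually entered the M pipeline each cycle: demand
    instructions keep their slots, prefetches fill the empty ones.
    """
    dpq = deque(dpq_entries)
    m_pipe = []
    for op in ibl_issue:
        if op is not None:
            m_pipe.append(op)             # demand instruction owns the slot
        elif dpq:
            m_pipe.append(dpq.popleft())  # steal the empty slot for a prefetch
        else:
            m_pipe.append(None)           # a true pipeline bubble
    return m_pipe
```

Prefetches are injected only into bubbles, so the demand instruction stream is never delayed by them in this model.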
In one embodiment, the DPQ is an eight-entry FIFO. Each prefetch request occupies exactly one entry in the DPQ, even if it will eventually be expanded into several individual prefetches. When a prefetch request reaches the head of the FIFO, it is moved into an expansion engine (EE) 314. EE 314 expands the prefetch request from the DPQ into a group of individual prefetches and then injects those individual prefetches, in order, into the M pipeline. The EE also allows individual prefetches to be issued concurrently across the opposite M pipeline, so that unused pipeline slots are used as efficiently as possible. As illustrated in FIG. 3, the DPQ can have two write ports. The first port 316 can take writes from pipeline 330 or pipeline 320, and the second port 317 can take writes from pipeline 331, pipeline 321, or the MLD prefetcher. The DPQ can accept one prefetch request per port per cycle. An lfetch on the A port is always inserted into the DPQ; an lfetch on the M port needs to be inserted into the DPQ only if it misses the data TLB. If two DPQ insertion requests arrive at a single DPQ port at the same time, only the insertion from the A port takes place. The output of MLD hardware prefetch block 360 includes a small FIFO queue (Q) that allows its requests to be buffered and inserted into the DPQ later if they conflict with other prefetch requests. Within the DPQ, all types of prefetches are kept in order, but lfetch instructions are given greater importance than hardware-initiated prefetches. For example, if an lfetch has waited too long in the expansion engine without finding an unused pipeline slot to use, it triggers a pipeline bubble to force an empty slot; if a hardware prefetch waits too long, however, it may be dropped. In addition, if the DPQ begins to fill up, waiting hardware prefetches may be deleted to make more room for new lfetches. The DPQ provides an efficient, centralized, shareable resource for handling prefetches from a variety of sources.
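The capacity and priority rules above can be sketched as a small class. The class name, entry tuples, and status strings are assumptions for illustration; the policy shown (an incoming lfetch may evict a waiting hardware prefetch when the eight-entry queue is full, while a hardware prefetch arriving at a full queue is simply dropped) follows the text.

```python
# Toy model of the DPQ insertion policy: 8 entries, lfetch over hardware.
from collections import deque

DPQ_CAPACITY = 8


class DataPrefetchQueue:
    def __init__(self):
        self.q = deque()

    def insert(self, kind, tag):
        """kind is 'lfetch' (architectural) or 'hw' (hardware-initiated)."""
        if len(self.q) < DPQ_CAPACITY:
            self.q.append((kind, tag))
            return "accepted"
        if kind == "lfetch":
            # Make room by deleting a waiting hardware prefetch, if any.
            for entry in self.q:
                if entry[0] == "hw":
                    self.q.remove(entry)
                    self.q.append((kind, tag))
                    return "accepted"
            return "dropped"   # full of lfetches: nothing evictable
        return "dropped"       # a hardware prefetch never waits when full
```

In a full queue holding any hardware-initiated entries, a new lfetch still gets in, while a new hardware prefetch does not.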
FIG. 4 illustrates a block diagram of a system 1300 in accordance with an embodiment. System 1300 includes one or more processors 1310 and 1315, which are coupled to a graphics memory controller hub (GMCH) 1320. The optional nature of additional processor 1315 is denoted in FIG. 4 with dashed lines. One or more of processors 1310 and 1315 include, in part, the processor architectures discussed above (e.g., 200, 300). In an embodiment, the architecture includes a first-type in-order pipeline 220 and an optional second pipeline 221. These pipelines (e.g., ALU pipelines) can receive ALU instructions and data prefetch instructions, and they receive at least one data prefetch instruction from instruction buffer logic (IBL) 202. Second-type in-order pipelines 230 and 231 (e.g., memory pipelines) have issue slots and execution slots, and receive other instructions from IBL 202 through the issue slots. Data prefetch queue 210 receives at least one data prefetch instruction from one or both of pipelines 220 and 221. When at least one execution slot of pipelines 230 and 231 is available, data prefetch queue 210 issues at least one data prefetch instruction to at least one of the second-type in-order pipelines 230 and 231. The system further includes one or more execution units 232 and 234 for executing instructions associated with the execution slots of second-type in-order pipelines 230 and 231. The execution units can be located in in-order pipelines 230 and 231 or otherwise be associated with pipelines 230 and 231. A software scheduler determines, based on the availability of one or more issue slots of the second-type in-order pipelines, whether to send at least one data prefetch instruction to the first-type in-order pipelines (e.g., 220, 221) or to the second-type in-order pipelines (e.g., 230, 231). In an embodiment, the first-type in-order pipeline is an integer pipeline for receiving integer instructions and data prefetch instructions. System 1300 further includes a memory 1340 coupled to the one or more processing units. The one or more execution units of the second-type in-order pipelines send data associated with the executed instructions to the memory.
FIG. 4 illustrates that GMCH 1320 is coupled to a memory 1340, which can be, for example, a dynamic random access memory (DRAM). In at least one embodiment, the DRAM is associated with a non-volatile cache.
GMCH 1320 can be a chipset, or a portion of a chipset. GMCH 1320 can communicate with processors 1310 and 1315 and control interaction between processors 1310 and 1315 and memory 1340. GMCH 1320 also acts as an accelerated bus interface between processors 1310 and 1315 and the other elements of system 1300. For at least one embodiment, GMCH 1320 communicates with processors 1310 and 1315 via a multi-drop bus, such as a front side bus (FSB) 1395.
Furthermore, GMCH 1320 is coupled to a display 1345 (such as a flat panel display). GMCH 1320 can include an integrated graphics accelerator. GMCH 1320 is further coupled to an input/output (I/O) controller hub (ICH) 1350, which is used to couple various peripheral devices to system 1300. Shown for example in the embodiment of FIG. 4 is an external graphics device 1360, which can be a discrete graphics device coupled to ICH 1350, along with another peripheral device 1370.
Additional or different processors can also be present in system 1300. For example, additional processor 1315 can include an additional processor that is the same as processor 1310, an additional processor that is heterogeneous or asymmetric to processor 1310, an accelerator (such as, e.g., a graphics accelerator or a digital signal processing (DSP) unit), a field programmable gate array, or any other processor. There can be a variety of differences between the physical resources 1310 and 1315 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power-consumption characteristics. These differences can effectively manifest themselves as asymmetry and heterogeneity among the processing elements 1310 and 1315. For at least one embodiment, the various processing elements 1310 and 1315 can reside in the same chip package. While software (e.g., a software scheduler) is being executed by processing elements 1310 and 1315, it can also reside wholly or at least partially within processing elements 1310 and 1315. Processing elements 1310 and 1315, together with processor architecture 200, also constitute machine-accessible storage media.
Referring now to FIG. 5, shown is a block diagram of a second system 1400 in accordance with an embodiment of the present invention. As shown in FIG. 5, the multiprocessor system 1400 is a point-to-point interconnect system and includes a first processor 1470 and a second processor 1480 coupled via a point-to-point interconnect 1450. As shown in FIG. 5, each of the processors 1470, 1480 includes a processor architecture described herein (e.g., 200, 300). Software (e.g., a software scheduler), while being executed by a processor, may also reside, completely or at least partially, within that processor, and the processors likewise constitute machine-accessible storage media. Alternatively, one or more of the processors 1470, 1480 may be an element other than a processor, such as an accelerator or a field programmable gate array. While shown with only two processors 1470, 1480, it is to be understood that the scope of embodiments of the present invention is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor.
The processor 1470 may further include an integrated memory controller hub (IMC) 1472 and point-to-point (P-P) interfaces 1476 and 1478. Similarly, the second processor 1480 may include an IMC 1482 and P-P interfaces 1486 and 1488. The processors 1470, 1480 exchange data via a point-to-point (PtP) interface 1450 using PtP interface circuits 1478, 1488. As shown in FIG. 5, the IMCs 1472, 1482 couple the processors to respective memories, namely a memory 1442 and a memory 1444, which may be portions of main memory locally attached to the respective processors.
The processors 1470, 1480 each exchange data with a chipset 1490 via individual P-P interfaces 1452, 1454 using point-to-point interface circuits 1476, 1494, 1486, 1498. The chipset 1490 also exchanges data with a high-performance graphics circuit 1438 via a high-performance graphics interface 1439.
A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
The chipset 1490 is coupled to a first bus 1416 via an interface 1496. In one embodiment, the first bus 1416 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of embodiments of the present invention is not so limited.
As shown in FIG. 5, various input/output (I/O) devices 1414 may be coupled to the first bus 1416, along with a bus bridge 1418 that couples the first bus 1416 to a second bus 1420. In one embodiment, the second bus 1420 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1420 including, for example, a keyboard/mouse 1422, communication devices 1426, and a data storage unit 1428 such as a disk drive or other mass storage device that may include code 1430. Further, an audio I/O 1424 may be coupled to the second bus 1420. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 5, a system may implement a multi-drop bus or another such architecture.
Referring now to FIG. 6, shown is a block diagram of a third system 1500 in accordance with an embodiment of the present invention. Like elements in FIGS. 5 and 6 bear like reference numerals, and certain aspects of FIG. 5 have been omitted from FIG. 6 in order to avoid obscuring other aspects of FIG. 6.
The processing elements 1470, 1480 illustrated in FIG. 6 may each include a processor architecture (e.g., 200, 300) and integrated memory and I/O control logic ("CL") 1472 and 1482, respectively. For at least one embodiment, the CL 1472, 1482 may include memory controller hub logic (IMC) such as that described above in connection with FIGS. 4 and 5. In addition, the CL 1472, 1482 may also include I/O control logic. FIG. 6 illustrates not only that the memories 1442, 1444 are coupled to the CL 1472, 1482, but also that I/O devices 1514 are coupled to the control logic 1472, 1482. Legacy I/O devices 1515 are coupled to the chipset 1490.
FIG. 7 is a functional block diagram illustrating a system 700 implemented in accordance with an embodiment. The illustrated embodiment of the processing system 700 includes one or more processors (or central processing units) 705 having a processor architecture 790 (e.g., processor architecture 200, processor architecture 300), system memory 710, non-volatile ("NV") memory 715, a data storage unit ("DSU") 720, a communication link 725, and a chipset 730. The illustrated processing system 700 may represent any computing system, including a desktop computer, a notebook computer, a workstation, a handheld computer, a server, a blade server, or the like.
The elements of the processing system 700 are interconnected as follows. The processor(s) 705 is communicatively coupled to the system memory 710, the NV memory 715, the DSU 720, and the communication link 725, via the chipset 730, to send and receive instructions or data thereto/therefrom. In one embodiment, the NV memory 715 is a flash memory device. In other embodiments, the NV memory 715 includes any one of read only memory (ROM), programmable ROM, erasable programmable ROM, electrically erasable programmable ROM, or the like. In one embodiment, the system memory 710 includes random access memory (RAM), such as dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), static RAM (SRAM), and the like. The DSU 720 represents any storage device for software data, applications, and/or operating systems, but will most typically be a non-volatile storage device. The DSU 720 may optionally include one or more of an integrated drive electronics (IDE) hard disk, an enhanced IDE (EIDE) hard disk, a redundant array of independent disks (RAID), a small computer system interface (SCSI) hard disk, and the like. Although the DSU 720 is illustrated as internal to the processing system 700, the DSU 720 may be externally coupled to the processing system 700. The communication link 725 couples the processing system 700 to a network such that the processing system 700 may communicate over the network with one or more other computers. The communication link 725 may include a modem, an Ethernet card, a Gigabit Ethernet card, a Universal Serial Bus (USB) port, a wireless network interface card, a fiber optic interface, or the like.
The DSU 720 may include a machine-accessible medium 707 on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or functions described herein. The software (e.g., a software scheduler) may also reside, completely or at least partially, within the processor 705 during execution thereof by the processor 705, the processor 705 also constituting a machine-accessible storage medium.
While the machine-accessible medium 707 is shown in an exemplary embodiment to be a single medium, the term "machine-accessible medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-accessible medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by a machine and that causes the machine to perform any one or more of the methodologies of embodiments of the present invention. The term "machine-accessible medium" shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media.
Thus, a machine-accessible medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, a network device, a personal digital assistant, a manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.), as well as electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
As illustrated in FIG. 7, each subcomponent of the processing system 700 includes input/output (I/O) circuitry 750 for communicating with the others. The I/O circuitry 750 may include impedance matching circuitry that may be adjusted to achieve a desired input impedance, thereby reducing signal reflections and interference between the subcomponents. In one embodiment, the PLL architecture 790 (e.g., PLL architecture 100) may be included in various digital systems. For example, the PLL architecture 790 may be included in the processor 705 and/or communicatively coupled to the processor to provide a flexible clock source. The clock source may be provided to state elements of the processor 705.
It should be appreciated that various other elements of the processing system 700 have been excluded from FIG. 7 and this discussion for the purposes of clarity. For example, the processing system 700 may further include a graphics card, additional DSUs, other persistent data storage devices, and the like. The chipset 730 may also include a system bus and various other data buses for interconnecting the subcomponents, such as a memory controller hub and an input/output (I/O) controller hub, as well as data buses (e.g., a peripheral component interconnect bus) for connecting peripheral devices to the chipset 730. Correspondingly, the processing system 700 may operate without one or more of the elements illustrated. For example, the processing system 700 need not include the DSU 720.
The processor design described herein includes an aggressive new microarchitecture design. In a particular embodiment, the design contains 8 multithreaded cores on a single piece of silicon and can issue up to 12 instructions per cycle to the execution pipelines. The 12 pipelines may include 2 M pipelines (memory), 2 A pipelines (ALU), 2 I pipelines (integer), 2 F pipelines (floating point), 3 B pipelines (branch), and 1 N pipeline (NOP). The number of M pipelines is reduced from 4 in previous Itanium® processors to 2. As in previous Itanium® processor designs, instructions are issued and retired in order. Memory operations detect any faults before retirement, but they may retire before the memory operation completes. Instructions that use the target register of a load delay their execution until the load completes. Memory instructions that use the memory result of a store may retire before the store completes. The cache hierarchy guarantees that such memory operations will complete in the proper order.
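The in-order issue behavior described above, where a consumer of a load's target register stalls until the load data returns and every later instruction waits behind it, can be sketched with a toy scoreboard model. This is a simplified illustration only, not the actual hardware; the instruction encoding and the 3-cycle load latency are assumptions chosen for the example.

```python
def schedule(instrs, load_latency=3):
    """Toy in-order scoreboard: a load issues and retires without waiting,
    but an instruction that reads the load's target register stalls until
    the data returns, and all later instructions stall behind it."""
    ready_at = {}        # register -> cycle its value becomes available
    cycle = 0
    issue_cycles = []
    for op, dst, srcs in instrs:
        start = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        issue_cycles.append(start)
        if op == "load":
            ready_at[dst] = start + load_latency
        elif dst is not None:
            ready_at[dst] = start + 1
        cycle = start + 1  # in order: the next instruction cannot issue earlier
    return issue_cycles

# The add that uses r1 waits for the load; the independent add is delayed too,
# because the pipeline is in order.
prog = [("load", "r1", []), ("add", "r2", ["r1"]), ("add", "r4", ["r3"])]
assert schedule(prog) == [0, 3, 4]
```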
The data cache hierarchy may consist of the following cache levels:
16 KB first-level data cache (FLD - private per core)
256 KB mid-level data cache (MLD - private per core)
32 MB last-level instruction and data cache (LLC - shared by all 8 cores)
The LLC is inclusive of all other caches. All 8 cores may share the LLC. The MLD and FLD are private to a single core. Threads on a given core share all levels of cache. All of the data caches may have 64-byte cache lines. To emulate the performance of the 128-byte cache lines of previous Itanium® processors, an MLD miss typically triggers the fetch of two 64-byte lines that together form an aligned 128-byte block. This last feature is called MLD pair-line prefetching. The processor architecture (e.g., the Itanium® architecture) defines the lfetch instruction, with variants that either fault or do not fault on erroneous addresses, which software can use to prefetch data into the various cache levels. The lfetch instruction requires no architectural ordering with respect to other memory operations.
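The MLD pair-line behavior above can be made concrete with a short address calculation: a miss anywhere within an aligned 128-byte block pulls in both 64-byte halves. This is a toy sketch of the addressing only (the function name is invented for the illustration), not the hardware fill logic.

```python
LINE = 64    # cache line size in bytes
BLOCK = 128  # aligned pair-line block

def pair_line_fetch(miss_addr):
    """Return the two 64-byte line addresses making up the aligned
    128-byte block that contains the missing address."""
    base = miss_addr & ~(BLOCK - 1)  # align down to a 128-byte boundary
    return [base, base + LINE]

# A demand miss on either half fetches the same aligned pair of lines.
assert pair_line_fetch(0x1000) == [0x1000, 0x1040]
assert pair_line_fetch(0x1040) == [0x1000, 0x1040]
```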
Because the Itanium® architecture's support for and focus on software optimization includes software data prefetching, software executing on the processor design described herein is more likely to contain software data prefetches than is the case on other architectures. Such software data prefetching has been very successful at improving performance. In one embodiment, exemplary software executing on this processor design would be large enterprise-class applications. These applications tend to have large cache and memory footprints and to require high memory bandwidth. Data prefetching, like all forms of speculation, can cost performance when the speculation is incorrect. Accordingly, it is important to minimize the number of useless data prefetches (data prefetches that do not eliminate a cache miss). Data prefetches consume the limited bandwidth into, out of, and between the various levels of the memory hierarchy. Data prefetches also displace other lines from the caches. Useless data prefetches consume these resources without any benefit and compromise potentially better uses of those resources. In a multithreaded, multi-core processor such as the one described above, shared resources such as communication links and cache memories are heavily utilized by non-speculative accesses. Large enterprise applications tend to stress these shared resources. In such systems, it is critical to limit the number of useless prefetches to avoid wasting resources that would otherwise be used by non-speculative accesses. Interestingly, software data prefetching techniques tend to generate fewer useless prefetches than many hardware data prefetching techniques. However, because of the dynamic nature of their inputs, hardware data prefetching techniques are able to generate useful data prefetches that software sometimes cannot identify. Software and hardware data prefetching have various other complementary strengths and weaknesses. This processor design makes software prefetching more efficient, adds conservative, complementary, high-accuracy hardware data prefetching that does not harm software data prefetching, achieves robust performance gains with broad average gains, no major losses, and few minor losses, and minimizes the design resources required.
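The notion of a useless prefetch can be quantified with a small model of a loop that prefetches a fixed distance ahead: the prefetches issued in the final iterations target data past the end of the array and never cancel a miss. This is a toy accounting sketch; the function name and parameters are invented for the illustration.

```python
def count_prefetches(n_iters, distance):
    """In a loop that, at iteration i, prefetches the data for iteration
    i + distance, the final `distance` prefetches target data past the
    end of the loop and are useless."""
    issued = n_iters                  # one software prefetch per iteration
    useless = min(distance, n_iters)  # those targeting past the end
    return issued, issued - useless, useless

issued, useful, useless = count_prefetches(1000, 8)
assert (issued, useful, useless) == (1000, 992, 8)
```

A larger prefetch distance hides more latency per useful prefetch but wastes more bandwidth at loop exit, which is why the design works to keep such wasted prefetches cheap.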
Several features of this processor design improve the efficiency of software data prefetching. These features are called lfetch-on-A and non-blocking lfetch. The hardware data prefetching features include MLD hardware prefetching and FLD hardware prefetching. A microarchitectural feature of this processor design is the data prefetch queue (DPQ), a shared resource involved in performing the data prefetches associated with all of the features described herein. Software code executing on a processor (e.g., an Itanium® processor) can be scheduled with knowledge of the types and numbers of execution units available for executing instructions in each cycle. On previous Itanium® processors, the lfetch instruction, along with all other memory operations (such as loads and stores), executed on the M pipelines. In one embodiment, as described herein, software can use at most the issue slots of the two M pipelines per cycle. Thus, the need to use an M-pipeline issue slot is an important cost associated with lfetch. Interestingly, although M-pipeline issue slots may be scarce, a large fraction of M-pipeline execution slots in a cycle go unused because of stalls and replays in this design's pipeline. This idle bandwidth cannot be used by software, because, by definition, a stall or replay of one instruction in an in-order pipeline stalls or replays all subsequent instructions. In addition to the two M pipelines, this processor architecture also has two A pipelines and two I pipelines. The A pipelines are far less critical than the M pipelines, and are far more often idle, because the ALU instructions executed by the A pipelines can also be executed by the I or M pipelines. As mentioned previously, lfetch is allowed to execute in any order relative to other memory operations. Thus, the non-faulting flavor of lfetch only needs to access its registers in order with respect to other instructions; the memory-access portion of the lfetch can be deferred.
In an effort to reduce the cost of issuing lfetch instructions, this design allows lfetch to issue to either the A pipelines or the M pipelines. When an lfetch is issued down an A pipeline, it simply reads its address register and is placed into the DPQ. Later, when an M pipeline is stalled or replaying, the lfetch can be sent from the DPQ down the M pipeline. An lfetch issued to an A pipeline has a longer latency (e.g., a minimum of +7 cycles), but it consumes only an M-pipeline execution slot rather than an M-pipeline issue slot. The software scheduler controls which pipeline an lfetch goes down; this feature therefore gives software the ability to trade lfetch latency against M-pipeline issue bandwidth.
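The lfetch-on-A mechanism can be sketched as a small cycle-by-cycle simulation: lfetches issued to an A pipeline enter the DPQ and are injected into M-pipeline execution slots whenever those slots would otherwise go idle. This is a toy model only, with invented names and a greatly simplified timing scheme, not the actual issue logic.

```python
from collections import deque

def simulate(m_ops_per_cycle, lfetches_on_a, cycles, m_slots=2):
    """Toy lfetch-on-A model: lfetches sent down the A pipeline wait in the
    DPQ and drain into M-pipeline execution slots left idle by stalls."""
    dpq = deque(lfetches_on_a)
    completed = []
    for cyc in range(cycles):
        busy = m_ops_per_cycle[cyc] if cyc < len(m_ops_per_cycle) else 0
        idle = max(0, m_slots - busy)  # unused M execution slots this cycle
        for _ in range(idle):
            if dpq:
                completed.append(dpq.popleft())
    return completed, list(dpq)

# Both M slots are busy for two cycles, so the queued lfetches wait in the
# DPQ, then drain as execution slots open up.
done, left = simulate(m_ops_per_cycle=[2, 2, 1, 0],
                      lfetches_on_a=["lf0", "lf1"], cycles=4)
assert done == ["lf0", "lf1"] and left == []
```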
A processor (e.g., an Itanium® processor) has a hardware page walker that can look up translations in the virtual hash page table (VHPT) in memory and insert them into the TLB. On previous Itanium® processors, when an lfetch missed the data TLB and initiated a hardware page walk, the pipeline stalled for the duration of the walk. The problem with this approach is that a useless lfetch can stall the pipeline for a long time. Because of the speculation inherent in the lfetch instruction, it may uselessly attempt to reference pages that are never referenced by non-speculative instructions. One example of this occurs when lfetch instructions are used in a loop to prefetch data that may be needed in later iterations of the loop. In this case, by the time the loop exits, some useless lfetch instructions will have been issued. Such instructions can easily cause useless hardware page walks and the associated long-latency pipeline stalls. Notably, always dropping lfetch instructions that miss the data TLB is not a good option either, because sometimes those prefetches are needed. An example of this is a loop that accesses data across a large address space. Such a loop needs to incur quite a few hardware page walks. If lfetch instructions were dropped when they missed the data TLB, many useful prefetches could be lost.
To address this issue and make software data prefetching more efficient, this design exploits the facts that most lfetch instructions are of the non-faulting variety and that such lfetches may execute out of order with respect to all other instructions. First, the design extends the capability of the hardware page walker to enable it to process multiple hardware page walks simultaneously. Second, the design uses the DPQ to queue lfetch instructions that miss the data TLB. Thus, in this design, an lfetch that misses the data TLB can initiate a hardware page walk and then be placed into the DPQ, to be issued again after the hardware page walk has inserted the translation into the TLB. When multiple lfetch instructions to the same page miss the data TLB, the multiple potential hardware page walks are merged into a single walk, and all of the lfetch instructions are placed into the DPQ. If the DPQ fills with lfetch instructions, it stalls the main pipeline to avoid dropping lfetches. This technique is similar to those used to make caches non-blocking; as with a non-blocking cache, non-blocking TLB accesses become blocking accesses when the queue entries are exhausted.
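The queueing-and-merging behavior above can be sketched as a small state machine: a TLB-missing lfetch starts a page walk unless one is already in flight for the same page, and a full DPQ signals a stall instead of dropping the lfetch. This is a toy illustration, with invented class and method names and an assumed 4 KB page size, not the hardware implementation.

```python
class DPQ:
    """Toy model of queuing TLB-missing lfetches: page walks to the same
    page are merged, and a full queue stalls rather than drops."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []   # lfetches waiting for their translation
        self.walks = set()  # pages with an in-flight hardware page walk

    def lfetch_tlb_miss(self, addr, page_size=4096):
        if len(self.entries) == self.capacity:
            return "stall"              # stall the pipeline, never drop
        page = addr // page_size
        self.entries.append(addr)
        if page in self.walks:
            return "merged"             # reuse the in-flight walk
        self.walks.add(page)
        return "walk"                   # start a new hardware page walk

q = DPQ(capacity=2)
assert q.lfetch_tlb_miss(0x5000) == "walk"    # first miss starts a walk
assert q.lfetch_tlb_miss(0x5040) == "merged"  # same 4K page: walks merge
assert q.lfetch_tlb_miss(0x9000) == "stall"   # DPQ full: pipeline stalls
```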
The MLD hardware prefetcher is a sequential prefetcher that moves lines into the MLD from higher-level caches or memory. It tracks the spatial locality of mid-level data cache misses and may request additional lines in the vicinity of a triggering miss. As illustrated in FIG. 3, the prefetcher tracks up to 8 miss address streams on a 4K-page basis by monitoring the MLD's accesses to the LLC cache sent over its interface to the ring 380. For each address stream, it records the most recent miss address along with the current prefetch direction and depth. For each miss within 5 cache lines of the previous miss, the prefetcher first issues, in the forward or backward direction, a number of sequential prefetches corresponding to the prefetch depth recorded in the matching history entry. It then increases the prefetch depth of that address stream, up to 4 cache lines. In essence, this prefetch algorithm dynamically adjusts the effective line size of the mid-level data cache, depending on the observed spatial locality of the cache misses. To reduce the potential negative effects of hardware-initiated prefetches, the MLD prefetcher responds only to demand load misses as triggers. Software-initiated prefetches (lfetch), store misses, and hardware-initiated prefetches are ignored. In addition, MLD prefetch requests fill the mid-level data cache in a not-most-recently-used state. Thus, a useless prefetch has a higher probability of being evicted before the other lines in the same set, while a useful prefetch will be marked most recently used on the first demand access to the line.
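The stream-tracking algorithm described above can be sketched in a few lines: each 4K page gets a history entry with the last miss address, a direction, and a depth that grows by one (up to 4) each time a nearby miss confirms the stream. This is a toy model of the published behavior, with invented names and simplifications (one stream per page, no 8-entry capacity limit), not the RTL.

```python
LINE = 64
PAGE = 4096

class StreamPrefetcher:
    """Toy model of the MLD sequential prefetcher: per-4K-page streams with
    a last-miss address, a direction, and a depth that grows up to 4."""
    def __init__(self):
        self.streams = {}  # page -> (last_miss, direction, depth)

    def demand_load_miss(self, addr):
        page = addr // PAGE
        if page not in self.streams:
            self.streams[page] = (addr, +1, 1)
            return []                           # first miss only trains
        last, direction, depth = self.streams[page]
        delta = (addr - last) // LINE
        if delta == 0 or abs(delta) > 5:
            self.streams[page] = (addr, direction, depth)
            return []                           # outside the 5-line window
        direction = 1 if delta > 0 else -1
        prefetches = [addr + direction * LINE * i for i in range(1, depth + 1)]
        self.streams[page] = (addr, direction, min(depth + 1, 4))
        return prefetches

p = StreamPrefetcher()
assert p.demand_load_miss(0x0000) == []                # trains the stream
assert p.demand_load_miss(0x0040) == [0x0080]          # depth 1, forward
assert p.demand_load_miss(0x00C0) == [0x0100, 0x0140]  # depth grew to 2
```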
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment", "one embodiment", or "an alternative embodiment" in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined as suitable in one or more embodiments.
The above detailed description of the embodiments refers to the accompanying drawings, which form a part hereof and which illustrate, by way of illustration and not of limitation, specific embodiments in which the invention may be practiced. In the drawings, like numerals describe substantially similar components throughout the several views. The illustrated embodiments are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
200‧‧‧processor architecture
202‧‧‧instruction buffer logic
220‧‧‧in-order pipeline
221‧‧‧in-order pipeline
230‧‧‧second-type in-order pipeline
231‧‧‧second-type in-order pipeline
218‧‧‧multiplexer
219‧‧‧multiplexer
240‧‧‧translation lookaside buffer
241‧‧‧translation lookaside buffer
210‧‧‧data prefetch queue
250‧‧‧hardware page walker
252‧‧‧multiplexer
300‧‧‧processor architecture
302‧‧‧instruction buffer logic
310‧‧‧data prefetch queue
311‧‧‧multiplexer
312‧‧‧multiplexer
314‧‧‧expansion engine
316‧‧‧first port
317‧‧‧second port
318‧‧‧multiplexer
319‧‧‧multiplexer
320‧‧‧pipeline
321‧‧‧pipeline
330‧‧‧pipeline
331‧‧‧pipeline
360‧‧‧mid-level data cache hardware prefetcher
370‧‧‧mid-level data cache
1300‧‧‧system
1310‧‧‧processor
1315‧‧‧processor
1320‧‧‧graphics and memory controller hub
1340‧‧‧memory
1345‧‧‧display
1350‧‧‧input/output controller hub
1360‧‧‧external graphics device
1370‧‧‧peripheral device
1395‧‧‧front side bus
1400‧‧‧multiprocessor system
1470‧‧‧first processor
1480‧‧‧second processor
1450‧‧‧point-to-point interconnect
1472‧‧‧integrated memory controller hub
1482‧‧‧integrated memory controller hub
1476‧‧‧point-to-point interface
1478‧‧‧point-to-point interface
1486‧‧‧point-to-point interface
1488‧‧‧point-to-point interface
1450‧‧‧point-to-point interface
1442‧‧‧memory
1444‧‧‧memory
1494‧‧‧point-to-point interface
1498‧‧‧point-to-point interface
1452‧‧‧point-to-point interface
1454‧‧‧point-to-point interface
1490‧‧‧chipset
1438‧‧‧high-performance graphics circuit
1439‧‧‧high-performance graphics interface
1496‧‧‧interface
1416‧‧‧first bus
1414‧‧‧input/output devices
1418‧‧‧bus bridge
1420‧‧‧second bus
1422‧‧‧keyboard/mouse
1426‧‧‧communication devices
1428‧‧‧data storage unit
1430‧‧‧code
1424‧‧‧audio input/output
1514‧‧‧input/output devices
1515‧‧‧legacy input/output devices
700‧‧‧processing system
790‧‧‧processor architecture
705‧‧‧processor
710‧‧‧system memory
715‧‧‧non-volatile memory
720‧‧‧data storage unit
725‧‧‧communication link
730‧‧‧chipset
707‧‧‧machine-accessible medium
750‧‧‧input/output circuitry
790‧‧‧PLL architecture
380‧‧‧ring
The various embodiments of the present invention are illustrated by way of non-limiting example in the accompanying drawings, in which: FIG. 1 is a flowchart of an embodiment of a computer-implemented method for providing out-of-order prefetch instructions in an in-order pipeline in accordance with an embodiment of the present invention; FIG. 2 illustrates a processor architecture in accordance with an embodiment of the present invention; FIG. 3 illustrates a processor architecture in accordance with another embodiment of the present invention; FIG. 4 is a block diagram of a system in accordance with an embodiment of the present invention; FIG. 5 is a block diagram of a second system in accordance with an embodiment of the present invention; FIG. 6 is a block diagram of a third system in accordance with an embodiment of the present invention; and FIG. 7 is a functional block diagram illustrating a system implemented in accordance with an embodiment of the present invention.
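The idea named in the title (and in the data prefetch queue, element 310) — letting prefetch instructions slip out of program order so they never stall an otherwise in-order pipeline — can be sketched with a toy scheduler. This is purely illustrative, not the patented design: the function name `run_pipeline`, the one-prefetch-per-cycle side-port policy, and the string-based instruction encoding are all invented for this sketch.

```python
from collections import deque

def run_pipeline(program):
    """Toy model of out-of-order prefetch in an in-order pipeline.

    Regular instructions issue strictly in program order.  Prefetch
    instructions are parked in a side queue (cf. the data prefetch
    queue) and fired on a spare port, at most one per cycle, so they
    may slip past younger instructions without blocking the pipe.
    Returns the resulting issue order.
    """
    pf_queue = deque()
    issue_order = []
    for op in program:
        if op.startswith("prefetch"):
            pf_queue.append(op)          # park it; do not stall the pipeline
        else:
            issue_order.append(op)       # in-order issue of the regular op
            if pf_queue:                 # spare port this cycle: fire one prefetch
                issue_order.append(pf_queue.popleft())
    issue_order.extend(pf_queue)         # drain any prefetches still pending
    return issue_order
```

Running this on `["prefetch A", "load B", "add C", "prefetch D", "store E"]` issues `prefetch A` after the younger `load B`, while the non-prefetch instructions still retire in program order — which is the essential reordering freedom the claims describe.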
Claims (20)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2011/066276 WO2013095401A1 (en) | 2011-12-20 | 2011-12-20 | System and method for out-of-order prefetch instructions in an in-order pipeline |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201346755A TW201346755A (en) | 2013-11-16 |
TWI506549B true TWI506549B (en) | 2015-11-01 |
Family
ID=48669060
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW101142862A TWI506549B (en) | 2011-12-20 | 2012-11-16 | System and method for out-of-order prefetch instructions in an in-order pipeline |
Country Status (3)
Country | Link |
---|---|
US (1) | US9442861B2 (en) |
TW (1) | TWI506549B (en) |
WO (1) | WO2013095401A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI737031B (en) * | 2019-04-01 | 2021-08-21 | Silicon Motion, Inc. | Method and computer program product for reading data fragments of a page on multiple planes |
US11216189B2 (en) | 2019-04-01 | 2022-01-04 | Silicon Motion, Inc. | Method and computer program product for reading partial data of a page on multiple planes |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2524063B (en) | 2014-03-13 | 2020-07-01 | Advanced Risc Mach Ltd | Data processing apparatus for executing an access instruction for N threads |
US9430392B2 (en) * | 2014-03-26 | 2016-08-30 | Intel Corporation | Supporting large pages in hardware prefetchers |
US9678758B2 (en) | 2014-09-26 | 2017-06-13 | Qualcomm Incorporated | Coprocessor for out-of-order loads |
US11755484B2 (en) | 2015-06-26 | 2023-09-12 | Microsoft Technology Licensing, Llc | Instruction block allocation |
US10346168B2 (en) | 2015-06-26 | 2019-07-09 | Microsoft Technology Licensing, Llc | Decoupled processor instruction window and operand buffer |
US9940136B2 (en) | 2015-06-26 | 2018-04-10 | Microsoft Technology Licensing, Llc | Reuse of decoded instructions |
US10409606B2 (en) | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Verifying branch targets |
US20170083343A1 (en) * | 2015-09-19 | 2017-03-23 | Microsoft Technology Licensing, Llc | Out of order commit |
US10095519B2 (en) | 2015-09-19 | 2018-10-09 | Microsoft Technology Licensing, Llc | Instruction block address register |
US10175987B2 (en) | 2016-03-17 | 2019-01-08 | International Business Machines Corporation | Instruction prefetching in a computer processor using a prefetch prediction vector |
US20170286118A1 (en) * | 2016-04-01 | 2017-10-05 | Intel Corporation | Processors, methods, systems, and instructions to fetch data to indicated cache level with guaranteed completion |
US10489204B2 (en) | 2017-01-31 | 2019-11-26 | Samsung Electronics Co., Ltd. | Flexible in-order and out-of-order resource allocation |
US10346309B1 (en) | 2017-04-26 | 2019-07-09 | Apple Inc. | Sequential prefetch boost |
CN107291425B (en) * | 2017-06-23 | 2020-11-24 | Shanghai Zhaoxin Integrated Circuit Co., Ltd. | System and method for merging partial write results that solve the rename size problem |
US10394558B2 (en) | 2017-10-06 | 2019-08-27 | International Business Machines Corporation | Executing load-store operations without address translation hardware per load-store unit port |
US10417002B2 (en) | 2017-10-06 | 2019-09-17 | International Business Machines Corporation | Hazard detection of out-of-order execution of load and store instructions in processors without using real addresses |
US11175924B2 (en) | 2017-10-06 | 2021-11-16 | International Business Machines Corporation | Load-store unit with partitioned reorder queues with single cam port |
US10606591B2 (en) | 2017-10-06 | 2020-03-31 | International Business Machines Corporation | Handling effective address synonyms in a load-store unit that operates without address translation |
CN110851372B | 2018-08-20 | 2023-10-31 | Silicon Motion, Inc. | Storage device and cache area addressing method |
CN110851073B | 2018-08-20 | 2023-06-02 | Silicon Motion, Inc. | Storage device and execution method of macro instruction |
TWI702499B (en) * | 2018-08-20 | 2020-08-21 | Silicon Motion, Inc. | Storage device and cache area addressing method |
US11093248B2 (en) | 2018-09-10 | 2021-08-17 | International Business Machines Corporation | Prefetch queue allocation protection bubble in a processor |
CN111415004B (en) * | 2020-03-17 | 2023-11-03 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Method and device for outputting information |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5560029A (en) * | 1991-07-22 | 1996-09-24 | Massachusetts Institute Of Technology | Data processing system with synchronization coprocessor for multiple threads |
TW200805151A (en) * | 2006-02-02 | 2008-01-16 | Ibm | Apparatus and method for handling data cache misses out-of-order for asynchronous pipelines |
US7415633B2 (en) * | 2000-12-22 | 2008-08-19 | Intel Corporation | Method and apparatus for preventing and recovering from TLB corruption by soft error |
TW201106261A (en) * | 2009-08-12 | 2011-02-16 | Via Tech Inc | Microprocessors and storing methods using the same |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5768575A (en) * | 1989-02-24 | 1998-06-16 | Advanced Micro Devices, Inc. | Semi-autonomous RISC pipelines for overlapped execution of RISC-like instructions within the multiple superscalar execution units of a processor having distributed pipeline control for speculative and out-of-order execution of complex instructions |
US6490658B1 (en) * | 1997-06-23 | 2002-12-03 | Sun Microsystems, Inc. | Data prefetch technique using prefetch cache, micro-TLB, and history file |
US6163839A (en) * | 1998-09-30 | 2000-12-19 | Intel Corporation | Non-stalling circular counterflow pipeline processor with reorder buffer |
US20030233530A1 (en) * | 2002-06-14 | 2003-12-18 | International Business Machines Corporation | Enhanced instruction prefetch engine |
US7286548B1 (en) * | 2002-08-14 | 2007-10-23 | Redback Networks Inc. | Method and apparatus for multicast multiple prefetch |
US7404067B2 (en) * | 2003-09-08 | 2008-07-22 | Intel Corporation | Method and apparatus for efficient utilization for prescient instruction prefetch |
US20060136696A1 (en) * | 2004-12-16 | 2006-06-22 | Grayson Brian C | Method and apparatus for address translation |
US20060143401A1 (en) * | 2004-12-27 | 2006-06-29 | Jacob Doweck | Method and apparatus for prefetching based on cache fill buffer hits |
US7624248B1 (en) * | 2006-04-14 | 2009-11-24 | Tilera Corporation | Managing memory in a parallel processing environment |
US8352705B2 (en) * | 2008-01-15 | 2013-01-08 | Vmware, Inc. | Large-page optimization in virtual memory paging systems |
US8161246B2 (en) * | 2009-03-30 | 2012-04-17 | Via Technologies, Inc. | Prefetching of next physically sequential cache line after cache line that includes loaded page table entry |
US8230177B2 (en) * | 2009-05-28 | 2012-07-24 | Oracle America, Inc. | Store prefetching via store queue lookahead |
2011
- 2011-12-20: US US13/995,907 patent/US9442861B2/en active Active
- 2011-12-20: WO PCT/US2011/066276 patent/WO2013095401A1/en active Application Filing
2012
- 2012-11-16: TW TW101142862A patent/TWI506549B/en not_active IP Right Cessation
Also Published As
Publication number | Publication date |
---|---|
US20140195772A1 (en) | 2014-07-10 |
WO2013095401A1 (en) | 2013-06-27 |
TW201346755A (en) | 2013-11-16 |
US9442861B2 (en) | 2016-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI506549B (en) | System and method for out-of-order prefetch instructions in an in-order pipeline | |
Li et al. | Utility-based hybrid memory management | |
JP4170218B2 (en) | Method and apparatus for improving the throughput of a cache-based embedded processor by switching tasks in response to a cache miss | |
JP4939443B2 (en) | Method for performing direct memory access block movement using descriptor prefetch, direct memory access device, and data processing system | |
US20140208075A1 (en) | Systems and method for unblocking a pipeline with spontaneous load deferral and conversion to prefetch | |
US8145848B2 (en) | Processor and method for writeback buffer reuse | |
Sadrosadati et al. | LTRF: Enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching | |
US10203878B2 (en) | Near memory accelerator | |
US9418018B2 (en) | Efficient fill-buffer data forwarding supporting high frequencies | |
US20200285580A1 (en) | Speculative memory activation | |
US10866902B2 (en) | Memory aware reordered source | |
TW201346757A (en) | Managed instruction cache prefetching | |
US20080140934A1 (en) | Store-Through L2 Cache Mode | |
US8490071B2 (en) | Shared prefetching to reduce execution skew in multi-threaded systems | |
US20090006777A1 (en) | Apparatus for reducing cache latency while preserving cache bandwidth in a cache subsystem of a processor | |
US8996833B2 (en) | Multi latency configurable cache | |
TW202018516A (en) | Method, apparatus, and system for reducing pipeline stalls due to address translation misses | |
US11599470B2 (en) | Last-level collective hardware prefetching | |
Bae et al. | Ssdstreamer: Specializing i/o stack for large-scale machine learning | |
Fu et al. | A hardware-efficient dual-source data replication and local broadcast mechanism in distributed shared caches | |
Paik et al. | Dynamic Allocation Mechanism to Reduce Read Latency in Collaboration With a Device Queue in Multichannel Solid-State Devices | |
Forsell et al. | Supporting ordered multiprefix operations in emulated shared memory cmps | |
Cui et al. | Twin-load: Bridging the gap between conventional direct-attached and buffer-on-board memory systems | |
Hoseinghorban et al. | Fast write operations in non-volatile memories using latency masking | |
TWI317065B (en) | Method of accessing cache memory for parallel processing processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |