TW201135460A - Prefetcher, method of prefetch data, computer program product and microprocessor - Google Patents


Info

Publication number
TW201135460A
Authority
TW
Taiwan
Prior art keywords
memory
memory block
cache
access
bit
Prior art date
Application number
TW100110731A
Other languages
Chinese (zh)
Other versions
TWI506434B (en)
Inventor
Rodney E Hooker
John Michael Greer
Original Assignee
Via Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/033,765 external-priority patent/US8762649B2/en
Priority claimed from US13/033,848 external-priority patent/US8719510B2/en
Priority claimed from US13/033,809 external-priority patent/US8645631B2/en
Application filed by Via Tech Inc filed Critical Via Tech Inc
Publication of TW201135460A publication Critical patent/TW201135460A/en
Application granted granted Critical
Publication of TWI506434B publication Critical patent/TWI506434B/en

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data prefetcher in a microprocessor having a cache memory is disclosed. The data prefetcher is configured to receive a plurality of memory accesses, each to an address within a memory block, wherein the memory access addresses are non-monotonically increasing or decreasing as a function of time. The data prefetcher includes a storage element and control logic coupled to the storage element. As the memory accesses are received, the control logic maintains within the storage element a largest address and a smallest address of the accesses, counts of changes to the largest and smallest addresses, and a history of the recently accessed cache lines implicated by the access addresses within the memory block. The data prefetcher determines a predominant access direction based on the counts, determines a predominant access pattern based on the history, and prefetches into the cache memory, in the predominant access direction and according to the predominant access pattern, cache lines of the memory block that the history indicates have not been recently accessed.
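The per-block bookkeeping the abstract describes can be sketched in software. This is an illustrative model only, not the patented hardware: the 64-line block size and the threshold of 2 for declaring a direction are assumptions drawn from the embodiments described later.

```python
# Illustrative model of the per-block state the abstract describes: a bit
# mask of touched cache lines, running smallest/largest line indexes, and
# counters of how often each extreme moved.
class BlockTracker:
    LINES_PER_BLOCK = 64  # assumed: 4 KB block of 64-byte cache lines

    def __init__(self):
        self.mask = 0          # bit i set => cache line i was accessed
        self.min_idx = None    # smallest line index seen so far
        self.max_idx = None    # largest line index seen so far
        self.min_changes = 0   # times min_idx moved downward
        self.max_changes = 0   # times max_idx moved upward
        self.total = 0         # total accesses observed in this block

    def access(self, line_idx):
        self.mask |= 1 << line_idx
        self.total += 1
        if self.min_idx is None or line_idx < self.min_idx:
            self.min_idx = line_idx
            self.min_changes += 1
        if self.max_idx is None or line_idx > self.max_idx:
            self.max_idx = line_idx
            self.max_changes += 1

    def direction(self, threshold=2):
        # Predominant direction: whichever extreme moved more often wins.
        if self.max_changes - self.min_changes >= threshold:
            return "up"
        if self.min_changes - self.max_changes >= threshold:
            return "down"
        return None
```

Because only the set of touched lines and the movement of the extremes matter, the model tolerates accesses that arrive out of program order, which is the central point of the disclosure.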

Description

201135460 六、發明說明: 【發明所屬之技術領域】 本發明係有關於一般微處理器之快取記憶體,特別係 有關將資料預取至微處理器之快取記憶體。 【先前技術】 以歲近的電腦糸統而s ’在快取失敗(cache miss)時, 微處理器存取系統記憶體所需的時間,會比微處理器存取 快取記憶體(cache)多上一或兩個數量級。因此,為了提高 快取命中率(cache hit rate),微處理器整合了預取技術,用 來測試最近資料存取樣態(examine recent data access patterns) ’並且企圖預測哪一個資料為程式下一個存取的對 象,而預取的好處已是眾所皆知的範嘴。 然而,申请人注意到某些程式的存取樣態並不為習知 微處理器之預取單it所能偵測的。例如,g i圖所示為當 執行之私式包括經由記憶體進行—序列之儲存動作時,第 二級快取記憶體(L2Caehe)之存取樣態,而射所描繪者為 2間之記憶體位址。由第】圖可知,雖㈣㈣㈣著 時間而增加記憶體位址,即由往上之方向,然而在許多狀 況下’所指定之存取記憶體位址亦可較前—個時間往下, 而非總趨勢之往上,使其不同於習4 的結果。 於白知預取單元實際所預測 雖然就數1相對大的樣本而言 u 總趨勢係朝一個方向 則進,但習知預取單元在面臨小樣 ^ 不^部可能出現混亂狀 況的原因有兩個。第一個原因為程古 、依循其架構對存取 0608-A43067TW/final 4 201135460 記憶體’不論是由演算法特性或是不佳的編程(p〇〇r programming)所造成。第二個原因為非循序(〇ut_〇f_〇rder execution)微處理器核心之管線與佇列在正常功能下執行 時’常常會用不同於其所產生的程式順序來進行記憶體存 取。 因此’需要一個資料預取單元(器)能夠有效地為程式進 行資料預取,其必須考慮到在較小時窗(time windows)進行 記憶體存取指令(動作)時並不會呈現明顯之趨勢(n〇 dear trend) ’但當以較大樣本數進行審查時則會出現明顯之趨 勢0 【發明内容】 本發明揭露一種預取單元,設置於具有—快取記憶體 之一微處理器中,其中預取單元係用以接收對—記憶體區 塊之複數位址的複數存取要求,每一存取要求對應記憶體 區塊之位址中之一者,並且存取要求之位址係隨著時間函 數非單調性地(non-monotonically)增加或減少。預取單元包 括一儲存裝置以及一控制邏輯。控制邏輯,耦接至儲存= 置,其中當接收到存取要求'時,控制邏輯則用以.維持儲存 裝置中之存取要求之一*大位址以及一最小位±止,以及最 大位址以及最小位址之變化的計數值'維持記憔 雩近被存取之快取線的一歷史記錄,最近被存:之:取 係與存取要求之位址相關、根據計數值,決定一存 根據歷史記錄,決定一存取樣態,並且根據存取子:二: 著存取方向’將快取記憶體内尚未被歷史記錄指二已$ 〇608-A43067TW/final , ’ 201135460 取之快取線預取至記憶體區塊中。 本發明揭露一種資料預取方法,用以預取資 處理器之—快取記憶體,資料預取方法,包括接收對 憶體區塊之複數位址的複數存取要求,每—麵 ° 1憶體區塊之位址中之—者,並且存取要求之位址係隨= 寸門函數非單5周性地增加或減少去 收到存取要求時’維持記憶體區塊中之_最大以及—: 位址,並且計算最大以及最小位址之變化的計數值 收到存取要求時,維持記憶體區塊巾最近被麵之快^ 的歷史圮錄,最近被存取之快取線係與存取要求之、 相關;根據計數值決;t-存取方向;根據歷史紀錄決= 存取樣態;以及根據存取樣態並沿著存取方向,將快 憶體内尚未被歷史記錄指示為已存取之快取線、= 體區塊中。 王元憶 本發明揭露一種電腦程式產品,編碼於至少一 讀取媒體之上,並適用於—計算裝置,電腦程式產。= :電腦可讀程式編碼。電腦可讀程式編碼,儲存於=括 讀取媒體中’用以在具有-快取記憶體之—微處理器中可 定義出(specify)—預取單元,電腦可讀程式其中預取單元係 用以接收對-記憶體區塊之複數位址的複數存取要求,每 一存取要求對應記憶體區塊之位址中之一者,並且存取要 求之位址係隨著時間函數非單調性地(n〇n_m〇n〇t〇nically) 增加或減少。電腦可讀程式包括一第一程式碼以及一第二 程式碼。第一程式碼,用以定義出一儲存裝置。第二程式 碼,用以定義出一控制邏輯,輕接至儲存裝置,其中當接 0608-A43067TW/fmal 6 201135460 二 收到存取要求時,控制邏輯則用以藉由儲存裝置維持之存 取之一取大位址以及一最小位址,並且計算最大以及最小 ,址之變化的計數值、藉由記憶體區塊維持記憶體區塊中 最近被存取之快取線的一歷史記錄、根據計數值決定一存 
取方向、根據歷史紀錄決定一存取樣態並且根據存取樣態 並沿著存取方向,將快取記憶體内尚未被歷史記錄指示為 已存取之快取線預取至記憶體區塊中。 本發明揭露一種微處理器,包括複數核心、一快取記 憶體以及一預取單元。快取記憶體,由核心所共享,用以 接收對一記憶體區塊之複數位址的複數存取要求,每一存 取要求對應記憶體區塊之位址中之一者,存取要求之位址 系^著日才間函數非單調性地(n〇n_m〇n〇t〇njCally)增加或減 ^預取單元,用以監視存取要求,並維持記憶體區塊中 之一最大位址以及一最小位址,以及最大位址以及最小位 ^之變化的計數值、根據計數值,決定一存取方向並且沿 著存取方向,將记憶體區塊中未命中之快取線預取至快取 記憶體中。 雕本發明揭露一種微處理器,包括一第一級快取記憶 體、一第二級快取記憶體以及一預取單元。預取單元用以 债測出現在第二級快取記憶體中之最近存取要求之一方向 丄及樣心,以及根據方向以及樣態,將複數快取線預取至 第級决取a己憶體中、從第一級快取記憶體,接收第一級 快取記憶體所接收之一存取要求之一位址,其中位址與一 7取線相關、決定在方向中所相關之快取線之後被樣態所 指出之-個或多個快取線並且導致一個或多個快取線被預 7 201135460 取至第一級快取記憶體中。 本發明揭露—種¥料方法,用以預取資料至且有 -第二級快取記憶體之—微處理器之㈣ 田 貞劂出現在第二級快取記憶體中之 取要求之—方向以及樣態,以及根據方向以及樣 也,將複數快取線預取至第二級快取記憶體中;從第一級 快取記憶體,接收第-級快取記憶體所接收之—存取要求 二:其I位址與—快取線相關;決定在方向中所相 關之快取線之後被樣態所指出之—個❹鍊取線;以及 ¥致-個或多個快取線被預取至第—級快取記憶體中。 本發明揭露-種電腦程式產品,編碼於至少一電腦可 讀取媒體之上,並適用於一計算裝置,電腦程式產品包括 -電腦可讀程式編碼。一電腦可讀程式編碼,儲存於電腦 可讀取媒體中,用以定義一微處理器,電腦可讀程式包括 第私式碼、-第二程式碼以及一第三程式碼。第一程 式石馬’用以^義-第—級快取記憶體裝置。第二程式碼, 用以定義-第二級快取記憶體裝置。第三程式碼,用以定 義-預取單元’使得預取單元用以偵測出現在第二級快取 記憶體中之最近存取要求之一方向以及樣態,以及根據方 向以及樣態,將複數快取線預取至第二級快取記憶體中、 從第-級快取記憶體,接收第—級快取記憶體所接收之一 存取要求之一位址,其中位址與一快取線相關、決定在方 向中所相關之快取線之後被樣態所指出之一個或多個快取 線並且導致-個或多個快取線被預取至第一級快取記憶體 中。 〜 〇608-A43067TW/final 201135460 本發明揭露一種微處理器,包括一快取記憶.體以及一 預取單元。預取單元用以偵測具有一第一記憶體區塊之複 數記憶體存取要求之一樣態,並且根據樣態從第一記憶體 區塊預取複數快取線至快取記憶體中、監視一第二記憶體 區塊之一新的記憶體存取要求、決定第一記憶體區塊是否 虛擬鄰近於第二記憶體區塊,並且當自第一記憶體區塊延 續至第二記憶體區塊時,則決定樣態是否預測到第二記憶 體區塊之新的記憶體存取要求所相關之一快取線在第二記 憶體區塊中、並且根據樣態’從第二記憶體區塊將相映的 快取線預取至快取記憶體中。 本發明揭露一種資料預取方法,用以預取資料至一微 處理器之一快取記憶體,資料預取方法包括/[貞測具有一第 一記憶體區塊之複數記憶體存取要求之一樣態,並且根據 樣態從第一記憶體區塊預取快取線至上至快取記憶體中; 監視一第二記憶體區塊之一新的記憶體存取要求;決定第 一記憶體區塊是否虛擬鄰近於第二記憶體區塊,並且當自 第一記憶體區塊延續至第二記憶體區塊時,決定樣態是否 預測到第二記憶體區塊之新的記憶體存取要求所相關之一 快取線在第二記憶體區塊中;以及根據樣態,從第二記憶 體區塊將複數快取線預取至快取記憶體中,以回應決定步 驟。 本發明揭露一種電腦程式產品,編碼於至少一電腦可 讀取媒體之上,並且適用於一計算裝置,電腦程式產品包 括一電腦可讀程式編碼,儲存於電腦可讀取媒體,用以定 義一微處理器。電腦可讀程式包括一第一程式碼以及一第 0608-A43067TW/fina] 9 201135460 二程式碼。第一程式碼,用以定義一快取記憶體裝置。第 二程式碼,用以定義一預取裝置’使得預取裝置用以偵測 具有一第一記憶體區塊之存取之一樣態,並且根據樣態從 第一記憶體區塊預取進入快取線、監視一第二記憶體區塊 之一新的存取要求、決定第一記憶體區塊係虛擬鄰近至第 一s己憶體區塊以及樣態,當持續自第一記憶體區塊至第二 記憶體區塊,預測至與具有第二記憶體區塊之新的要求相 
關之一快取線之一存取、並且根據樣態響應地從第二記憶 體區塊預取進入快取記憶體之快取線。 【實施方式】 以下將詳細討論本發明各種實施例之製造及 法。然而值得注意的是,本發明所提供之許多可行的發明 概念可實施在各種料範㈣。這些肢實施龍用於舉 ㈣明本發明之製造及使用方法,但非用於限定本發明之 範圍。 贋之而言 _仍上返問題之解決方法可 以解釋。當—記«之所有麵(齡、動作 =一=上時,所有存取(指令、動作或要求)之 = 加的存取要求亦表示於同; 上述首張圖如第8圖所亍:°周整大小後之定界框圈起來。 或動作Η。猶塊的兩次存取(指令 示具有4KB區塊之存取的二曰二,存取之時間’ Y軸表 描緣第-次之兩個存取 =1、取線之索引。首先, I ^個絲_絲線5進行存 201135460 + 一子要求係對快取線6進行存取。 一定界框將代表存取要求的兩點圈起來。取如圖所示之 再者,第二個存取要求發 — 得代表第三個存取要求的新點可被定二圈:::帳 新的存取不斷發生,定界推必隨 】:二:: 鏟以及下缓的# ^大為向上的例子)°上述定界框上 為白上at動之歷史紀錄將用以決定存取樣態之趨勢 為向上、向下或者都不是。 除了追縱定界框之上緣以及下緣的趨勢以決定—趨勢 方向外’追縱個別的存取要求也是必要的,因為存取要求 跳過-或兩個快取線的事件時常發生。因此,為了避免跳 過所預取快取線的事件發生,〆旦偵測到一向上或向下之 趨勢’預取單元則使用額外的準則決定所要預取之快取 線。由於存取要求趨勢會被重新排列,預取單元會將這歧 暫態的重新排列存取歷史紀錄予以删除。此動作係藉由^ 記位元(marking bit)在一位元遮罩(bit mask)中完成的,每— 位元對應具有一 &amp;己憶體區塊之·/快取線,,且當位元遮罩 中對應之位元被設置時,表示特定之區塊可被存取。—曰 對記憶體區塊的存取要求已達到一充分數量,預取單元會 使用位元遮罩(其中位元遮罩不具有存取之時序的指示), 並基於如下所述之較大的存取觀點(廣義large view)去存取 整個區塊’而泮基於較小的存取觀點(狹義small view)以及 習知預取單元般僅根據存取的時間去存取之區塊。 第2圖所系為本發明之微處理态1〇〇的方塊圖。微處 理器100包祐/個具有複數階層之傳遞路徑,並且傳遞路 0608-A43067TW/final 11 201135460 徑中亦包括各種功能單元。傳遞路徑包括一指令快取記憶 體102 ’指令快取記憶體1 〇2轉接至一指令解碼器1 ;指 令解碼器104耦接至一暫存器別名表1〇6(register&gt; alias table,RAT);暫存器別名表ι〇6耦接至一保留站 108(reservation station);保留站ι〇8耦接至一執行單元 112(execution unit);最後,執行單元112耦接至一引退單 元114(retire unit)。指令解碼器1〇4可包括一指令轉譯器 (instruction translator),用以將巨集指令(例如χ86架構之 巨集指令)轉譯為微處理器1〇〇之類似精簡指令集(reduce instruction set computerRISC)之巨集指令。保留站 1〇8 產 生並且傳送指令至執行單元112,用以使執行單元112依 照程式順序(program order)執行。引退單元114包括一重新 排序緩衝器(reorder buffer) ’用以依據程式順序執行指令之 引退(Retirement)。執行單元112包括載入/儲存單元134以 及其他執行單元132(other execution unit),例如整數單元 (integer unit)、浮點數單元(floating point unit)、分支單元 (branch unit)或者單指令多重資料串流(single hstructi〇n Multiple Data ’ SIMD)單元。載入/儲存單元134用以讀取 第一級資料快取記憶體116(level 1 data cache)之資料,並 且寫入資料至第一級資料快取記憶體116。一第二級快取 記憶體118用以支持(back)第一級資料快取記憶體116以及 指令快取記憶體102。第二級快取記憶體118用以經由一 匯流排介面單元122讀取以及寫入系統記憶體,匯流排介 面單元122係微處理器100與一匯流排(例如一區域匯流排 (local bus)或是記憶體匯流排(memory bus))間之一介面。微 0608-A43067TW/fmal 12 201135460 包括—預取單元124,用以自系統記憶體預 體116。 級快取記憶體118及/或第一級資料快取記憶 鬧。圖所示為第2圖之預取單元124較詳細之方塊 
—、、單凡包括一區塊位元遮罩暫存器3〇2。區塊 罩暫存器3〇2中之每_位元對應具有一記憶體區塊 〇 深,其中記憶體區塊之區塊號碼係儲存在一區塊 號=暫,盗303内。換言之,區塊號碼暫存器3〇3儲存了 隐體區塊之上層位址位元(upper address bits)。當區境位 =罩暫存器3〇2中之一位元的數值為真㈣e她^時, 係才曰出所對應之快取線已經被存取了。初始化區塊位元遮 罩暫存益302將使得所有的位元值為假(false)。在一實 中’記憶體區塊的大小為4KB ’並且快取線之大小為料 =兀組°因此,區塊位元遮罩暫存器搬具有64位元之容 量丄在某些實施例中,記憶體區塊之大小亦可與實體記憶 體分頁(physical memory page)之大小相同。然而,快取 線之大小在其他實施例中可為其他各種不同之大小。、再 者’區塊位元遮罩暫存器3G2上所轉之記憶體區域之大 小是可改變的’並不需要對應於實體記憶體分頁的大小。 更確切的說,區塊位元遮罩暫存器3G2上所維持之記憶體 區域(或區塊)之大小可為任何大小(二的倍數最好),只要= 擁有足夠的快取線以便進行利於獅方向與樣態的債測即 可0 預取單元124亦可包括-最小指標暫存器3〇4_ pointer register)以及一最大指標暫存器3〇6(m 血 0608-A43067TW/fmal 13 201135460 °最小指標暫存器304以及最大指標暫存器3〇6分 別用以在預取單% 124開始追縱—記憶體區塊之存取後, 持續地指向此記憶體區塊中已被存取之最低以及最高之快 取線的食引(index)。預取單元124更包括一最小改變計數 益:以及—最大改變計數器312。最小改變計數器遞 以及=大改變計數器312分別用以在預取單元I24開始追 縱此記憶體區塊之存取後,計算最小指標暫存器304以及 最大才曰^暫存器3〇6改變之次數。預取單元以亦包括一 總計數器314,用,v y- 〜 用以在預取早儿124開始追蹤此記憶體區 之子取後5十算已被存取之快取線的總數。預取單元丨24 亦包括一中間指標暫存器316,用以在預取單元124開始 追縱此記憶體區塊之存取後’指向此記憶體區塊之中間 引(例如最小指標暫存器304之計數值以及 最大改變计數器312之計數值的平均)。預取單元124亦 Γ4: ^ t# ^ 342(direCti〇n register)' ~ ^ i ΐ 1 臾尋一^暫週/月暫存器346、一樣態區域暫存器州以及 一搜哥Μ票暫存器352,纟各功能如下所述。 及 預取單元U4亦包括複數週期匹配計數器318加細 match counter)。每一週期匹配計數器3 i 8 之-計數值。在—實施例中,週期為3、4'm不同週期 指中間指標暫存器316左/右之位元 。週期係 夕舛釤秸Αγη仏^ 匹配计數器318 之计數值在^的母-記憶體存取進行之後更新。 位元遮罩暫存器302指示在週期 田£鬼 配 左邊的存取與對侧標暫存器-右316 0608-A43067TW/final 14 201135460 - 時,預取單元124則接著增加與該週期相關之週期匹配計 數器318之計數值。關於週期匹配計數器318更詳細之應 用以及操作,將特別在下述之第四、五圖講述之。 預取單元124亦包括一預取要求佇列328、一提取指 標器324(pop pointer)以及一推進指標器326(push pointer)。預取要求符列328包括一循環的項目(entry)仵 列,上述項目的每一者用以儲存預取單元124之操作(特別 是關於第4、6以及7圖)所產生之預取要求。推進指標器 326指出將分派至預取要求佇列328的下一個項目(entry)。 提取指標器324指出將從預取要求佇列328移出之下一個 項目。在一實施例中,因為預取要求可能以失非循序的方 式(out of order)結束,所以預取要求仔列328係可以非循失 序的方式提取(popping)已完成的(completed)項目。在一實 施例中,預取要求佇列328的大小係由於線路流程中,所 有要求進入第二級快取記憶體118之標記之線路(tag pipeline)的線路流程而選擇的,於是使得預取要求仔列328 中項目之數目至少和第二級快取記憶體118内之管線層級 (stages)—樣多。預取要求將維持直至第二級快取記憶體 118之管線結束,在這個時間點,要求)可能是三個結果之 一,如第7圖更詳細之敘述,亦即命中(hit in)第二級快取 記憶磕118、重新執行(replay)、或者推進一全佇列管道項 目,用以從系統記憶體預取需要的資料。 0608-A43067TW/fmal 15 201135460 預取單疋124亦包括控制邏輯322,控制邏輯322控 
制預取單元124之各元件執行其功能。 雖」第3圖只顯示出一組與一主動(⑽㈣記憶體區塊 有關之硬體單元332(區塊位域罩暫存器3()2、區境號碼 暫存器則、最小指標暫存器304、最大指標暫存器遍、 最小改變計數器3〇8、最大改變計數器312、總計數器叫、 中間指標暫存n 316、樣態順序暫存器346、樣態區域暫存 器348以及搜尋指標暫存器352),但預取單幻24可包括 複數個如第3圖所示之硬體單元说,収追縱多個主動 記憶體區塊的存取。 在一實施例中,微處理器100亦包括-個或多個高 度反應式的(highly reactive)預取單元(未圖示),高度反應式 的預取單元係在非常小的暫時樣本(sample)中使用不同的 演算法來進行存取,並且與預取單元124配合動作,其說 明如下。由於此處所述之預取單元m分析較大記憶體存 取之數目(相較於高度反應式的預取單元), 更長的時間去開始預取一新的記憶體區塊, 其必趨向使用 如下所述,但 卻比高反應式_取單元更精確。因此,❹高度反應式 的預取單^與預取單元124同時動作,微處iqq可擁 有高反應式的預取單元之更快反應時間以及預取單元i24 之高精確度。另外’預取單元124可監控來自其他預取單 元之要求,並且在其預取演算法中使用這些要求。 〇608-A43067TW/fmal 201135460 如第4圖所示為第2圖之微處理器100的操作流程 圖,並且特別是第3圖之預取單元124的動作。流程開始 於步驟402。 在步驟402中,預取單元124接收一載入/儲存記憶 體存取要求,用以存取對一記憶體位址之一載入/儲存記憶 體存取要求。在一實施例中,預取單元124在判斷預取哪 些快取線時,會將出載入記憶體存取要求與儲存記憶體存 取要求加以區分。在其他實施例中,預取單元124並不會 在判斷預取哪些快取線時,辨別載入以及儲存。在一實施 例中,預取單元124接收載入/儲存單元134所輸出之記憶 體存取要求。預取單元124可接收來自不同來源之記憶體 存取要求,上述來源包括(但不限於)載入/儲存單元134、 第一級資料快取記憶體116(例如第一級資料快取記憶體 116所產生之一分派要求,於載入/儲存單元134記憶體存 取未擊中第一級資料快取記憶體116時),及/或其他來源, 例如微處理器100之用以執行與預取單元124不同預取演 算法以預取資料之其他預取單元(未圖示)。流程進入步驟 404。 在步驟404中,控制邏輯322拫據比較記憶體存取 位址與每一區塊號碼暫存器303之數值,判斷是否對一主 動區塊之記憶體進行存取。也就是,控制邏輯322判斷第 3圖所示之硬體單元332是否已被分派給記憶體存取要求 0608-A43067TW/final 17 201135460 所指定之記憶體位址所相關的記憶體區塊。若是,則進入 步驟406。 在步驟406中,控制邏輯322分派第3圖所示之硬 體單兀332給相關之記憶體區塊。在一實施例中,控制邏 輯322以—輪替(round-robin)的方式分派硬體單元332。在 其他實施例,控制邏輯322為硬體單元332維持最久未用 到的頁取代法(least_recently_used)之資訊,並且以一最久未 用到的頁取代法(least_recently_used)之基礎進行分派。另 外’控制邏輯322 #初始化所分派之硬體單元332。特別 是’控制邏輯322會清除區塊位域罩暫存器逝之所有 位儿’將記憶體存取位址之上層位元填充(pGpulate)至區塊 號碼暫存器3G3 ’並且清除最小指標暫存!| 3G4、最大指標 暫存器3〇6、最小改變計數器308、最大改變計數器312、 總計數器3M以及匹配計數器318為g。流程進入到 步驟408。 在步驟4〇8中’控制邏輯322根據記憶體存取位址 更新硬體單it 332 ’如第5圖所述。流程進入步驟412。 在步驟412中,石争辦留-, 更體早7〇 332測試(examine)總計數 β 314用以判斷程式是否已對記憶體區塊進行足夠之存取 要求讀偵’則存取樣態。在—實施例中,控制邏輯功 判斷總計數器314之朴盤伯θ ^ 冲數值疋否大於一既定值。在—實施 例中,此既定值為1〇,妙二&quot; 、 …、、而此既定值有很多種本發明不 0608-A43067TW/final 201135460 流程進行至步驟414 ; 於此。若已執行足夠之存取要求, 否則流程結束。 ^判斷在區塊位元 暫存器302中所指定的存取要求是 ’… 勒u B 疋否具有一個明顯的趨 勢。也就疋說,控制邏輯322判斷存取要求有明顯向上的 趨勢(存取位址增加)或是向下_勢(存取位址減少)。在— 實施例中,控制邏輯322根據最小改變計數器3〇8以及最 
大改變计數器312兩者的差值(difference)是否大於一既定 值來決定存取要求是否有明顯的趨勢。在一實施例中,既 定值為2’而在其他實施例中既定值可為其他數值。當最 小改變計數器308之計數值大於最大改變計數器312之叶 數值一既定值’則有明顯向下的趨勢;反之,當最大改變 5十數器312之計數值大於最小改變計數器308之計數值— 既定值’則有明顯向上的趨勢。當有一明顯的趨勢已產生, 則進入步驟416,否則結束流程。 在步驟416中,控制邏輯322判斷在區塊位元遮罩 暫存器302所指定的存取要求中是否為具有一明顯的樣態 週期贏家(pattern period winner)。在一實施例中,控制邏輯 322根據週期匹配計數器318之一者與其他週期匹配計數 器318計數值之差值是否大於一既定值來決定是否有_明 顯的樣態週期贏家。在一實施例中,既定值為2,而在其 他貫施例中既定值可為其他數值。週期匹配計數器3 1 8之 〇608-A43067TW/final 19 201135460 更新動作將於第5圖加以詳述。當有_ ,'肩的樣態週期贏 家產生’流程進行到步驟418 ;否則,流程纟士束 在步驟418中’控制邏輯322填充方向暫存器342 以指出步驟414所判斷之明顯的方向趨執 ° 另。另外,控制邏 輯322用在步驟416偵測之清楚赢家樣態週期⑽肛 winning pattern period)N填充樣態順序智左w 只存盗346。最後, 控制邏輯322將㈣416削貞測到之明―㈣· t 以’控制邏輯322用區 塊位元遮罩暫存器302之N位元至中間指標暫存器之 右側或者左側(根據第5圖步驟518所诂二π 所地而匹配)來填充樣 態暫存器344。流程進行到步驟422。 在步驟422中,控制邏輯322椒μ α 科以根據所偵測到之方向 以及樣態開始對記憶體區塊中尚未被預取之快取線 (non-fetchedcacheiine)進行預取(如第6圖中所述)。流程在 步驟422結束。 第5圖所示為第3圖所示之預取單元124執行第4 圖所示之步驟的操作流程。流程開始於步驟逝。 在步驟502中,控制邏輯 得J22增加總計數器314之 計數值。流程進行到步驟5〇4。 在步驟504中,控制邏輯 铒22判斷目前的記憶體存 取位址(特別是指’最近記情體左 μ體存取位址所相關之快取線之 器306之值 記憶體區塊的索引值)是否大於最大指標暫存 0608-A43067TW/final 20 201135460 若是机輊進行到步驟506 ;若否則流程進行至步驟5〇8。 在步驟506中,控制邏輯322用最近記憶體存取位 止所相關之快取線之記憶體區塊的索引值來更新最大指標 暫存益306,並增加最大改變計數器312之計數值。流程 進行到步驟514。 在步驟508中,控制邏輯322判斷被最近記憶體存 取位址所相關之快取線之記憶體區塊的索引值是否小於最 J才曰私暫存态304之值。若是,流程進行至步驟512 ;若 否’則流程進行至步驟514。 在步驟512中,控制邏輯322帛最近記憶體存取位 址所相關之快取線之記憶體區塊的索;丨值來更新最小指標 暫存器304,並增加最小改變計數器3〇8之計數值。流程 進行到步驟514。 在步驟5丨4中,控制邏輯322計算最小指標暫存器 3〇4與最大指標暫存器3〇6之平均值,並且用所算之出平 均值更新中間指標暫存n 316。流程進行到步驟516。 在步驟5!6中,控制邏輯322檢查區塊位元遮罩暫 存器302,並且以中間指標暫存器316為中心,切割成左 側與右側各N位元,其中N為與每—週期匹配計數器训 有關之每—者之位元的位元數。流程進行到步驟518。 。。在步驟5财,控制邏輯322決定在令間指標暫存 态316之左側的N位元是否與中間指標暫存器幻6之右| 〇608-A43067TW/finai ^ 201135460 的N位元相匹配。若是,流程進行到步驟522 ;若否,則 流程結束。 在步驟522中,控制邏輯322增加具有一 N週期之 週期匹配計數器318之計數值。流程結束於步驟522。 第6圖所示為第3圖之預取單元124執行第4圖之 步驟422的操作流程圖。流程開始於步驟602。 在步驟602中,控制邏輯322初始化會在離開偵測 方向之中間指標暫存器316的樣態噸序暫存器346中,對 搜尋指標暫存器3M以及樣態區域暫存器(patt?n locat1〇n)348進行初始化。也就是說,控制邏輯322會將搜 尋指標暫存器352以及樣態區域暫存器348初始化成中間 指標暫存器316與所彳貞測到之週期(N)兩者之間相加/相減 後的值。例如,當中間指標暫存器3 i 6之值為! 
6,n為$ 並且方向暫存器342所示之趨勢為向上時,控制邏輯3: 則將搜尋指標暫存器352以及樣態區域暫存器撕初始 為2卜因此,在本例中,為了比較之目的(如下所述卜 態暫存器344之5位元可設置於區塊位元遮罩暫存器3( 之位元21到25。流程進行到步驟604。 在步驟604中,控制邏輯322測試區塊位元遮罩 2中在方向暫存器342所指之位元以及樣能暫存 344中之對應位元(該位元係位於樣態區域 用以對應區塊位元遮罩暫存哭I ^ 22 201135460 體區鬼中之對應快取線。流程進行到步驟606。 之快取^步^ _中’控制邏輯322預測是否需要所測試 線。當樣態暫存器⑽之位元為真㈣,控制邏輯 ^-彳此㈣物的,咖㈣將會^^ =:若快靖需要的,流程進行綱614 , 曰否已^驟6〇8中,控制邏輯322根據方向暫存器342 ^區塊中^區塊位元遮罩暫存器3〇2之盡頭,判斷在記憶 :π否有其他未測試之快取線。若已無未測試之快 則·結束;否則,流程進行至步驟612。 342之^驟612中,控制邏輯322增加/減少方向暫存器 ^ 。另外,若方向暫存器342已超過樣態暫存器344 1ΓΓ位元時,控制邏輯322將用方向暫存器如之新 新樣態區域暫存器州,例如將樣態暫存器344轉 移(Shlft)至方向暫存器342之位置。流程進行到步驟604。 θ在步驟614中,控制邏輯322決定所需要之快取線 是否已被預取。當區塊位元遮罩暫存器302之位元為直, 控制邏輯奶則判斷所需要之快取線已被預取。若所需要 之快取線已被預取’流程進行到步驟608 ;否則,流程進 行到步驟616。 在判斷步驟616中,若方向暫存器342為向下,控 制邏輯322決定判斷列入參考之快取線是否自最小指標暫 0608-A43067TW/final 201135460 存器304多於一既定值(既定值在一實施例中為16);,或 者若方向暫存器342為向上,控制邏輯322將判斷決定列 入參考之快取線是否自最大指標暫存器306多於一既定 值。右控制邏輯322決定列入參考之多於上述的判斷為真 既定值,則流程結束;否則,流程進行到判斷步驟618。 值得注意的是,若快取線大幅多於(遠離)最小指標暫存器 304/最大指標暫存器306則流程結束,但這樣並不代表預 取單元124將不接著預取記憶體區塊之其它快取線,因為 根據第4圖之步驟’對記憶體區塊之快取線的後續存取亦 會再觸發更多的預取動作。 在步驟618中,控制邏輯322判斷預取要求仔列328 是否滿了。若是預取要求佇列328滿了’則流程進行到步 驟622,否則流程進行到步驟624。 在步驟622中,控制邏輯322暫停(stall)直到預取要 求仔列328不滿(n〇n-full)為土。流程進行到步驟624。 在步驟624中,控制邏輯322推進一項目(entry)至 預取要求佇列328,以預取快取線。流程進行到步驟608。 如第7圖所示為第3圖t預取要求佇列328的操作 流程圖。流程開始於步驟702。 在步驟702中,在步驊624中被推進到預取要求佇 列328中之一預取要求被允許進行存取(其中此預取要求用 以對第二級快取記憶體118進行存取)’並繼續進行至第二 0608-A43067TW/final 201135460 - 級快取記憶體118的管道。流程進行到步驟704。 在步驟704中,第二級快取記憶體118判斷快取線 位址是否命中第二級快取記憶體118。若快取線位址命中 第二級快取記憶體118,則流程進行到步驟706 ;否則,流 程進行到判斷步驟708。 在步驟706中,因為快取線已經在第二級快取記憶 體118中準備好,故不需要預取快取線,流程結束。 在步驟708中,控制邏輯322判斷第二級快取記憶 體118之回應是否為此預取要求必須被重新執行。若是, 則流程進行至步驟712 ;否則,流程進行至步驟714。 在步驟712中,預取快取線之預取要求係重新推進 (re-pushed)至預取要求仔列328中。流程結束於步驟712。 在步驟714中,第二級快取記憶體118推進一要求 至微處理器100之一全佇列(fill queue)(未圖示)中,用以要 求匯流排介面單元122將快取線讀取至微處理器100中。 流程結束於步驟714。 如第9圖所示為第2圖之微處理器100的操作範 例。如第9圖所示為對一記憶體區塊進行十次存取後,區 塊位元遮罩暫存器3 0 2 (在一位元位置上之星號表示對所對 應之快取線進行存取)、最小改變計數器308、最大改變計 數器312、以及總計數器314在第一、第二以及第十存取 之内容。在第9圖中,最小改變計數器308稱 0608-A43067TW/final 25 201135460 
為”cntr一min_change”,最大改變計數器312稱 為 ’’cntr一max_change”,以及總計數器 3! 4 稱為”cntr_t〇tal”。 中間指標暫存器316之位置在第9圖中則以”μ”所指示。 由於對位址0x4dced300所進行的第一次存取(如第4 圖之步驟402)係在記憶體區塊中位於索引12上的快取線 上進行,因此控制邏輯322將設定區塊位元遮罩暫存器3〇2 之位元12 (第4圖之步驟408),如圖所示。另外,控制邏 輯322將更新最小改變計數器3〇8、最大改變計數器312 以及總計數器314(第5圖之步驟502、506以及512)。 由於對位址0x4Ced260之第二次存取係在記憶體區 塊中位於索引9上的快取線進行,控制邏輯322根據將設 定區塊位元遮罩暫存器302之位元9,如圖所示。另外, 控制邂輯322將更新最小改變計數器308以及總計數器314 之計數值。 在第三到第十次存取中(第三到第九次存取之位址 未予圖示,第十次的存取位址為0x4dced6c0),控制邏輯 322根據會對區塊位元遮罩暫存器進行適當元之設 置,如圖所示。另外,控制邏輯322對應於每一次存取更 新最小改變計數11通、最大改變計數器312以及總計數 器314之計數值。 “ 在每個執行十次的記憶 522後的週期匹配計數 第9圖底部為控制邏輯322 體的存取中,當執行完步驟514到 〇608-A43067TW/final 26 201135460 —318之内容。在第9圖中,週期匹配計數器3Ϊ8稱 為”她―peri〇d_N_matches”,其中 ν為卜2、3、4 或者 5。 如第9圖所示之範例,雖然符合步驟412的準則(總 。十數器314至少為十)以及符合步驟416的準則(週期$之 週/月匹配汁數益318較其他所有之週期匹配計數器318至 少大於2),但不符合步驟414的準則(最小改變計數器 Τ及區塊位元遮罩暫存器3〇2之間的差少於2)。因此,此 時將不會在此記憶體區塊内執行預取操作。 如第9圖底部亦顯示在週期3、4以及5中,從週期 3 4以及5至中間指標暫存器316之右側與左側的樣態。 如第10圖所示為第2圖之微處理器延續第9圖 所示之範例的操作流程圖。第1〇圖描繪相似於第9圖之資 訊,但不同處於在對記憶體區塊之進行第十一次以及第十 人的存取(第十一次存取之位址為〇x4dced76〇)。如圖所 不,其符合步驟412的準則(總計數器314至少為十)、步 驟414的準則(最小改變計數器3〇8以及區塊位元遮罩暫存 302之間的差至少為2)以及步驟416的準則(週期5之週 期匹配計數器318在週期5之計數較其他所有之週期匹配 計數器318至少大於2)。因此,根據第4圖之步驟418, 控制邏輯322填充(populate)方向暫存器342(用以指出方向 趨勢為向上)、樣態順序暫存器346 (填入數值5)、樣態暫 存器344(用樣態,,**,,或者,,〇1〇1〇,,)。控制邏輯322亦根據 〇608-A43067TW/fina! 
201135460 第4圖之步驟422與第6圖,為記憶體區塊執行預取預測, 如第圖所示。第10圖亦顯示控制邏輯322在第6圖之 步驟602操作中’方向暫存器342在位元21之位置。 如第11圖所示為第2圖之微處理器1〇〇延續第9以 及10圖之範例的操作流程圖。» U ®經由範例中描繪十 不同範例之每—者(表標示成G到11)經過第6圖之步驟 604到父驟616直到記憶體區塊之快取線被預取單元 預測發現需要被預取之記憶體區塊之的操作。如圖所示, 在每範例中’方向暫存器342的值是根據第6圖步驟612 而曰加如第11圖所示,在範例5以及1〇中,樣態區域 暫存器348會根據第6圖之倾612被更新。如範例〇、2、 4、5、7以及1〇所示’由於在方向暫存器342之位元為假 (false),樣態指出在方向暫存器⑷上之快取線將不被需 要。圖中更顯示,在範例卜3、6以及8中,由於在方向 暫存器342中樣態暫存器344之位元為真㈣,樣態暫存 器344指出在方向暫存器342上的快取線將被需要,然而 快取線已經準備被取时etehed),如區塊位元遮罩暫存器 302之位元為真(ture)之指示。最後如圖所示,在範例u中, 由於在方向暫存器342中樣態暫存器344之位元為直 (㈣’所以樣態暫存器344指出在方向暫存器⑷上之快 取線將被需要’但是因區塊位元遮罩暫存器302之位元為 饭(false) ’所以此快取線尚未被取出㈣加幻。因此,控制 0608-A43067TW/final 28 201135460 邏輯322根據第6圖之步驟624推進—預取要求至預取要 求fr歹〗328中,用以預取在位址⑻之快取線,其 對應於在區塊位域罩暫存器搬之位元32。 在貝施例中,所描述之一或多個既定值係可藉由 操作系統(例如經由-樣態特定暫存器(mQdei specie 叫▲,廳R))或者經由微處理器1〇〇之溶絲(fuses)來編 程,其中炫絲可在微處理器刚的生產過程中溶斷。 在一貝施例中,區塊位元遮罩暫存器302之大小可 為了節省電源(power)以與及裸片晶片大小機板(die㈣ estate)而減小。也就是言兒’在每一區塊位元遮罩暫存器如 中的位元數’將少於在-記憶體區塊中快取線之數量。例 如,在-實施例中,每-區塊位元遮罩暫存器3〇2之位元 數僅為記憶體區塊所包含之快取線之數量的—半。區塊位 元遮罩暫存器搬僅追縱對上半區塊或者下半區翻存位 取,端看記憶體區塊的那-半先被存取,而—額外之位元 用以指出記憶體區塊之下半部或者上半部是否先被存取Γ 在一實施例中,控制邏輯322並不如步驟516⑽ 所述地測試中間指標暫存器316上下N位元,而是包括— 序列引擎(senaUng㈣,-次—個或兩個位元地掃:區塊 位元遮罩暫存H 3G2,Μ尋找大於—最大週期之樣 態(如前所述為5位元)。 在-實施例中,若在步驟414沒有偵測出明顯的方 0608-A43067TW/fmal 29 201135460 向趨勢、或者在步驟416並未偵測出明顯的樣態週期、以 及總計數器314之計數值到達一既定臨界值(用以指出在記 憶體區塊令之大部份的快取線已被存取)時,控制邏輯 則繼續執行以及預取在記憶體區塊中剩下的快取線。上述 既定臨界值係為記憶體區塊之快取記憶體數量之一相對高 的百分比值,例如區塊位元遮罩暫存器3〇2之位元的值。 第-絲取記憶快取記憧體夕炉中 οσ 一 单元 近代的微處理器包括具有一階層結構之快取記憶 體。典型地’-微處理器包括一又小又快的第一級資料快 取》己It肢以及較大但較慢之第二級快取記憶體,分別如 ^ 2圖之第-級資料快取記憶體116以及第二級快取記憶 把118。具有-階層結構之快取記憶體有利於預取資料至 快取記憶體’以改善快取記憶體之命中率速度⑽咖)。由 於第一級資料快取記憶體116之速度較快,故較佳的狀況 為預取資料至第-級資料快取記憶體ιΐ6。然而,由於第 級貝料快取記憶體116之記憶體容量較小,快取記憶體 命中之速度率可能實際上較差變慢,由於如果預取單元不 正確預取資料進第一級資料快取記憶體U6使得最後資料 W不而要的’便需要而替代以其他需要的資料做替代。 因此貝料被載人第—級資料快取記憶體116或者第二級 〇608-A43067TW/final 201135460 Γ取記憶體118的結果,軸取單^是否能正確預測資料 疋否被需要的函數(funeti°n)。因為第—級資料快取記憶體 ⑴被要求較小的尺寸,第一級資料快取記憶體】關向 較小之容量以及因此具有較差的準確性;反之,由於第二 級快取記憶體標籤以及資料陣列之大小使得第—級快取記 憶體預取單元之大小顯得很小,所以—第二級快取記憶體 預取單元可為較大之容量因此具有較佳之準確性。 本發明實施例所述微處理器2⑽的優勢,在於一載 
入/儲存單幻34用以作為第二級快取記憶體ιΐ8以及第一 級資料快取記憶體]16之預取需要之基礎。本發明之實施 例提升載入/儲存單元134(第二級快取記憶體ιΐ8)之準確 陡用以應用在解決上述預取進入第一級資料快取記憶體 116之問題。再者,實施例中也完成了運用單體邏輯(single body 〇fIogic)來處理第一級資料快取記憶體⑴以及第二 級快取記憶體118之預取操作的目標。 如第12圖所示為根據本發明各實施例之微處理器 100。第12圖之微處理器100相似於第2圖之微處理器_ 並具有如下所述之額外的特性。 第一級資料快取記憶體116提供第一級資料記憶體 位址196至預取單&amp; 124。第一級資料記憶體位址196係 藉由載人/儲存單元134對第-級資料快取記憶體116進行 載入/儲存存取的實體位址。也就是說,預取單元124會隨 0608-A43067TW/fmal , 201135460 著載入/儲存單元134存取第—級資料快取記憶體m時進 仃竊聽(eavesdr〇ps)。預取單元124提供一樣態預測快取線 位址194至第一級資料快取記憶冑116之-仔列198,樣 態賴快取線位址194為快取線之位址,其中之快取線係 預,單疋124根據第一級資料記憶體位址196預測載入/儲 子單7G 134即將對第-級資料快取記憶體出所提出之要 求。第-級資料快取記憶體116提供—快取線配置要求⑼ 至預取f元124 ’用以從第二級快取記憶體ιι8要求快取 線而k些快取線之位址係儲存於传列⑽中。最後,第 二級快取記憶體m提供所要求之快取線資料m至第一 級資料快取記憶體116。 預取單元124,亦包括第一級資料搜尋指標器]72以 及第級&quot;貝料樣悲位址178,如第12圖所示。第一級資料 搜尋才曰U 172以及第-級資料樣態位址178之用途與第 4圖相關且如下所述。 。如第13圖所示為第12圖之預取單元124的操作流 程圖。流程開始於步驟13〇2。 在步驟1302中,預取單元124從第一級資料快取記 隐體116接收第12圖之第一級資料記憶體位土止196。流程 進行到步驟1304。 在步驟1304中,由於預取單元124已事先偵測到一 存取樣態並已開始從系統記憶體預取快取線進入第二級快 °6〇8-A43〇67TW/final ” 201135460 : 取記憶體118,故預取單元124偵測屬於一記憶體區塊(例 如分頁(page))之第一級資料記憶體位址196,如第】至n 圖中相關處所述。仔細而言,由於存取樣態已被偵測,故 預取單元124用以維持(maintain)—區塊號碼暫存器303, 其指定記憶體區塊之基本位址。預取單元124藉由偵測區 塊號碼暫存器303之位元是否匹配第一級資料記憶體位址 196之對應位元,來偵測第一級資料記憶體位址196是否 落在記憶體區塊中。流程進行到步驟13〇6。 在步驟1306中,從第一級資料記憶體位址1開 始,預取單元124在記憶體區塊中所偵測到之存取方向 (detected access direction)上尋找下兩個快取線,這兩個快 取線與先前所偵測的存取方向有關。步驟13〇6更詳細之執 行操作將於後續的第14圖中加以說明。流程進行到步驟 1308。 在步驟1308中,預取單元124提供在步驟13〇6找 到之下兩個快取線之實體位址至第一級資料快取記憶體 116,作為樣態預測快取線位址194。在其他實施例中,預 取單元丨24所提供之快取線位址的數量可多於或少於2。 流程進行到步驟1312。 在步驟1312中,第一級資料快取記憶體116把在步 驟1308中所提供之位址推進至佇列198中。流程進行到步 驟 1314 。 0608-A43067TW/fmal 201135460 在步驟1314中,無論何時只要佇列198為非空 (non-empty),第一級資料快取記憶體116將下一個位址取 出佇列198,並發出一快取線配置要求192至第二級快取 s己憶體118,以便取得在該位址之快取線。然而,若在佇 列198之一位址已出現於第一級資料快取記憶體116,第 一級負料快取記憶體116將拋棄(dumps)該位址以及放棄自 第二級快取記憶體118要求其快取線。第二級快取記憶體 118接著提供所要求之快取線資料188至第一級資料快取 記憶體116。流程結束於步驟1314。 如第14圖所示為第12圖所示之預取單元124根據 第13圖之步驟1306的操作流程圖。第14圖所敘述之操作 係在第3圖所偵測到樣態方向為向上(叫^肛句的狀泥下'。 然而,右所偵測到之樣態方向為向下,預取單元亦可 用以執行同樣的功能。步驟!術到剛之操作係用以將 第3圖中之樣態暫存器344放置在記憶體區塊中適當的位 置’使得預取單A 
124藉由從第一級資料記憶體位址⑼ 上開始的樣態暫存器344之樣態搜尋下兩個快取線中進行 搜尋,並只要有需求時在該記憶體區塊上複製該樣態暫存 器344之樣態344即可。、流程開始於步驟14〇2。 在步驟中,預取單元124咖以於第6圖在步 驟6〇2初始化搜尋指標暫存器352以及樣態區域暫存器⑽ 之方式’用第3圖之樣態順序暫存器以及中間指。 0608-A43067TW/final 日标 $ 34 201135460 ; 存器316的總和,來初始化第12圖之第一級資料搜尋指標 器172以及第一級資料樣態位址178。例如,若中間指標 暫存器316之值為16以及樣態順序暫存器346為5,並且 方向暫存器342之方向為往上,預取單元124初始化第一 級資料搜尋指標器Π2以及第一級資料樣態位址178至 21。流程進行到步驟1414。 在步驟14014中,預取單元124決定第一級資料記 憶體位址196是否落入在具有目前所指定位置之樣態暫存 器344之樣態中,樣態的目前位置開始係根據步驟1402所 決定的,並可根據步驟1406進行更新。也就是說,預取單 元124決定第一級資料記憶體位址196之適當位元(relevant bits)的值(即除了去確認記憶體區塊的位元,以及具有快取 線中用來之指定位元組補償偏移(byte offset)的位元外),是 否大於或者等於第一級資料搜尋指標器172之值,以及是 否小於或者等於第一級資料搜尋指標器172之值與樣態順 序暫存器346之值兩者所相加之總合。若第一級資料記憶 體位址196落入(fall within)樣態暫存器344之樣態中,流 程進行到步驟1408 ;否則流程進行到步驟1406。 在步驟1406中,預取單元124根據樣態順序暫存器 346增加第一級資料搜尋指標器172以及第一級資料樣態 位址178。根據步驟1406(與後續之步驟1418)所述之操作, 若第一級資料搜尋指標器172已達到記憶體區塊之終點則 0608-A43067TW/fmal 35 201135460 結束搜尋。流程回到步驟1404。 在步驟1408中,預取單元124將第一級資料搜尋指 標器172之值設置(set)為第一級資料記憶體位址196所相 關之快取線之記憶體頁的偏移量(offset)。流程進行到步驟 1412。 在步驟1412中,預取單元124在第一級資料搜尋指 標器172中測試樣態暫存器344中之位元。流程進行到步 驟 1414 。 在步驟1414中,預取單元124決定步驟1412所測 試之位元是否設置好了。如果在步驟1412所測試之位元設 置好了,流程進行到步驟1416;否則流程進行到步驟1418。 在步驟1416中,預取單元124將步驟1414被樣態 暫存器344所預測之快取線標記為已準備好傳送實體位址 至第一級資料快取記憶體116,以作為一樣態預測快取線 位址194。流程結束於步驟1416。 在步驟1418中,預取單元124增加第一級資料搜尋 指標器172之值。另外,若第一級資料搜尋指標器172已 超過上述樣態暫存器344之最後一個位元,預取單元124 則用第一級資料搜尋指標器172之新的數值更新第一級資 料搜尋指標器Π2之值,亦即轉換(shift)樣態暫存器344至 新的第一級資料搜尋指標器172的位置。步驟1412到1418 之操作係反覆執行,直到兩快取線(或者快取線之其他既定 0608-A43067TW/final 36 201135460 值)被找到為止。流程結束於步驟1418。 第13圖中預取快取線至第-級資料快取記憶體116 的好處係第-級資料快取記憶體ιι6以及第二級快取記憶 體118所需要之改變較小。然而,在其他實施例中,預取 單元124亦可不提供樣態預測快取線位址194至第-級資 料快取記憶體116。例如’在—實施例中,預取單元124 直接要求匯流排介面單元122自記憶體獲擷取快取線,然 後將所接收之寫人絲線寫人至第—級㈣快取記憶體 在另-實施例中,預取單元124自用以提供資料至預 取早70 124的第二級快取記憶冑118要求並取得快取線(如 果為命中失敗(missing)職記憶體取得快取線),並將收到 之快取線寫入至第-級資料快取記憶體116。在其他實施 例中,預取單元124自第二級快取記憶體118要求快取線 (如果為命巾失敗(missin_從記憶體取得快取線),其直接 將快取線寫入第一級資料快取記憶體116。 如上所述’本#明之各實施例的好處在於具有單一 的預取單元m總計數器314,作為第二級快取記憶體】18 以及第-級資料快取記憶體116兩者之預取f要之基礎。 雖然第2、n以及15圖所示(如下討論之内容)為名明不同 之區塊,預取單元m在㈣安排上可佔據鄰近於第二級 快取記憶體! 
i 8之標籤㈣以及資料列(_虹㈣之位置 並且概念上包括第二級快取記憶體118,如第2】圖所示 〇608-A43067TW/finaI 37 201135460 各實施例允許載人/儲存單元134具大空間之安排來提升之: 其精確度與其大空間之需求,以應用一單體邏輯來處理第 一級資料快取記憶體116以及第二級快取記憶體⑴之預 取操作’以解決習知技術中只能預取進人資料給容量較小 的第一級資料快取記憶體116之問題。 warm-up penaltvH^i ^ 取單元 本發韻叙難料124在—雜縣塊(例如, -貫體記憶體頁)上偵測較複雜之存取樣態(例如,一實體 記憶體頁)’其不同於習知一般預取單元之積測。舉例而 έ ’預取單元124可以根據—樣態制正在進行存取一記 憶體區塊之程式,即使微處理器1〇〇之非循失序執行 (out-of-onier execution)管線(pipeline)會不以程式命令的順 序而重新排序(re-order)記憶體存取,這可能會造成習知一 般預取單元不去偵測記憶體存取樣態以及而導致沒有預取 動作。这疋由於預取單元124只考慮對記憶體區塊之進行 有效地存取’而時間順序(time order)並非其考量點。 然而,為了滿足辨識更複雜之存取樣態及/或重新排 序存取樣態之能力,相較於習知的預取單元,本發明之預 取單元124可能需要一較長之時間去偵測存取樣態,如下 所述之’’暖機時間(warm-up time)”。因此需要一減少預取單 0608-A43067TW/final 38 201135460 ; 元124暖機時間之方法。 預取單元124 “預測—個之前藉由—存取樣態正 在存卜記憶體區塊之程式,是否已經跨到(⑽心術)實 際上與售的記憶體區塊相鄰之一新記憶體區塊,以及預測 此程式是否會根據相同之樣態繼續存取這個新的記憶體區 a應於此’預取單元】24使用來自舊的記憶體區塊之 ' 方向以及其他相關資訊,以加快在新的記憶體區塊 谓測存取樣態的速度,即減少暖機時間。 如第15圖所示為具有一預取單元124之微處理器 100的方塊圖。第15圖之微處理器^⑻相似於第2以及丄2 圖之微處理器100,並且具有如下所述之其它特性。 如第3圖中之相關敘述,預取單元124包括複數硬 體單元332。每一硬體單元332相較於第3圖所述更包括 5己憶體區塊虛擬雜湊虛擬位址攔(hashed virtual address ofmemory,HVAMB)354 以及一狀態攔(status)356。在第 4 圖所述之步驟406初始化已分派之硬體單元332的過程 中,預取單元124取出區塊號碼暫存器303中之實體區塊 碼(physical block number),並在將實體區塊碼轉譯成一虛 擬位址後,根據後續第17圖所述之步驟1704所執行之相 同雜湊法則(the same hashing algorithm)將實體區塊碼轉譯 成一虛擬位址(雜湊(hash)此之虛擬位址),並將其雜湊演算 之結果儲存至記憶體區塊虛擬雜湊位址欄354。狀態欄356 0608-A43067TW/fmal 39 201135460201135460 VI. Description of the Invention: [Technical Field] The present invention relates to a cache memory of a general microprocessor, and more particularly to a cache memory for prefetching data to a microprocessor. [Prior Art] With a computer system that is close to the old one, when the cache miss occurs, the time required for the microprocessor to access the system memory is higher than that of the microprocessor accessing the cache (cache). ) One or two orders of magnitude more. 
Therefore, to increase the cache hit rate, microprocessors integrate prefetch techniques that examine recent data access patterns and attempt to predict which data the program will access next; the benefits of prefetching are well known. However, the Applicant has noticed that the access patterns of some programs are not detectable by the prefetch units of conventional microprocessors. For example, Figure 1 shows the pattern of accesses to the second-level cache (L2 cache) while a program performs a sequence of stores through memory, with time on one axis and memory address on the other. As the figure shows, although the memory addresses generally increase over time (i.e., trend upward), in many cases an individual access address is lower than the previous one rather than following the overall upward trend, which differs from what a conventional prefetch unit would predict. Although over a relatively large sample of accesses the general trend is in one direction, there are two reasons a conventional prefetch unit can be confused by small samples. The first is the way the program accesses memory according to its design, whether caused by the nature of its algorithm or by poor programming. The second is that the pipelines and queues of an out-of-order-execution microprocessor core, operating normally, often perform memory accesses in an order different from the program order that generated them. Therefore, a data prefetch unit is needed that can effectively prefetch data for programs whose memory accesses exhibit no clear trend when examined over small time windows, but which exhibit a clear trend when examined over a larger number of samples.
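The distinction drawn here — no clear trend in a small window, a clear trend over a larger sample — can be shown with a toy trace. The line indexes below are invented for illustration, in the spirit of Figure 1:

```python
# A hypothetical access trace: line indexes trend upward over the whole
# window but frequently step backward, so consecutive deltas give a
# simple stride detector nothing consistent to lock onto.
trace = [12, 9, 13, 15, 14, 16, 18, 17, 20, 19]

deltas = [b - a for a, b in zip(trace, trace[1:])]
monotonic = all(d > 0 for d in deltas)   # the small-window (stride) view
net_trend = trace[-1] - trace[0]         # the large-sample view

print(monotonic, net_trend)  # False 7
```

A stride-based prefetcher sees the negative deltas and finds no stable stride, while the large-sample view still shows a net upward movement of several cache lines.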
SUMMARY OF THE INVENTION

The present invention discloses a prefetch unit in a microprocessor having a cache memory. The prefetch unit is configured to receive a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The prefetch unit includes a storage device and control logic coupled to the storage device. As the access requests are received, the control logic maintains within the storage device a largest address and a smallest address of the access requests, together with counts of changes to the largest and smallest addresses, and maintains a history of recently accessed cache lines implicated by the access request addresses within the memory block. The control logic determines an access direction based on the counts, determines an access pattern based on the history, and, in the access direction and according to the access pattern, prefetches into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.

The present invention also discloses a data prefetch method for prefetching data into a cache memory of a microprocessor. The method includes receiving a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time; maintaining, as the access requests are received, a largest address and a smallest address within the memory block and computing counts of changes to the largest and smallest addresses; and maintaining, as the access requests are received, a history of the cache lines of the memory block that have recently been accessed.
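One way to read the final step — prefetch, in the detected direction and according to the detected pattern, the lines the history has not yet seen — is sketched below. The encoding of the pattern as one bit per line, repeated with period N across the block, is an assumption made for illustration:

```python
# Sketch: replicate a detected period-N pattern across the block in the
# upward direction and queue any line the pattern predicts that the
# history mask has not already recorded as accessed.
def lines_to_prefetch(history_mask, pattern, period, start, nlines=64):
    out = []
    for idx in range(start, nlines):
        wanted = (pattern >> ((idx - start) % period)) & 1
        seen = (history_mask >> idx) & 1
        if wanted and not seen:
            out.append(idx)
    return out
```

Lines the history already marks as accessed are skipped, so only the not-yet-touched lines the pattern predicts are requested.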
The recently accessed cache lines are those implicated by the access request addresses. The method further includes determining an access direction based on the counts; determining an access pattern based on the history; and, in the access direction and according to the access pattern, prefetching into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.

The present invention also discloses a computer program product encoded on at least one computer-readable medium for use with a computing device. The computer program product includes computer-readable program code, stored in the computer-readable medium, for specifying a prefetch unit in a microprocessor having a cache memory. The prefetch unit is configured to receive a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The computer-readable program code includes first program code for specifying a storage device and second program code for specifying control logic coupled to the storage device. As the access requests are received, the control logic maintains within the storage device a largest address and a smallest address of the access requests, computes counts of changes to the largest and smallest addresses, maintains a history of recently accessed cache lines of the memory block, determines an access direction based on the counts, determines an access pattern based on the history, and, in the access direction and according to the access pattern, prefetches into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.
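The "determine an access pattern based on the history" step is elaborated in the detailed embodiments as comparing, for several candidate periods, the history bits on either side of a middle pointer. A minimal sketch under those assumptions (the cap of five candidate periods is taken from the embodiment; the rest is illustrative):

```python
# Sketch: for each candidate period N, compare the N history bits just
# below the middle pointer with the N bits just above it; a match is a
# vote for that period (the embodiment keeps a counter per period and
# updates it on every access).
def period_matches(mask, mid, max_period=5):
    votes = {}
    for n in range(1, min(mid, max_period) + 1):
        below = (mask >> (mid - n)) & ((1 << n) - 1)
        above = (mask >> mid) & ((1 << n) - 1)
        votes[n] = below == above
    return votes
```

Over many accesses, the period whose counter clearly leads the others (by a threshold such as 2 in the embodiment) is taken as the pattern period.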
The present invention further discloses a microprocessor including a plurality of cores, a cache memory, and a prefetch unit. The cache memory, shared by the cores, receives a plurality of access requests, each to one of a plurality of addresses within a memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The prefetch unit monitors the access requests and maintains a largest address and a smallest address within the memory block, together with counts of changes to the largest and smallest addresses; determines an access direction based on the counts; and, in the access direction, prefetches missing cache lines of the memory block into the cache memory.

The present invention also discloses a microprocessor including a first-level cache memory, a second-level cache memory, and a prefetch unit. The prefetch unit detects a direction and a pattern of recent access requests appearing in the second-level cache memory and, based on the direction and pattern, prefetches a plurality of cache lines into the second-level cache memory. The prefetch unit also receives from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determines one or more cache lines indicated by the pattern as following, in the direction, the associated cache line; and causes the one or more cache lines to be prefetched into the first-level cache memory.
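For the two-level claim, the step of determining the cache lines the pattern indicates as following the accessed line might look like this in software. The choice of two lines ahead follows the embodiment described later; the one-bit-per-line pattern encoding and the parameter names are assumptions for illustration:

```python
# Sketch: starting just past the line the L1 access touched, walk in the
# upward direction and collect the next `count` line indexes the repeating
# pattern marks as wanted; their addresses would then be handed to the
# first-level cache's request queue.
def next_predicted(pattern, period, pattern_start, from_idx, count=2, nlines=64):
    out = []
    idx = from_idx + 1
    while idx < nlines and len(out) < count:
        if (pattern >> ((idx - pattern_start) % period)) & 1:
            out.append(idx)
        idx += 1
    return out
```

The second-level cache then supplies any of those lines it holds (fetching the rest from memory), so the smaller first-level cache benefits from the larger unit's more accurate prediction.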
The invention discloses a data prefetch method for prefetching data into a microprocessor having a first-level cache memory and a second-level cache memory. The method comprises: detecting a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receiving from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address implicates a cache line; determining one or more cache lines indicated by the pattern beyond the implicated cache line in the direction; and causing the one or more cache lines to be prefetched into the first-level cache memory.

The invention also discloses a computer program product encoded on at least one computer readable medium and suitable for use with a computing device, the computer program product comprising computer readable program code stored in the medium for defining a microprocessor. The computer readable program code comprises a first code, a second code and a third code. The first code defines a first-level cache memory. The second code defines a second-level cache memory. The third code defines a prefetch unit that detects a direction and a pattern of recent access requests presented to the second-level cache memory, prefetches a plurality of cache lines into the second-level cache memory according to the direction and the pattern, receives from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address implicates a cache line, determines one or more cache lines indicated by the pattern beyond the implicated cache line in the direction, and causes the one or more cache lines to be prefetched into the first-level cache memory.
The invention discloses a microprocessor comprising a cache memory and a prefetch unit. The prefetch unit detects a pattern of memory access requests to a first memory block and prefetches a plurality of cache lines from the first memory block into the cache memory according to the pattern. The prefetch unit also monitors a new memory access request to a second memory block, determines whether the first memory block is virtually adjacent to the second memory block, determines whether the pattern, when continued from the first memory block into the second memory block, predicts that the new memory access request implicates a cache line within the second memory block, and prefetches cache lines from the second memory block into the cache memory according to the pattern.

The invention also discloses a data prefetch method for prefetching data into a cache memory of a microprocessor. The data prefetch method comprises: detecting a pattern of memory access requests to a first memory block, and prefetching cache lines from the first memory block into the cache memory according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block, and whether the pattern, when continued from the first memory block into the second memory block, predicts that the new memory access request implicates a cache line within the second memory block; and, in response to the determining steps, prefetching a plurality of cache lines from the second memory block into the cache memory according to the pattern.
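The summary above does not specify the hardware mechanism used to decide that two memory blocks are virtually adjacent. As a minimal illustration only (the function name and the representation of blocks by their block numbers are assumptions, not part of the patent), two blocks can be treated as adjacent when their block numbers differ by one in the detected direction:

```python
def virtually_adjacent(block_num_a, block_num_b, direction):
    """Illustrative adjacency test: block_b immediately follows block_a
    in the given direction (+1 for an upward trend, -1 for downward).
    The text above does not specify how adjacency is determined in hardware."""
    return block_num_b == block_num_a + direction
```

Under this simplification, a pattern detected in block A would only be considered for continuation into block B when the test returns true for the detected direction.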
The invention also discloses a computer program product encoded on at least one computer readable medium and suitable for use with a computing device, the computer program product comprising computer readable program code stored on the medium for defining a microprocessor. The computer readable program code comprises a first code and a second code. The first code defines a cache memory. The second code defines a prefetch unit that detects a pattern of access requests to a first memory block, prefetches cache lines from the first memory block into the cache memory according to the pattern, monitors a new access request to a second memory block, determines whether the first memory block is virtually adjacent to the second memory block and whether the pattern, when continued from the first memory block into the second memory block, predicts that the new access request implicates a cache line within the second memory block, and responsively prefetches cache lines from the second memory block into the cache memory according to the pattern.

[Embodiment] The manufacture and use of various embodiments of the present invention are discussed in detail below. It should be noted, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts; the specific embodiments discussed merely illustrate ways to make and use the invention and are not intended to limit its scope.

The problem addressed by the prefetch unit described herein may be explained as follows. When all accesses to a memory block (whether loads, stores, or prefetch requests) are plotted on a graph as they occur, a bounding box can be drawn around the resulting points, and the box must be updated to enclose each new access request as it arrives, as illustrated in the first figure.
The first figure graphs accesses to a 4 KB memory block: the X axis denotes the time of each access, and the Y axis denotes the index of the cache line within the block that is accessed. The first access is to cache line 5 and the second access is to cache line 6, and the bounding box circumscribes the two points. When a third access request arrives, the box grows to enclose the new point, and as new access requests continue to occur the bounding box must keep growing accordingly (upward, in this example). The upper and lower edges of the bounding box thus form the history used to determine whether the access pattern is trending upward, downward, or neither.

In addition to tracking the upper and lower edges of the bounding box to determine a trend direction, it is also necessary to track the individual access requests, because accesses that skip one or two cache lines occur frequently. Therefore, to avoid prefetching cache lines that will be skipped over, once an upward or downward trend has been detected the prefetch unit uses additional criteria to determine which cache lines to prefetch. Because the access requests tend to arrive reordered in time, the prefetch unit discards the transient ordering history. It does so by marking bits in a bit mask, in which each bit corresponds to one cache line of the memory block; a set bit indicates that the corresponding cache line has been accessed. Once the number of access requests to the memory block is sufficient, the prefetch unit uses the bit mask, which carries no indication of the temporal ordering of the accesses, to make prefetch decisions for the entire block based on this larger view of the access history, rather than on the smaller temporal view of individual accesses on which conventional prefetch units rely.
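The tracking state described above can be sketched in software. The following Python sketch is illustrative only (the class name `BlockTracker` is not from the patent, and the initialization of the pointers on the first access is simplified relative to the flowcharts described later): it maintains the per-line bit mask, the minimum and maximum pointers, and the counters of changes to each, for one 64-line memory block.

```python
class BlockTracker:
    """Tracks accesses to one memory block of 64 cache lines (illustrative)."""

    def __init__(self):
        self.bitmask = [False] * 64   # one bit per cache line; True = accessed
        self.min_ptr = None           # lowest cache-line index accessed so far
        self.max_ptr = None           # highest cache-line index accessed so far
        self.min_changes = 0          # times the lower edge of the box moved
        self.max_changes = 0          # times the upper edge of the box moved
        self.total = 0                # total accesses tracked

    def access(self, line_index):
        """Record one access to the cache line at `line_index` in the block."""
        self.total += 1
        self.bitmask[line_index] = True   # temporal ordering is discarded
        if self.min_ptr is None:          # first access initializes both edges
            self.min_ptr = self.max_ptr = line_index
            return
        if line_index > self.max_ptr:
            self.max_ptr = line_index
            self.max_changes += 1
        elif line_index < self.min_ptr:
            self.min_ptr = line_index
            self.min_changes += 1

    @property
    def middle(self):
        """Middle index between the lower and upper edges."""
        return (self.min_ptr + self.max_ptr) // 2
```

A markedly larger `max_changes` than `min_changes` then corresponds to an upward-growing bounding box, and vice versa, while the bit mask records which lines were touched regardless of order.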
Figure 2 is a block diagram of a microprocessor 100 according to the present invention. The microprocessor 100 has a pipeline comprising multiple stages and various functional units. The pipeline includes an instruction cache 102 coupled to an instruction decoder 104; the instruction decoder 104 is coupled to a register alias table 106 (RAT); the register alias table 106 is coupled to reservation stations 108; the reservation stations 108 are coupled to execution units 112; finally, the execution units 112 are coupled to a retirement unit 114. The instruction decoder 104 may include an instruction translator for translating macroinstructions (e.g., macroinstructions of the x86 architecture) into microinstructions of a RISC-like microinstruction set of the microprocessor 100. The reservation stations 108 issue instructions to the execution units 112 for execution out of program order. The retirement unit 114 includes a reorder buffer that enforces retirement of instructions in program order. The execution units 112 include a load/store unit 134 and other execution units 132, such as an integer unit, a floating-point unit, a branch unit, or a Single Instruction Multiple Data (SIMD) unit. The load/store unit 134 reads data from, and writes data to, a first-level data cache 116. A second-level cache memory 118 backs the first-level data cache 116 and the instruction cache 102. The second-level cache memory 118 reads and writes system memory via a bus interface unit 122.
The bus interface unit 122 interfaces the microprocessor 100 to a bus (e.g., a local bus or a memory bus). The microprocessor 100 also includes a prefetch unit 124, which prefetches data from system memory into the second-level cache memory 118 and/or the first-level data cache 116.

Figure 3 shows a more detailed block diagram of the prefetch unit 124 of Figure 2. The prefetch unit 124 includes a block bit mask register 302. Each bit of the block bit mask register 302 corresponds to one cache line of a memory block whose block number is stored in a block number register 303; that is, the block number register 303 stores the upper address bits of the memory block. A bit of the block bit mask register 302 whose value is true indicates that the corresponding cache line has been accessed; the block bit mask register 302 is initialized with all bits false. In one embodiment the size of a memory block is 4 KB and the size of a cache line is 64 bytes; the block bit mask register 302 therefore holds 64 bits. In some embodiments the size of a memory block may equal the size of a physical memory page, and the cache line size may be other sizes in other embodiments. Furthermore, the size of the memory region over which the block bit mask register 302 is maintained is variable and need not correspond to the size of a physical memory page; rather, the region (or block) tracked by the block bit mask register 302 may be of any size (a power of two is preferable), as long as it contains enough cache lines to permit detection of a direction and a pattern.

The prefetch unit 124 also includes a minimum pointer register 304 and a maximum pointer register 306. The minimum pointer register 304 and the maximum pointer register 306 continuously point to the indices of the lowest and highest cache lines that have been accessed within the memory block since the prefetch unit 124 began tracking accesses to the block. The prefetch unit 124 further includes a minimum change counter 308 and a maximum change counter 312, which respectively count the number of times the minimum pointer register 304 and the maximum pointer register 306 have changed since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also includes a total counter 314, which counts the total number of cache lines accessed since the prefetch unit 124 began tracking accesses to the memory block, and a middle pointer register 316, which points to the middle index between the minimum pointer register 304 and the maximum pointer register 306 (i.e., their average) since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also includes a direction register 342, a pattern register 344, a pattern period register 346, a pattern location register 348 and a search pointer register 352, whose functions are described below.

The prefetch unit 124 also includes a plurality of period match counters 318. Each period match counter 318 maintains a count for a different period. In one embodiment the periods are 3, 4 and 5, where the period is the number of bits to the left/right of the middle pointer register 316 that are compared.
The period match counters 318 are updated after each memory access to the block. If the block bit mask register 302 indicates that the accesses to the left of the middle pointer register 316 over the associated period match the accesses to the right of the middle pointer register 316, the prefetch unit 124 increments the period match counter 318 associated with that period. The application and operation of the period match counters 318 are described in more detail below, particularly with respect to Figures 4 and 5.

The prefetch unit 124 also includes a prefetch request queue 328, a pop pointer 324 and a push pointer 326. The prefetch request queue 328 comprises a circular queue of entries, each of which stores a prefetch request generated by the operation of the prefetch unit 124 (described particularly with respect to Figures 4, 6 and 7). The push pointer 326 indicates the next entry to be allocated in the prefetch request queue 328; the pop pointer 324 indicates the next entry to be removed from the prefetch request queue 328. In one embodiment, because prefetch requests may complete out of order, the prefetch request queue 328 is capable of popping completed entries out of order. In one embodiment, the prefetch request queue 328 is sized such that all of its entries may flow down the tag pipeline of the second-level cache memory 118, i.e., the number of entries in the prefetch request queue 328 is at least as large as the number of pipeline stages in the second-level cache memory 118. A prefetch request is maintained until it reaches the end of the pipeline of the second-level cache memory 118, at which point the request has one of three outcomes, described in more detail with respect to Figure 7: a hit in the second-level cache memory 118, a replay, or the pushing of a request into a fill queue to prefetch the required data from system memory.
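The period-matching step just described can be sketched as follows. For each candidate period N, the N mask bits immediately to the left of the middle pointer are compared with the N bits immediately to the right; a match increments that period's counter. This is a simplified software model (the candidate periods follow the embodiment above, but the exact boundary handling at the edges of the mask is an assumption):

```python
CANDIDATE_PERIODS = (3, 4, 5)  # periods tracked in one embodiment

def update_period_matches(bitmask, middle, counters):
    """Increment counters[N] when the N bits left of `middle` in the
    block bit mask equal the N bits at and right of `middle`."""
    for n in CANDIDATE_PERIODS:
        left = bitmask[max(middle - n, 0):middle]
        right = bitmask[middle:middle + n]
        # only compare when both windows fit entirely inside the block
        if len(left) == n and len(right) == n and left == right:
            counters[n] += 1
```

Calling this after every access accumulates, per period, how often the access history repeats with that period around the middle of the bounding box.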
The prefetch unit 124 also includes control logic 322, which controls the various components of the prefetch unit 124 to perform their functions.

Although Figure 3 shows only one set of hardware 332 associated with one active memory block (the block bit mask register 302, block number register 303, minimum pointer register 304, maximum pointer register 306, minimum change counter 308, maximum change counter 312, total counter 314, middle pointer register 316, pattern period register 346, pattern location register 348 and search pointer register 352), the prefetch unit 124 may include a plurality of the hardware sets 332 shown in Figure 3 in order to track accesses to multiple active memory blocks.

In one embodiment, the microprocessor 100 also includes one or more highly reactive prefetch units (not shown). The highly reactive prefetch units use different algorithms that prefetch based on a very small temporal sample of accesses, and they operate in conjunction with the prefetch unit 124, as described below. Because the prefetch unit 124 described herein analyzes a larger number of memory accesses than a highly reactive prefetch unit, it necessarily tends to take longer to begin prefetching from a new memory block, as described below, but it is more accurate than the highly reactive prefetch unit. Therefore, by operating the highly reactive prefetch unit and the prefetch unit 124 simultaneously, the microprocessor 100 enjoys both the faster reaction time of the highly reactive prefetch unit and the higher accuracy of the prefetch unit 124. In addition, the prefetch unit 124 may monitor requests from the other prefetch units and use those requests in its own prefetch algorithm.

Figure 4 is a flowchart of the operation of the microprocessor 100 of Figure 2, and in particular of the prefetch unit 124 of Figure 3. The flow begins at step 402.
In step 402, the prefetch unit 124 receives a load/store memory access request to a memory address. In one embodiment, the prefetch unit 124 distinguishes between load and store memory access requests when determining which cache lines to prefetch; in other embodiments, it does not distinguish loads from stores when determining which cache lines to prefetch. In one embodiment, the prefetch unit 124 receives the memory access requests output by the load/store unit 134. The prefetch unit 124 may receive memory access requests from various sources, including but not limited to the load/store unit 134, the first-level data cache 116 (e.g., allocation requests generated when a load/store unit 134 access misses in the first-level data cache 116), and/or other sources, such as other prefetch units (not shown) of the microprocessor 100 that employ prefetch algorithms different from that of the prefetch unit 124. The flow proceeds to step 404.

In step 404, the control logic 322 compares the memory access address with the value of each block number register 303 to determine whether the access falls within an active memory block, i.e., whether a hardware set 332 of Figure 3 has already been allocated to the memory block implicated by the address specified by the memory access request. If not, the flow proceeds to step 406.

In step 406, the control logic 322 allocates a hardware set 332 of Figure 3 to the implicated memory block. In one embodiment, the control logic 322 allocates the hardware sets 332 in a round-robin manner; in other embodiments, the control logic 322 maintains least-recently-used information for the hardware sets 332 and allocates on a least-recently-used basis. In addition, the control logic 322 initializes the allocated hardware set 332. In particular, the control logic 322 clears all bits of the block bit mask register 302, populates the block number register 303 with the upper bits of the memory access address, and clears the minimum pointer register 304, the maximum pointer register 306, the minimum change counter 308, the maximum change counter 312, the total counter 314 and the period match counters 318. The flow proceeds to step 408.

In step 408, the control logic 322 updates the hardware set 332 based on the memory access address, as described with respect to Figure 5. The flow proceeds to step 412.

In step 412, the control logic 322 examines the total counter 314 to determine whether the program has made enough accesses to the memory block. In one embodiment, the control logic 322 determines whether the value of the total counter 314 is greater than a predetermined amount, which in one embodiment is ten, although other values are contemplated and the invention is not limited in this respect. If enough accesses have been made, the flow proceeds to step 414; otherwise, the flow ends.

In step 414, the control logic 322 determines whether the access requests recorded in the block bit mask register 302 exhibit a clear trend, i.e., whether the accesses trend clearly upward (increasing access addresses) or downward (decreasing access addresses). In one embodiment, the control logic 322 determines whether a clear trend exists based on whether the difference between the minimum change counter 308 and the maximum change counter 312 is greater than a predetermined amount.
In one embodiment the predetermined amount is two, although other embodiments may use other values. When the count of the minimum change counter 308 exceeds the count of the maximum change counter 312 by the predetermined amount, there is a clear downward trend; conversely, when the count of the maximum change counter 312 exceeds the count of the minimum change counter 308 by the predetermined amount, there is a clear upward trend. If a clear trend has emerged, the flow proceeds to step 416; otherwise, the flow ends.

In step 416, the control logic 322 determines whether there is a clear pattern period winner among the access requests recorded in the block bit mask register 302. In one embodiment, the control logic 322 determines whether a clear pattern period winner exists based on whether the difference between the count of one of the period match counters 318 and the counts of all the other period match counters 318 is greater than a predetermined amount; in one embodiment the predetermined amount is two, although other embodiments may use other values. The updating of the period match counters 318 is described in detail with respect to Figure 5. If a clear pattern period winner has emerged, the flow proceeds to step 418; otherwise, the flow ends.

In step 418, the control logic 322 populates the direction register 342 to indicate the clear trend direction determined in step 414. In addition, the control logic 322 populates the pattern period register 346 with the clear winning pattern period (N) detected in step 416. Finally, the control logic 322 populates the pattern register 344 with the N bits of the block bit mask register 302 to the right or left of the middle pointer register 316 that matched in step 518 of Figure 5. The flow proceeds to step 422.
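The two screening tests of steps 414 and 416 reduce to simple threshold comparisons, with a threshold of two in the embodiment described. A minimal sketch (function names are illustrative, and the use of an inclusive comparison against the threshold is an assumption):

```python
def clear_direction(min_changes, max_changes, threshold=2):
    """Step 414: return 'up', 'down', or None if no clear trend exists."""
    if max_changes - min_changes >= threshold:
        return "up"      # the upper edge of the bounding box moved far more often
    if min_changes - max_changes >= threshold:
        return "down"    # the lower edge moved far more often
    return None

def pattern_period_winner(counters, threshold=2):
    """Step 416: return the period whose match counter exceeds every
    other period's counter by at least `threshold`, else None."""
    for n, count in counters.items():
        if all(count - other >= threshold
               for m, other in counters.items() if m != n):
            return n
    return None
```

Only when both tests return a non-None result does the flow populate the direction, pattern period, and pattern registers and proceed to prefetching.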
In step 422, the control logic 322 prefetches the not-yet-fetched cache lines of the memory block according to the detected direction and pattern, as described with respect to Figure 6. The flow ends at step 422.

Figure 5 is a flowchart of the operation of the prefetch unit 124 of Figure 3 in performing step 408 of Figure 4. The flow begins at step 502.

In step 502, the control logic 322 increments the total counter 314. The flow proceeds to step 504.

In step 504, the control logic 322 determines whether the index, within the memory block, of the cache line implicated by the current memory access address is greater than the maximum pointer register 306. If so, the flow proceeds to step 506; otherwise, the flow proceeds to step 508.

In step 506, the control logic 322 updates the maximum pointer register 306 with the index, within the memory block, of the cache line implicated by the current memory access address, and increments the maximum change counter 312. The flow proceeds to step 514.

In step 508, the control logic 322 determines whether the index, within the memory block, of the cache line implicated by the current memory access address is less than the minimum pointer register 304. If so, the flow proceeds to step 512; otherwise, the flow proceeds to step 514.

In step 512, the control logic 322 updates the minimum pointer register 304 with the index, within the memory block, of the cache line implicated by the current memory access address, and increments the minimum change counter 308. The flow proceeds to step 514.

In step 514, the control logic 322 computes the average of the minimum pointer register 304 and the maximum pointer register 306, and updates the middle pointer register 316 with the computed average.
The flow proceeds to step 516.

In step 516, the control logic 322 examines the block bit mask register 302, centered on the middle pointer register 316, over the N bits to the left and the N bits to the right of the middle pointer register 316, for each period N associated with a period match counter 318. The flow proceeds to step 518.

In step 518, the control logic 322 determines whether the N bits to the left of the middle pointer register 316 match the N bits to the right of the middle pointer register 316. If so, the flow proceeds to step 522; otherwise, the flow ends.

In step 522, the control logic 322 increments the count of the period match counter 318 associated with period N. The flow ends at step 522.

Figure 6 is a flowchart of the operation of the prefetch unit 124 of Figure 3 in performing step 422 of Figure 4. The flow begins at step 602.

In step 602, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to one pattern period away from the middle pointer register 316 in the detected direction. That is, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to the value of the middle pointer register 316 plus or minus the detected period (N). For example, when the value of the middle pointer register 316 is 16, N is 5 and the trend indicated by the direction register 342 is upward, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to 21. Thus, in this case, for comparison purposes the 5 bits of the pattern register 344 are aligned against bits 21 through 25 of the block bit mask register 302, as described below. The flow proceeds to step 604.
In step 604, the control logic 322 examines the bit of the block bit mask register 302 at the search pointer register 352, together with the corresponding bit of the pattern register 344 (which is aligned against the block bit mask register 302 at the location held in the pattern location register 348), in order to predict whether the corresponding cache line in the memory block is needed. The flow proceeds to step 606.

In step 606, the control logic 322 predicts whether the examined cache line is needed. The control logic 322 predicts that the cache line is needed when the bit of the pattern register 344 is true. If the cache line is predicted to be needed, the flow proceeds to step 614; otherwise, the flow proceeds to step 608.

In step 608, the control logic 322 determines, according to the direction register 342, whether there are any more unexamined cache lines in the memory block, i.e., whether the end of the block bit mask register 302 has been reached. If there are no unexamined cache lines, the flow ends; otherwise, the flow proceeds to step 612.

In step 612, the control logic 322 increments/decrements the search pointer register 352 according to the direction register 342. In addition, if the search pointer register 352 has passed beyond the last bit of the pattern register 344, the control logic 322 updates the pattern location register 348 with the new value of the search pointer register 352, i.e., it shifts the pattern register 344 to the new location of the search pointer register 352. The flow proceeds to step 604.

In step 614, the control logic 322 determines whether the needed cache line has already been fetched. The control logic 322 determines that the needed cache line has already been fetched when the corresponding bit of the block bit mask register 302 is true. If the needed cache line has already been fetched, the flow proceeds to step 608; otherwise, the flow proceeds to step 616.
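The prediction walk of steps 602 through 616 can be modelled in software. The following Python sketch is illustrative only: the queue handling of steps 618 through 624 is reduced to collecting line indices, the distance test of step 616 is simplified to an absolute cutoff, and the re-anchoring of the pattern in step 612 is modelled by repeating the pattern every period.

```python
def predict_prefetches(bitmask, pattern, middle, direction, max_ptr, min_ptr,
                       distance_limit=16):
    """Return the cache-line indices predicted to need prefetching.
    direction is +1 (upward trend) or -1 (downward trend)."""
    period = len(pattern)
    start = middle + direction * period          # step 602 initialization
    out = []
    search = start
    while 0 <= search < len(bitmask):            # step 608: stay inside the block
        # the pattern repeats every `period` lines (step 612 re-anchoring)
        pat_bit = pattern[((search - start) * direction) % period]
        if pat_bit and not bitmask[search]:      # steps 606/614: needed, not fetched
            # step 616: stop once too far beyond the accessed edge of the block
            edge = max_ptr if direction > 0 else min_ptr
            if abs(search - edge) > distance_limit:
                break
            out.append(search)
        search += direction
    return out
```

With the running example (middle pointer 16, period 5, upward direction, pattern bits aligned starting at bit 21), lines already marked in the bit mask are skipped and the walk stops once it runs the distance limit past the maximum pointer.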
In decision step 616, if the direction register 342 indicates downward, the control logic 322 determines whether the implicated cache line is more than a predetermined amount (sixteen, in one embodiment) away from the minimum pointer register 304; if the direction register 342 indicates upward, the control logic 322 determines whether the implicated cache line is more than the predetermined amount away from the maximum pointer register 306. If the control logic 322 determines that the relevant condition is true, the flow ends; otherwise, the flow proceeds to decision step 618. It is worth noting that although the flow ends when the cache line is significantly beyond (away from) the minimum pointer register 304/maximum pointer register 306, this does not mean that the prefetch unit 124 will not prefetch other cache lines of the memory block: subsequent accesses to cache lines of the memory block will, per Figure 4, trigger further prefetch operations.

In step 618, the control logic 322 determines whether the prefetch request queue 328 is full. If the prefetch request queue 328 is full, the flow proceeds to step 622; otherwise, the flow proceeds to step 624.

In step 622, the control logic 322 stalls until the prefetch request queue 328 is non-full. The flow proceeds to step 624.

In step 624, the control logic 322 pushes an entry into the prefetch request queue 328 to prefetch the cache line. The flow proceeds to step 608.

Figure 7 is a flowchart of the operation of the prefetch request queue 328 of Figure 3. The flow begins at step 702.

In step 702, a prefetch request that was pushed into the prefetch request queue 328 at step 624 arbitrates for access to the second-level cache memory 118, is granted access, and proceeds down the pipeline of the second-level cache memory 118.
The flow proceeds to step 704.

In step 704, the second-level cache memory 118 determines whether the cache line address hits in the second-level cache memory 118. If the cache line address hits in the second-level cache memory 118, the flow proceeds to step 706; otherwise, the flow proceeds to decision step 708.

In step 706, because the cache line is already present in the second-level cache memory 118, there is no need to prefetch it, and the flow ends.

In step 708, the control logic 322 determines whether the response of the second-level cache memory 118 is that the prefetch request must be replayed. If so, the flow proceeds to step 712; otherwise, the flow proceeds to step 714.

In step 712, the prefetch request to prefetch the cache line is re-pushed into the prefetch request queue 328. The flow ends at step 712.

In step 714, the second-level cache memory 118 pushes a request into a fill queue (not shown) of the microprocessor 100, requesting the bus interface unit 122 to read the cache line into the microprocessor 100. The flow ends at step 714.

An example of the operation of the microprocessor 100 of Figure 2 is shown in Figure 9. Figure 9 shows, after the first, second and tenth of ten accesses to a memory block, the contents of the block bit mask register 302 (an asterisk at a bit position indicates an access to the corresponding cache line), the minimum change counter 308, the maximum change counter 312 and the total counter 314. In Figure 9, the minimum change counter 308 is denoted "cntr_min_change", the maximum change counter 312 is denoted "cntr_max_change", and the total counter 314 is denoted "cntr_total". The position of the middle pointer register 316 is indicated by an "M" in Figure 9.
Since the first access, to address 0x4dced300 (step 402 of Figure 4), falls on the cache line at index 12 within the memory block, the control logic 322 sets bit 12 of the block bitmask register 302 (step 408 of Figure 4), as shown. In addition, the control logic 322 updates the minimum change counter 308, the maximum change counter 312, and the total counter 314 (steps 502, 506, and 512 of Figure 5).

Since the second access, to address 0x4dced260, falls on the cache line at index 9 within the memory block, the control logic 322 sets bit 9 of the block bitmask register 302, as shown. In addition, the control logic 322 updates the count values of the minimum change counter 308 and the total counter 314.

On the third through tenth accesses (the third through ninth access addresses are not shown; the tenth access address is 0x4dced6c0), the control logic 322 sets the appropriate bits of the block bitmask register 302, as shown. In addition, the control logic 322 updates the minimum change counter 308, the maximum change counter 312, and the total counter 314 as appropriate for each access.

The bottom of Figure 9 shows the contents of the period match counters 318 after the control logic 322 has performed steps 514 and 522 in response to the tenth memory access. In Figure 9, the period match counter 318 for period N is labeled "cntr_period_N_matches", where N is 2, 3, 4, or 5.
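The per-access bookkeeping just described can be modeled in a few lines of software. The following sketch is our reconstruction for illustration only (the register names and the 4KB-block/64-byte-line geometry are taken from the text; the actual mechanism is hardware, not code):

```python
# Minimal model of the block bitmask 302 and the min/max pointers and
# change counters, updated per access (steps 408 and 502-512).

class BlockTracker:
    def __init__(self):
        self.bitmask = 0          # block bitmask register 302
        self.min_ptr = None       # min pointer register 304
        self.max_ptr = None       # max pointer register 306
        self.min_change = 0       # cntr_min_change 308
        self.max_change = 0       # cntr_max_change 312
        self.total = 0            # total counter 314

    def access(self, address):
        index = (address & 0xFFF) >> 6      # cache-line index within a 4KB block
        self.bitmask |= 1 << index          # step 408: mark the line accessed
        if self.min_ptr is None:            # first access to the block
            self.min_ptr = self.max_ptr = index
        elif index < self.min_ptr:
            self.min_ptr = index            # new minimum seen
            self.min_change += 1
        elif index > self.max_ptr:
            self.max_ptr = index            # new maximum seen
            self.max_change += 1
        self.total += 1                     # step 512

    def middle(self):                       # middle pointer register 316
        return (self.min_ptr + self.max_ptr) // 2

t = BlockTracker()
t.access(0x4dced300)   # first access of the example: cache-line index 12
t.access(0x4dced260)   # second access: index 9, a new minimum
assert (t.min_ptr, t.max_ptr, t.min_change, t.total) == (9, 12, 1, 2)
```

The two assertions reproduce the first two accesses of the Figure 9 example.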
As shown in Figure 9, although the criterion of step 412 is met (the total counter 314 is at least ten) and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least 2), the criterion of step 414 is not met (the difference between the minimum change counter 308 and the maximum change counter 312 is less than 2). Therefore, no prefetching is performed within this memory block at this time.

The bottom of Figure 9 also shows, for periods 3, 4, and 5, the bits to the right and to the left of the middle pointer register 316 that are compared for a period match.

Figure 10 continues, for the microprocessor 100 of Figure 2, the example begun in Figure 9. Figure 10 depicts information similar to Figure 9, but after the eleventh and twelfth accesses to the memory block (the address of the eleventh access is 0x4dced760). As shown, the criterion of step 412 is met (the total counter 314 is at least ten), the criterion of step 414 is met (the difference between the minimum change counter 308 and the maximum change counter 312 is at least 2), and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least 2). Therefore, according to step 418 of Figure 4, the control logic 322 populates the direction register 342 (to indicate that the direction trend is upward), the pattern period register 346 (with the value 5), and the pattern register 344 (with the pattern "* *", i.e., "01010"). According to step 422 of Figure 4 and the flow of Figure 6, the control logic 322 also performs prefetch prediction for the memory block, as shown. Figure 10 also shows the search pointer 352 at bit 21 as a result of the operation of step 602 of Figure 6.
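The three criteria of steps 412, 414, and 416, together with the period-match test around the middle pointer, can be sketched as follows. This is a software model under our interpretation of the text (thresholds of 10 and 2 are taken from the examples; it is not the patent's actual logic):

```python
# Sketch of the pattern-detection decision: enough accesses (step 412),
# a clear direction (step 414), and one period whose match count clearly
# wins (step 416). The period test compares the N bits on each side of
# the middle pointer, as the text describes.

def bits(mask, lo, n):
    return [(mask >> (lo + i)) & 1 for i in range(n)]

def period_matches(mask, middle, period):
    left = bits(mask, middle - period, period)   # N bits left of middle
    right = bits(mask, middle, period)           # N bits right of middle
    return left == right

def detect(total, min_change, max_change, period_counters,
           min_total=10, clear_margin=2):
    if total < min_total:                               # step 412
        return None
    if abs(max_change - min_change) < clear_margin:     # step 414
        return None
    best = max(period_counters, key=period_counters.get)
    others = [v for p, v in period_counters.items() if p != best]
    if others and period_counters[best] - max(others) < clear_margin:
        return None                                     # step 416
    direction = "up" if max_change > min_change else "down"
    return direction, best

# A bitmask whose set bits repeat with period 5:
mask = 0b01010_01010_01010_01010
assert period_matches(mask, 10, 5) is True
assert detect(11, 0, 5, {3: 1, 4: 1, 5: 4}) == ("up", 5)
```

As in Figure 9, a high enough total count alone is not sufficient: without a clear direction or a clearly winning period, `detect` returns nothing and no prefetch is predicted.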
As shown, in each of the examples the search pointer 352 is incremented according to step 612 of Figure 6, and in examples 5 and 10 the pattern location register 348 is updated according to step 612 of Figure 6. In examples 0, 2, 4, 5, 7, 9, and 10, because the bit of the pattern register 344 at the search pointer 352 is false, the pattern indicates that the cache line at the search pointer 352 will not be needed. In examples 1, 3, 6, and 8, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed; however, that cache line has already been fetched, as indicated by the corresponding bit of the block bitmask register 302 being true. Finally, as shown in example 11, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed, but because the corresponding bit of the block bitmask register 302 is false, this cache line has not yet been fetched. Therefore, according to step 624 of Figure 6, the control logic 322 pushes a prefetch request into the prefetch request queue 328 to prefetch the cache line at address 0x4dced800, which corresponds to bit 32 of the block bitmask register 302.

In one embodiment, one or more of the predetermined values described herein may be programmable, either by the operating system (for example, via a model specific register (MSR)) or via fuses of the microprocessor 100 that may be blown during manufacture of the microprocessor.

In one embodiment, the size of the block bitmask register 302 may be reduced in order to save power and die area.
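The Figure 6 walk illustrated by these twelve examples amounts to the following sketch. The distance bound of 16 and the register names come from the text; the upward-only loop and the flat pattern replication are our simplifications, not the patent's RTL:

```python
# Condensed model of the prefetch-prediction walk: replicate the detected
# pattern in the detected direction and queue any line the pattern
# predicts but the bitmask shows as not yet fetched.

def plan_prefetches(bitmask, middle, period, pattern, max_ptr,
                    max_distance=16, lines_per_block=64):
    start = middle + period              # step 602 (upward direction)
    queued = []
    for idx in range(start, lines_per_block):
        if idx - max_ptr > max_distance: # step 616: too far ahead, stop
            break
        predicted = (pattern >> ((idx - start) % period)) & 1
        already = (bitmask >> idx) & 1
        if predicted and not already:    # needed by the pattern, not fetched
            queued.append(idx)           # step 624: push a prefetch request
    return queued

# Pattern "01010" (bits 1 and 3 set), middle 16, period 5 -> walk from 21;
# with an empty bitmask every pattern-predicted line gets queued.
assert plan_prefetches(0, 16, 5, 0b01010, max_ptr=20)[:4] == [22, 24, 27, 29]
```

With the Figure 10 state (middle pointer 16, period 5), the walk starts at index 21 and, as in example 11, index 32 is among the lines predicted once the already-fetched ones are masked out.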
That is to say, the number of bits in each block bitmask would be less than the number of cache lines in the memory block. For example, in one embodiment, the number of bits of each block bitmask register 302 is only half the number of cache lines in the memory block. The block bitmask register 302 then tracks accesses to only the upper half or the lower half of the memory block, depending on which half of the memory block is accessed first, and an additional bit indicates whether the lower half or the upper half of the memory block was accessed first.

In one embodiment, rather than testing the N bits above and below the middle pointer register 316 as described in connection with step 516, the control logic 322 includes a serial engine that scans the block bitmask register 302 one or two bits at a time, looking for patterns with a period larger than the maximum period otherwise considered (5, as described above).

In one embodiment, if no clear direction is detected at step 414, or no clear period is detected at step 416, and the count value of the total counter 314 reaches a predetermined threshold (indicating that most of the cache lines of the memory block have been accessed), the control logic 322 proceeds to prefetch the remaining cache lines of the memory block. The predetermined threshold is a relatively high percentage of the number of cache lines of the memory block, as indicated by the number of set bits of the block bitmask register 302.

Prefetching to both the first-level data cache and the second-level cache

Modern microprocessors include a hierarchy of cache memories. Typically, a microprocessor includes a small and fast first-level data cache and a larger and slower second-level cache, such as the first-level data cache 116 and the second-level cache 118 of Figure 2.
A hierarchical cache memory arrangement raises the question of which cache level to prefetch data into in order to improve the cache hit rate. Since the first-level data cache 116 is faster, it would seem preferable to prefetch data into the first-level data cache 116. However, since the capacity of the first-level data cache 116 is small, the cache hit rate may actually become worse if the prefetch unit does not prefetch data into the first-level data cache 116 accurately, because data that turns out not to be needed displaces other data that is needed.

Therefore, whether data should be prefetched into the first-level data cache 116 or into the second-level cache 118 is a function of whether the prefetcher can correctly predict which data will be needed. Because a prefetch unit attached to the first-level data cache 116 must be small, it tends to have poor accuracy; in contrast, a prefetch unit associated with the second-level cache 118 may, like the second-level cache's tag and data arrays, be larger, and can therefore achieve better accuracy.

An advantage of the microprocessor 100 according to the embodiments described here is that the prefetch unit 124 bases its prefetching for both the second-level cache 118 and the first-level data cache 116 on the accesses of the load/store unit 134. The embodiments exploit the higher accuracy attainable at the second-level cache 118 to address the problem, described above, of prefetching into the first-level data cache 116.
Furthermore, the embodiments employ a single body of logic to handle prefetching into both the first-level data cache 116 and the second-level cache 118.

Figure 12 shows a microprocessor 100 in accordance with various embodiments of the invention. The microprocessor 100 of Figure 12 is similar to the microprocessor 100 of Figure 2 and has additional features described below.

The first-level data cache 116 provides a first-level data memory address 196 to the prefetch unit 124. The first-level data memory address 196 is the physical address of a load/store access of the load/store unit 134 to the first-level data cache 116. That is, the prefetch unit 124 eavesdrops as the load/store unit 134 accesses the first-level data cache 116. The prefetch unit 124 provides pattern-predicted cache line addresses 194 to a queue 198 of the first-level data cache 116; each pattern-predicted cache line address 194 is the address of a cache line that the prefetch unit 124, based on the first-level data memory addresses 196, predicts the load/store unit 134 is about to request from the first-level data cache 116. The first-level data cache 116 issues cache line allocation requests 192 to the prefetch unit 124 to request from the second-level cache 118 the cache lines whose addresses are stored in the queue 198. Finally, the second-level cache 118 provides the requested cache line data 188 to the first-level data cache 116.

The prefetch unit 124 also includes a first-level data search pointer 172 and a first-level data pattern address 178, as shown in Figure 12.
The use of the first-level data search pointer 172 and the first-level data pattern address 178 is described below in connection with Figure 14.

Figure 13 is a flowchart of the operation of the prefetch unit 124 of Figure 12. The flow begins at step 1302.

In step 1302, the prefetch unit 124 receives the first-level data memory address 196 of Figure 12 from the first-level data cache 116. The flow proceeds to step 1304.

In step 1304, the prefetch unit 124 detects that the first-level data memory address 196 falls within a memory block (for example, a page) for which it has previously detected an access pattern and has begun prefetching cache lines from system memory into the second-level cache 118, as described above in connection with Figures 1 through 11. More specifically, because the access pattern has been detected, the prefetch unit 124 maintains the block number register 303, which specifies the base address of the memory block. The prefetch unit 124 detects whether the relevant bits of the block number register 303 match the corresponding bits of the first-level data memory address 196, in order to determine whether the first-level data memory address 196 falls within the memory block. The flow proceeds to step 1306.

In step 1306, starting from the first-level data memory address 196, the prefetch unit 124 searches in the previously detected access direction for the next two cache lines within the memory block that are implicated by the previously detected access pattern. The operation of step 1306 is described in more detail below in connection with Figure 14. The flow proceeds to step 1308.

In step 1308, the prefetch unit 124 provides the physical addresses of the two cache lines found in step 1306 to the first-level data cache 116 as pattern-predicted cache line addresses 194.
In other embodiments, the number of cache line addresses provided by the prefetch unit 124 may be more or fewer than two. The flow proceeds to step 1312.

In step 1312, the first-level data cache 116 pushes the addresses provided in step 1308 into the queue 198. The flow proceeds to step 1314.

In step 1314, whenever the queue 198 is non-empty, the first-level data cache 116 takes the next address out of the queue 198 and issues a cache line allocation request 192 to the second-level cache 118 in order to obtain the cache line at that address. However, if an address taken from the queue 198 is already present in the first-level data cache 116, the first-level data cache 116 discards the address and forgoes requesting its cache line from the second-level cache 118. The second-level cache 118 then provides the requested cache line data 188 to the first-level data cache 116. The flow ends at step 1314.

Figure 14 is a flowchart of the operation of the prefetch unit 124 of Figure 12 according to step 1306 of Figure 13. The description of Figure 14 assumes that the detected direction is upward (and similarly below); however, if the detected direction is downward, the prefetch unit 124 is also configured to perform the corresponding function. Broadly, the operation places the pattern register 344 of Figure 3 at the appropriate location within the memory block so that the prefetch unit 124 can search, starting from the first-level data memory address 196, for the next two cache lines indicated by the pattern register 344, replicating the pattern register 344 across the memory block as needed. The flow begins at step 1402.

In step 1402, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 of Figure 12 in a manner similar to the initialization of the search pointer 352 and the pattern location register 348 at step 602 of Fig.
6, namely to the sum of the middle pointer register 316 and the pattern period register 346. For example, if the middle pointer register 316 has the value 16, the pattern period register 346 has the value 5, and the direction register 342 indicates the upward direction, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 to 21. The flow proceeds to step 1404.

In step 1404, the prefetch unit 124 determines whether the first-level data memory address 196 falls within the pattern register 344 at its currently specified position; the current position of the pattern is determined initially according to step 1402 and may be updated according to step 1406. That is, the prefetch unit 124 determines whether the value of the appropriate bits of the first-level data memory address 196 (i.e., excluding the bits that identify the memory block and the bits that specify the byte offset within the cache line) is greater than or equal to the value of the first-level data search pointer 172 and less than or equal to the sum of the value of the first-level data search pointer 172 and the value of the pattern period register 346. If the first-level data memory address 196 falls within the pattern register 344, the flow proceeds to step 1408; otherwise the flow proceeds to step 1406.

In step 1406, the prefetch unit 124 increments the first-level data search pointer 172 and the first-level data pattern address 178 by the value of the pattern period register 346. Under the operation described at step 1406 (and at the subsequent step 1418), the search ends if the first-level data search pointer 172 reaches the end of the memory block. The flow returns to step 1404.
In step 1408, the prefetch unit 124 sets the value of the first-level data search pointer 172 to the offset, within the memory page, of the cache line implicated by the first-level data memory address 196. The flow proceeds to step 1412.

In step 1412, the prefetch unit 124 tests the bit of the pattern register 344 at the position indicated by the first-level data search pointer 172. The flow proceeds to step 1414.

In step 1414, the prefetch unit 124 determines whether the bit tested in step 1412 is set. If the bit tested in step 1412 is set, the flow proceeds to step 1416; otherwise the flow proceeds to step 1418.

In step 1416, the prefetch unit 124 marks the cache line predicted at step 1414 as ready to have its physical address sent to the first-level data cache 116 as a pattern-predicted cache line address 194. The flow ends at step 1416.

In step 1418, the prefetch unit 124 increments the value of the first-level data search pointer 172. In addition, if the first-level data search pointer 172 has passed beyond the last bit of the pattern register 344, the prefetch unit 124 updates the first-level data pattern address 178 with the new value of the first-level data search pointer 172, i.e., shifts the pattern register 344 to the position of the new first-level data search pointer 172. The operations of steps 1412 through 1418 repeat until the two cache lines (or some other predetermined number of cache lines) are found. The flow ends at step 1418.

A benefit of prefetching cache lines into the first-level data cache 116 according to Figure 13 is that it requires relatively few changes to the first-level data cache 116 and the second-level cache 118. However, in other embodiments, the prefetch unit 124 may instead not provide the pattern-predicted cache line addresses 194 to the first-level data cache 116.
For example, in one embodiment, the prefetch unit 124 directly requests the bus interface unit 122 to obtain the cache lines from memory and then writes the received cache lines into the first-level data cache 116. In another embodiment, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains a missing cache line from memory), the second-level cache 118 provides the data to the prefetch unit 124, and the prefetch unit 124 writes the received cache lines into the first-level data cache 116. In yet other embodiments, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains a missing cache line from memory), and the second-level cache 118 writes the cache lines directly into the first-level data cache 116.

An advantage of the various embodiments described above is that a single prefetch unit 124 serves as the basis for prefetching into both the second-level cache 118 and the first-level data cache 116. Although Figures 2, 12, and 15 (discussed below) show the prefetch unit 124 as a distinct block, the prefetch unit 124 may physically occupy a position adjacent to the tag and data arrays of the second-level cache 118 and may conceptually be included within the second-level cache 118, as shown in Figure 2. These embodiments allow the prefetch unit 124 the large space arrangement needed to achieve its accuracy, while applying a single body of logic to handle prefetching for both the first-level data cache 116 and the second-level cache 118, thereby addressing the prior-art problem that accurate prefetching could not be directed at the smaller-capacity first-level data cache 116.
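The Figure 14 search that produces the pattern-predicted addresses for the first-level data cache can be approximated as follows. This is a simplified software model under our reading of steps 1402 through 1418 (the step 1408 handling is condensed into anchoring the pattern window), not the actual hardware:

```python
# Sketch: after an L1 data access, slide the detected pattern up the
# memory block until it covers the accessed line, then return the next
# `wanted` cache-line indices the pattern predicts.

def next_predicted_lines(pattern, period, middle, l1d_index,
                         wanted=2, lines_per_block=64):
    search = middle + period                 # step 1402 initialization
    # Steps 1404/1406: advance the pattern window by whole periods until
    # it covers the accessed line's index (or the block ends).
    while not (search <= l1d_index <= search + period):
        search += period
        if search >= lines_per_block:
            return []
    pattern_base = search                    # window anchored here
    found = []
    idx = l1d_index + 1
    while idx < lines_per_block and len(found) < wanted:
        if (pattern >> ((idx - pattern_base) % period)) & 1:  # steps 1412-1414
            found.append(idx)                # step 1416: mark for L1D prefetch
        idx += 1                             # step 1418
    return found

# Middle pointer 16, period 5, pattern "01010", access at line 23:
# the window covers lines 21..26 and the next predicted lines are 24 and 27.
assert next_predicted_lines(0b01010, 5, 16, 23) == [24, 27]
```

The two returned indices correspond to the two pattern-predicted cache line addresses 194 handed to the queue 198 at step 1308.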
Prefetch unit with reduced warm-up penalty

The prefetch unit 124 described here detects relatively complex access patterns within a memory block (for example, a physical memory page), which a conventional prefetch unit would not detect. For example, the prefetch unit 124 can detect that a program is accessing a memory block according to a pattern even though the out-of-order execution pipeline of the microprocessor 100 may re-order the memory accesses out of program order, a condition that may cause a conventional prefetch unit to fail to detect the access pattern and consequently not prefetch. This is because the prefetch unit 124 considers the effective accesses to a memory block without regard to their temporal order.

However, in exchange for the ability to recognize more complex and/or re-ordered access patterns, the prefetch unit 124 may take longer than a conventional prefetch unit to detect an access pattern, a time referred to below as the "warm-up time". It is therefore desirable to reduce the warm-up time of the prefetch unit 124.

To do so, the prefetch unit 124 predicts whether a program that was previously accessing one memory block according to a pattern has crossed over into a new memory block that is virtually adjacent to the old memory block, and whether the program will continue to access the new memory block according to the same pattern. If so, the prefetch unit 124 uses the pattern, direction, and other relevant information from the old memory block to speed up detection of the access pattern in the new memory block, that is, to reduce the warm-up time.

Figure 15 shows a block diagram of a microprocessor 100 having a prefetch unit 124. The microprocessor 100 of Figure 15 is similar to the microprocessors 100 of Figures 2 and 12, and has additional features described below.
As described in connection with Figure 3, the prefetch unit 124 includes a plurality of hardware units 332. Compared with Figure 3, each hardware unit 332 of Figure 15 further includes a hashed virtual address of the memory block (HVAMB) field 354 and a status field 356. When initializing an allocated hardware unit 332 at step 406 of Figure 4, the prefetch unit 124 takes the physical block number stored into the block number register 303, translates the physical block number into a virtual address, hashes the virtual address according to the same hash algorithm performed at step 1704 of Figure 17 (described below), and stores the result into the hashed virtual address of the memory block field 354. The status field 356

has three possible values: inactive, active, or probationary, as described below. The prefetch unit 124 also includes a virtual hash table (VHT) 162; the organization and operation of the virtual hash table 162 are described in detail below in connection with Figures 16 through 19.

Figure 16 shows the virtual hash table 162 of Figure 15. The virtual hash table 162 includes a plurality of entries, preferably organized as a queue. Each entry includes a valid bit (not shown) and three fields: a minus-one hashed virtual address 1602 (HVAM1), an unmodified hashed virtual address 1604 (HVAUN), and a plus-one hashed virtual address 1606 (HVAP1). Generation of the values used to populate these fields is described below in connection with Figure 17.

Figure 17 is a flowchart of the operation of the microprocessor 100 of Figure 15. The flow begins at step 1702.

In step 1702, the first-level data cache 116 receives a load/store request from the load/store unit 134; the load/store request includes a virtual address. The flow proceeds to step 1704.

In step 1704, the first-level data cache 116 performs a hash function on selected bits of the virtual address received in step 1702 to generate an unmodified hashed virtual address 1604 (HVAUN). In addition, the first-level data cache 116 adds the memory block size (MBS) to the selected bits of the virtual address received in step 1702 to generate a sum and performs the hash function on the sum to generate a plus-one hashed virtual address 1606 (HVAP1). In addition, the first-level data cache 116 subtracts the memory block size from the selected bits of the virtual address received in step 1702 to generate a difference and performs the hash function on the difference to generate a minus-one hashed virtual address 1602 (HVAM1). In one embodiment, the memory block size is 4KB. In one embodiment, the virtual address is 40 bits, and bits 39:30 and 11:0 of the virtual address are ignored by the hash function. The remaining 18 virtual address bits are "dealt", like cards, across the hash bit positions. The idea is that the lower bits of the virtual address have the highest entropy and the higher bits the lowest entropy; dealing them in this way keeps the entropy level relatively uniform across the hashed bits. In one embodiment, the remaining 18 virtual address bits are hashed down to 6 bits according to the logic of Table 1 below. However, other embodiments may employ different hash algorithms; furthermore, in embodiments where performance dominates space and power-consumption design concerns, no hash may be employed at all. The flow proceeds to step 1706.

assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];

Table 1

In step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, the plus-one hashed virtual address (HVAP1) 1606, and the minus-one hashed virtual address (HVAM1) 1602 generated in step 1704 to the prefetch unit 124. The flow proceeds to step 1708.

In step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the unmodified hashed virtual address (HVAUN) 1604, the plus-one hashed virtual address (HVAP1) 1606, and the minus-one hashed virtual address (HVAM1) 1602 received in step 1706. That is, if the virtual hash table 162 already includes an entry with these HVAUN, HVAP1, and HVAM1 values, the prefetch unit 124 forgoes updating the virtual hash table 162; otherwise, the prefetch unit 124 pushes the unmodified hashed virtual address 1604 (HVAUN), the plus-one hashed virtual address 1606 (HVAP1), and the minus-one hashed virtual address 1602 (HVAM1) into the top entry of the virtual hash table 162 in a first-in-first-out fashion and marks the pushed entry valid. The flow ends at step 1708.

Figure 18 shows the contents of the virtual hash table 162 of Figure 16 after the prefetch unit 124 and the load/store unit 134 have operated as described in connection with Figure 17, where, in response to execution of the program, the load/store unit 134 has proceeded in an upward direction through two memory blocks (denoted A and
A+MBS) and has entered a third memory block (denoted A+2*MBS), so that the prefetch unit 124 has populated the virtual hash table 162 accordingly. Specifically, the virtual hash table 162 entry two from the tail includes the hash of A-MBS in its minus-one hashed virtual address (HVAM1) 1602, the hash of A in its unmodified hashed virtual address (HVAUN) 1604, and the hash of A+MBS in its plus-one hashed virtual address (HVAP1) 1606; the entry one from the tail includes the hash of A in HVAM1 1602, the hash of A+MBS in HVAUN 1604, and the hash of A+2*MBS in HVAP1 1606; and the tail entry (i.e., the most recently pushed entry) includes the hash of A+MBS in HVAM1 1602, the hash of A+2*MBS in HVAUN 1604, and the hash of A+3*MBS in HVAP1 1606.

Figure 19 (comprising Figures 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Figure 15. The flow begins at step 1902.

In step 1902, the first-level data cache 116 sends a new allocation request (AR) to the second-level cache 118. The allocation request implicates a new memory block; that is, the prefetch unit 124 determines that the memory block implicated by the allocation request is new, meaning that no hardware unit 332 has yet been allocated to the memory block implicated by the allocation request, because the prefetch unit 124 has only recently encountered the new memory block. In one embodiment, the allocation request is generated in response to a load/store that misses in the first-level data cache 116 and consequently requests the cache line from the second-level cache 118. The allocation request specifies a physical address; associated with the physical address is a virtual address from which the physical address was translated. The first-level data cache 116 performs the same hash function as at step 1704 of Figure 17 on the virtual address associated with the physical address of the allocation request, in order to generate a hashed virtual address of the allocation request (HVAAR), and provides the hashed virtual address of the allocation request to the prefetch unit 124. The flow proceeds to step 1903.

In step 1903, the prefetch unit 124 allocates a new hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates an inactive hardware unit 332 to the new memory block; otherwise, it allocates the least-recently-used hardware unit 332. In one embodiment, once the prefetch unit 124 has prefetched all of the cache lines of a memory block indicated by the pattern, the prefetch unit 124 inactivates the associated hardware unit 332. In one embodiment, the prefetch unit 124 can pin a hardware unit 332 so that it is ineligible to be reset even if it becomes the least-recently-used hardware unit 332. For example, if the prefetch unit 124 detects that a predetermined number of prefetches have been performed within the memory block according to the pattern, but the prefetch unit 124 has not yet completed all of the pattern's prefetches for the entire memory block, the prefetch unit 124 may pin the hardware unit 332 associated with the memory block so that it remains ineligible to be reset even if it becomes the least-recently-used hardware unit 332. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (from its original allocation), and when the age reaches a predetermined age threshold, the prefetch unit 124 inactivates the hardware unit 332. In another embodiment, if the prefetch unit 124 finds a virtually adjacent memory block via subsequent steps 1904 through 1926 and has completed the prefetching from the virtually adjacent memory block, the prefetch unit 124 may selectively re-use the hardware unit 332 of the virtually adjacent memory block rather than allocating a new hardware unit 332. In this embodiment, the prefetch unit 124 selectively initializes the various storage elements of the re-used hardware unit 332 (for example, the direction register 342, the pattern register 344, and the pattern location register 348) so as to retain the usable information stored therein. The flow proceeds to step 1904.

In step 1904, the prefetch unit 124 compares the hashed virtual address of the allocation request (HVAAR) generated in step 1902 with the minus-one hashed virtual address 1602 (HVAM1) and the plus-one hashed virtual address 1606 (HVAP1) of each entry of the virtual hash table 162. The prefetch unit 124 performs steps 1904 through 1922 to determine whether an active memory block is virtually adjacent to the new memory block, and performs steps 1924 through 1928 to predict whether the memory accesses will, according to the previously detected access
pattern and direction, continue from the virtually adjacent active memory block into the new memory block; if so, the warm-up time of the prefetch unit 124 is reduced, so that the prefetch unit 124 can begin prefetching the new memory block sooner. The flow proceeds to step 1906.

In step 1906, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address of the allocation request (HVAAR) matches the plus-one hashed virtual address 1606 (HVAP1) of any entry of the virtual hash table 162. If so, the flow proceeds to step 1908; otherwise, the flow proceeds to step 1912.

In step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. The flow proceeds to step 1916.

In step 1912, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address of the allocation request (HVAAR) matches the minus-one hashed virtual address 1602 (HVAM1) of any entry of the virtual hash table 162. If so, the flow proceeds to step 1914; otherwise, the flow ends.

In step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. The flow proceeds to step 1916.

In step 1916, the prefetch unit 124 sets a candidate_hva register (not shown) to the value of the unmodified hashed virtual address 1604 (HVAUN) of the virtual hash table 162 entry determined at step 1906 or step 1912. The flow proceeds to step 1918.

In step 1918, the prefetch unit 124 compares the candidate_hva with the hashed virtual address of the memory block (HVAMB) field 354 of each active memory block in the prefetch unit 124. The flow proceeds to step 1922.

In step 1922, the prefetch unit 124 determines, based on the comparison performed in step 1918, whether the candidate_hva matches any hashed virtual address of the memory block (HVAMB) field 354. If the candidate_hva matches an HVAMB field 354, the flow proceeds to step 1924; otherwise, the flow ends.

In step 1924, the prefetch unit 124 has determined that the matching active memory block found in step 1922 is indeed virtually adjacent to the new memory block. Therefore, the prefetch unit 124 compares the candidate direction (determined at step 1908 or step 1914) with the direction register 342 of the matching active memory block, in order to predict, from the previously detected access pattern and direction, whether the memory accesses will continue from the virtually adjacent active memory block into the new memory block. Specifically, if the candidate direction differs from the direction register 342 of the virtually adjacent memory block, the memory accesses are unlikely to continue from the virtually adjacent memory block into the new memory block according to the previously detected pattern.
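The hashing machinery used throughout this section (the Table 1 XOR fold and the three hashed addresses of step 1704) can be modeled directly in software. This is our reconstruction of logic the patent implements in gates; the example addresses are made up for illustration:

```python
# The Table 1 hash: bits 39:30 and 11:0 of the 40-bit virtual address are
# ignored, and the remaining 18 bits (29..12) are XOR-folded to 6 bits.

MBS = 4096  # memory block size: 4KB in the described embodiment

def hash6(va):
    b = lambda i: (va >> i) & 1
    return (
        (b(29) ^ b(18) ^ b(17)) << 5 |
        (b(28) ^ b(19) ^ b(16)) << 4 |
        (b(27) ^ b(20) ^ b(15)) << 3 |
        (b(26) ^ b(21) ^ b(14)) << 2 |
        (b(25) ^ b(22) ^ b(13)) << 1 |
        (b(24) ^ b(23) ^ b(12))
    )

def hashed_triple(va):
    """HVAM1, HVAUN, HVAP1 for one load/store virtual address (step 1704)."""
    return hash6(va - MBS), hash6(va), hash6(va + MBS)

# Two addresses within the same 4KB block hash identically, so at step
# 1708 a single virtual hash table entry suffices per block:
assert hashed_triple(0x12345040) == hashed_triple(0x12345FC0)
```

Because the low 12 bits are ignored, every access within a block yields the same triple, which is why the duplicate check at step 1708 keeps the table small.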
17 is a flow chart showing the operation of the microprocessor 1 of Fig. 15. The flow begins in step 1702. In step 1702, the first level data cache 116 receives the manned/storage request from the load/store unit 134, the manned/storage request including a virtual address. The flow proceeds to step 17〇4. In step 1704, the first level data cache memory 116 performs a hash function (function) on the bit selected by the material address received in the step to generate an unmodified hash virtual address 1604 (HVAUN). ). In addition, the 'first-level data cache record has been added to the memory block size (MBS) of the U-body 116 to the bit selected in the hash address received in step 17〇2, for generating-plus The total value 'executes &amp; 0608-A43067TW/fmal ', ', month b for the summed value to produce a 201135460 positive 1 hash virtual address 1606 (HVAP1). In addition, the first level data cache δ mnemonic 116 subtracts the size of the s hexamed block from the bit selected by the hash address received at step 1702 to generate a difference, and The difference performs a hash function to produce a negative 1 hash virtual address 1602 (HVAjVH). In one embodiment, the memory block size is 4 ΚΒ. In one embodiment, the virtual address is 40 bits, and the virtual address bits 39:30 and 11: are ignored by the hash function. The remaining 18 virtual address bits are ''dealt''. If the information is already owned, it is processed by the hash bit position. The idea is that the lower bits of the virtual address have the highest degree of chaos. (entropy) and higher bits have the lowest degree of chaos. This method is used to ensure that the entropy level is a more consistent cross-heavy bit. In one embodiment, the remaining virtual address is 18 bits. It is hashed to 6 bits according to the method in Table 1. 
assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];

Table 1

However, other embodiments may employ different hash algorithms; furthermore, embodiments in which performance dominates space and power-consumption design considerations may forgo hashing altogether. Flow proceeds to step 1706.

In step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, the positive-one hashed virtual address (HVAP1) 1606, and the negative-one hashed virtual address (HVAM1) 1602 generated in step 1704 to the prefetch unit 124. Flow proceeds to step 1708.

In step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the HVAUN 1604, HVAP1 1606, and HVAM1 1602 received in step 1706. That is, if the virtual hash table 162 already includes an entry containing the HVAUN 1604, HVAP1 1606, and HVAM1 1602, the prefetch unit 124 forgoes updating the virtual hash table 162. Otherwise, the prefetch unit 124 pushes the HVAUN 1604, HVAP1 1606, and HVAM1 1602 into the top entry of the virtual hash table 162 in first-in-first-out (FIFO) fashion and marks the pushed entry valid. Flow ends at step 1708.

Fig. 18 shows the contents of the virtual hash table 162 of Fig. 16 after the prefetch unit 124 has operated on accesses by the load/store unit 134 according to the description of Fig. 17, in which the load/store unit 134 executes a program that proceeds in an upward direction through two memory blocks (denoted A and A+MBS) and into a third memory block (denoted A+2*MBS), and the prefetch unit 124 has filled the virtual hash table 162 in response. Specifically, the entry second from the end of the virtual hash table 162 includes the hash of A in its negative-one hashed virtual address (HVAM1) 1602 field, the hash of A+MBS in its unmodified hashed virtual address (HVAUN) 1604 field, and the hash of A+2*MBS in its positive-one hashed virtual address (HVAP1) 1606 field; the entry at the end (that is, the most recently pushed entry) includes the hash of A+MBS in its negative-one hashed virtual address (HVAM1) 1602 field, the hash of A+2*MBS in its unmodified hashed virtual address (HVAUN) 1604 field, and the hash of A+3*MBS in its positive-one hashed virtual address (HVAP1) 1606 field.

Fig. 19 (comprising Figs. 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Fig. 15. Flow begins at step 1902. In step 1902, the first-level data cache 116 transmits a new allocation request (AR) to the second-level cache 118. The new allocation request is to a new memory block; that is, the prefetch unit 124 determines that it has not already allocated a hardware unit 332 to the memory block implicated by the allocation request.
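The selective FIFO update of step 1708 can be sketched in software as follows; `VirtualHashTable`, `toy_hash`, and the table depth are illustrative stand-ins for the hardware structure, not its actual implementation:

```python
from collections import deque

MBS = 4096
toy_hash = lambda a: (a >> 12) & 0x3F   # stand-in for the Table 1 hash


class VirtualHashTable:
    """FIFO of (HVAM1, HVAUN, HVAP1) triples, modeling the
    selective update of step 1708 (the depth is illustrative)."""

    def __init__(self, num_entries: int = 8):
        self.entries = deque(maxlen=num_entries)  # oldest entry falls off

    def update(self, hvam1: int, hvaun: int, hvap1: int) -> None:
        triple = (hvam1, hvaun, hvap1)
        if triple in self.entries:       # already present: forgo the update
            return
        self.entries.appendleft(triple)  # push as the newest (top) entry


vht = VirtualHashTable()
for va in (0xA000, 0xA000, 0xB000):      # the repeated access updates once
    vht.update(toy_hash(va - MBS), toy_hash(va), toy_hash(va + MBS))
```

The dedup check before the push is what keeps one block's stream of hits from flooding the table with identical triples while still recording block-to-block transitions in recency order.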
That is, the prefetch unit 124 has encountered a memory access to a new memory block. The allocation request is generated as a result of a load/store request missing in the first-level data cache 116 and, in one embodiment, requests allocation of a cache line. The allocation request specifies not the virtual address but only the physical address, which the first-level data cache 116 translated from the virtual address. The first-level data cache 116 also performs the same hash function as in step 1704 of Fig. 17 on the virtual address from which the allocation-request physical address was translated, to generate a hashed virtual address for the allocation request (HVAAR), and provides the hashed virtual address to the prefetch unit 124. Flow proceeds to step 1903.

In step 1903, the prefetch unit 124 allocates a hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates it; otherwise, the prefetch unit 124 allocates the least recently used hardware unit 332. In one embodiment, the prefetch unit 124 inactivates a hardware unit 332 once it has prefetched all the cache lines of the hardware unit's memory block. In one embodiment, the prefetch unit 124 has the ability to pin a hardware unit 332 so that, even if it becomes the least recently used hardware unit 332, it will not be reallocated. For example, if the prefetch unit 124 detects that the program is accessing the memory block according to the detected pattern but the prefetch unit 124 has not yet prefetched the entire memory block according to the pattern, the prefetch unit 124 may pin the hardware unit 332 allocated to that memory block so that it is ineligible for reallocation even if it is the least recently used hardware unit 332. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (from its original allocation), and when the age reaches a predetermined age threshold, the prefetch unit 124 unpins the hardware unit 332. In another embodiment, if, per steps 1904 through 1926, the prefetch unit 124 finds a virtually adjacent memory block for which prefetching has already been completed, the prefetch unit 124 may selectively reuse that virtually adjacent memory block's hardware unit 332 rather than allocating a new hardware unit 332. In this case, the prefetch unit 124 selectively refrains from initializing various storage elements of the reused hardware unit 332 (for example, the direction register 342, the pattern register 344, and the pattern location register 348) in order to preserve the useful information stored therein. Flow proceeds to step 1904.

In step 1904, the prefetch unit 124 compares the hashed virtual address (HVAAR) generated in step 1902 with the negative-one hashed virtual address (HVAM1) 1602 and the positive-one hashed virtual address (HVAP1) 1606 of each entry of the virtual hash table 162.
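A minimal software model of the allocation policy of step 1903 above (prefer an inactive hardware unit 332, otherwise evict the least recently used unpinned one) might look like this; the class, helper names, and two-unit configuration are hypothetical, not the hardware design:

```python
import itertools


class HardwareUnit:
    """Toy model of one hardware unit 332's allocation state."""

    def __init__(self):
        self.active = False
        self.pinned = False
        self.last_use = 0
        self.block = None


def allocate_unit(units, block, clock=itertools.count()):
    """Prefer an inactive unit; otherwise evict the least recently
    used unit that is not pinned (pinned units are ineligible)."""
    free = [u for u in units if not u.active]
    victim = free[0] if free else min(
        (u for u in units if not u.pinned), key=lambda u: u.last_use)
    victim.active, victim.pinned, victim.block = True, False, block
    victim.last_use = next(clock)
    return victim
```

Pinning simply removes a unit from the eviction candidate set, which is how a block that is still mid-pattern can survive LRU pressure from newer blocks.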
The prefetch unit 124 operates according to steps 1904 through 1922 to determine whether any active memory block is virtually adjacent to the new memory block, and according to steps 1924 through 1928 to predict, based on the previously detected access pattern and direction, whether memory accesses will continue from the virtually adjacent active memory block into the new memory block, so as to reduce the warm-up time of the prefetch unit 124 and enable it to begin prefetching the new memory block sooner. Flow proceeds to step 1906.

In step 1906, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address (HVAAR) matches the positive-one hashed virtual address (HVAP1) 1606 of any entry of the virtual hash table 162. If so, flow proceeds to step 1908; otherwise, flow proceeds to step 1912.

In step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. Flow proceeds to step 1916.

In step 1912, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address (HVAAR) matches the negative-one hashed virtual address (HVAM1) 1602 of any entry of the virtual hash table 162. If so, flow proceeds to step 1914; otherwise, flow ends.

In step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. Flow proceeds to step 1916.

In step 1916, the prefetch unit 124 sets a candidate hash register (not shown) to the value of the unmodified hashed virtual address (HVAUN) 1604 of the virtual hash table 162 entry determined in step 1906 or 1912. Flow proceeds to step 1918.
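The matching logic of steps 1906 through 1916 can be sketched as a scan of the virtual hash table; `find_adjacency` and the triple layout are assumptions made for illustration:

```python
def find_adjacency(vht_entries, hvaar):
    """Scan virtual-hash-table triples (HVAM1, HVAUN, HVAP1).

    A hit on an entry's HVAP1 means the new block lies just above
    that entry's block (candidate direction up); a hit on HVAM1
    means it lies just below (candidate direction down).  On a hit,
    the entry's HVAUN becomes the candidate hash; otherwise None.
    """
    for hvam1, hvaun, hvap1 in vht_entries:
        if hvaar == hvap1:
            return ("up", hvaun)
        if hvaar == hvam1:
            return ("down", hvaun)
    return None
```

The returned candidate hash is what a later stage would compare against each active block's HVAMB field to confirm the virtual adjacency.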
In step 1918, the prefetch unit 124 compares the candidate hash (candidate_hva) with the memory block virtual hash address field (HVAMB) 354 of each active memory block within the prefetch unit 124. Flow proceeds to step 1922.

In step 1922, the prefetch unit 124 determines, based on the comparison performed in step 1918, whether the candidate hash matches any memory block virtual hash address field (HVAMB) 354. If the candidate hash matches a memory block virtual hash address field (HVAMB) 354, flow proceeds to step 1924; otherwise, flow ends.

In step 1924, the prefetch unit 124 has determined that the matching active memory block found in step 1922 is indeed virtually adjacent to the new memory block. The prefetch unit 124 therefore compares the candidate direction (determined in step 1908 or 1914) with the direction register 342 of the matching active memory block, in order to predict, based on the previously detected access pattern and direction, whether memory accesses will continue from the virtually adjacent active memory block into the new memory block. Specifically, if the candidate direction differs from the direction register 342 of the virtually adjacent active memory block, memory accesses are unlikely to continue from the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction. Flow proceeds to step 1926.

In step 1926, the prefetch unit 124 determines, based on the comparison performed in step 1924, whether the candidate direction matches the direction register 342 of the matching active memory block. If so, flow proceeds to step 1928; otherwise, flow ends.

In step 1928, the prefetch unit 124 determines whether the new allocation request received in step 1902 targets a cache line predicted by the pattern register 344 of the matching virtually adjacent active memory block detected in step 1926. In one embodiment, to perform the determination of step 1928, the prefetch unit 124 effectively shifts the pattern register 344 of the matching virtually adjacent active memory block by its pattern period register 346, extending the pattern from its location in the virtually adjacent memory block (per the pattern location register 348) into the new memory block, thereby maintaining the continuity of the pattern 334. If the new allocation request targets a cache line predicted by the pattern register 344 of the matching active memory block, flow proceeds to step 1934; otherwise, flow proceeds to step 1932.

In step 1932, the prefetch unit 124 initializes and populates the new hardware unit 332 (allocated in step 1903) according to steps 406 and 408 of Fig. 4, in the hope that it will eventually detect a new pattern of accesses to the new memory block according to the methods described above with respect to Figs. 4 through 6, which will require warm-up time. Flow ends at step 1932.

In step 1934, the prefetch unit 124 predicts that access requests will continue into the new memory block according to the pattern register 344 and direction register 342 of the matching virtually adjacent active memory block. The prefetch unit 124 therefore populates the new hardware unit 332 in a manner similar to step 1932, but with some differences. Specifically, the prefetch unit 124 populates the direction register 342, the pattern register 344, and the pattern period register 346 with the corresponding values from the hardware unit 332 of the virtually adjacent memory block. Additionally, the new value of the pattern location register 348 is determined by continuing to shift the pattern location by the value of the pattern period register 346 until it crosses into the new memory block, so that the pattern register 344 carries over continuously into the new memory block, as described with respect to step 1928. Furthermore, the status field 356 of the new hardware unit 332 is set to mark the new hardware unit 332 as probationary. Finally, the search pointer register 352 is initialized to begin searching from the beginning of the memory block. Flow proceeds to step 1936.

In step 1936, the prefetch unit 124 continues to monitor access requests to the new memory block. If the prefetch unit 124 detects that at least a predetermined number of subsequent access requests to the memory block are to cache lines predicted by the pattern register 344, the prefetch unit 124 promotes the status field 356 of the hardware unit 332 from probationary to active, and then begins prefetching from the new memory block as described with respect to Fig. 6. In one embodiment, the predetermined number of access requests is two, although other embodiments may contemplate other predetermined numbers. Flow ends at step 1936.

Fig. 20 shows a hashed physical address-to-hashed virtual address thesaurus 2002 for use in the prefetch unit 124 of Fig. 15. The thesaurus 2002 comprises an array of entries, each including a physical address (PA) 2004 and a corresponding hashed virtual address (HVA) 2006. The corresponding hashed virtual address 2006 is the result of hashing the virtual address from which the physical address 2004 was translated. The prefetch unit 124 maintains the most recent pairs in the thesaurus 2002 by snooping the pipeline of the load/store unit 134. In another embodiment, in step 1902 of Fig. 19, the first-level data cache 116 does not provide the hashed virtual address (HVAAR) to the prefetch unit 124, but provides only the physical address associated with the allocation request. The prefetch unit 124 looks up the physical address in the thesaurus 2002 to find a matching physical address (PA) 2004 and obtain the associated hashed virtual address (HVA) 2006, which then serves as the hashed virtual address (HVAAR) in the remainder of Fig. 19. Including the thesaurus 2002 in the prefetch unit 124 relieves the first-level data cache 116 of the need to provide the hashed virtual address required by the allocation request, thereby simplifying the interface between the first-level data cache 116 and the prefetch unit 124.

In one embodiment, each entry of the hashed physical address-to-hashed virtual address thesaurus
2002 includes a hashed physical address rather than a physical address 2004, and the prefetch unit 124 hashes the allocation-request physical address received from the first-level data cache 116 into a hashed physical address, which it looks up in the thesaurus 2002 to obtain the appropriate corresponding hashed virtual address (HVA) 2006. This embodiment allows a smaller thesaurus 2002, at the cost of the additional time required to hash the physical address.

Fig. 21 is a block diagram of a multi-core microprocessor 100 according to an embodiment of the invention. The multi-core microprocessor 100 includes two cores (denoted core A 2102A and core B 2102B), referred to collectively as cores 2102 or individually as a core 2102. Each core has elements similar to those of the single-core microprocessor 100 shown in Fig. 2. Additionally, each core 2102 has a highly reactive prefetch unit 2104. The two cores share the second-level cache 118 and the prefetch unit 124. In particular, each core's first-level data cache 116, load/store unit 134, and highly reactive prefetch unit 2104 are coupled to the shared second-level cache 118 and prefetch unit 124. Additionally, a shared highly reactive prefetch unit 2106 is coupled to the second-level cache 118 and the prefetch unit 124. In one embodiment, the highly reactive prefetch unit 2104 and the shared highly reactive prefetch unit 2106 prefetch only the next sequential cache line after the cache line implicated by a memory access.

In addition to monitoring the memory accesses of the load/store units 134 and the first-level data caches 116, the prefetch unit 124 may also monitor the memory accesses generated by the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 in making its prefetch decisions. The prefetch unit 124 may monitor memory accesses from different combinations of memory-access sources to perform the various functions described herein. For example, the prefetch unit 124 may monitor a first combination of memory accesses to perform the functions described with respect to Figs. 2 through 11, a second combination of memory accesses to perform the functions described with respect to Figs. 12 through 14, and a third combination of memory accesses to perform the functions described with respect to Figs. 15 through 19. In one embodiment, the shared prefetch unit 124 is unable, for timing reasons, to monitor the behavior of the load/store unit 134 of each core 2102 directly. The shared prefetch unit 124 therefore monitors the behavior of the load/store units 134 indirectly, via the traffic generated by the first-level data caches 116 as a result of their load/store misses.

Various embodiments of the invention have been described herein, but those of ordinary skill in the art should understand that these embodiments serve only as examples and are not limiting. Those skilled in the art may make various changes in form and detail without departing from the spirit of the invention. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described in the embodiments of the invention, through general programming languages (C, C++), hardware description languages (HDL) (including

Verilog HDL, VHDL, and so on), or other available programming languages. Such software can be disposed on any known computer-usable medium, such as magnetic tape, semiconductor memory, magnetic disk, or optical disc (for example, CD-ROM, DVD-ROM, and the like), or in a transmission medium such as the Internet or a wired, wireless, or other communication medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (embodied in HDL), and transformed into hardware in the production of integrated circuit products. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Accordingly, the invention should not be limited to the disclosed embodiments, but is instead defined by the appended claims and their equivalent implementations. In particular, the invention may be implemented within a microprocessor device used in a general-purpose computer. Finally, although the invention is disclosed above by way of preferred embodiments, they are not intended to limit its scope; those of ordinary skill in the art may make modifications and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 shows the pattern of accesses observed at a second-level cache when executing a program that includes a sequence of store operations through memory.

Fig. 2 is a block diagram of a microprocessor according to the invention.

Fig. 3 is a more detailed block diagram of the prefetch unit of Fig. 2.

Fig. 4 is a flowchart of the operation of the microprocessor of Fig. 2, and in particular of the prefetch unit of Fig. 3.

Fig. 5 is a flowchart of the operation of the prefetch unit of Fig. 3 with respect to the steps of Fig. 4.

Fig. 6 is a flowchart of the operation of the prefetch unit of Fig. 3 with respect to the steps of Fig. 4.

Fig. 7 is a flowchart of the operation of the prefetch request queue of Fig. 3.
Fig. 8 shows two access points within a memory block, used to illustrate the bounding-box prefetch unit of the invention.

Fig. 9 is a block diagram of an example of operation of the microprocessor of Fig. 2.

Fig. 10 is a block diagram of an example of operation of the microprocessor of Fig. 2, continuing the example of Fig. 9.

Fig. 11 is a block diagram of an example of operation of the microprocessor of Fig. 2, continuing the examples of Figs. 9 and 10.

Fig. 12 is a block diagram of a microprocessor according to another embodiment of the invention.

Fig. 13 is a flowchart of the operation of the prefetch unit of Fig. 12.

Fig. 14 is a flowchart of the operation of the prefetch unit of Fig. 12 according to the steps of Fig. 13.

Fig. 15 is a block diagram of a microprocessor having a bounding-box prefetch unit according to another embodiment of the invention.

Fig. 16 is a block diagram of the virtual hash table of Fig. 15.

Fig. 17 is a flowchart of the operation of the microprocessor of Fig. 15.

Fig. 18 shows the contents of the virtual hash table of Fig. 16 after operation of the prefetch unit according to the example described with respect to Fig. 17.

Fig. 19 (comprising Figs. 19A and 19B) is a flowchart of the operation of the prefetch unit of Fig. 15.

Fig. 20 is a block diagram of a hashed physical address-to-hashed virtual address thesaurus used in the prefetch unit of Fig. 15 according to another embodiment of the invention.

Fig. 21 is a block diagram of a multi-core microprocessor of the invention.
DESCRIPTION OF REFERENCE NUMERALS

100: microprocessor
102: instruction cache
104: instruction decoder
106: register alias table
108: reservation stations
112: execution units
132: other execution units
134: load/store unit
124: prefetch unit
114: retire unit
116: first-level data cache
118: second-level cache
122: bus interface unit
162: virtual hash table
198: queue
172: first-level data search pointer
178: first-level data pattern address
196: first-level data memory address
194: pattern-predicted cache line address
192: cache line allocation request
188: cache line data
354: memory block virtual hash address field

356: status field
302: block bitmask register
303: block number register
304: minimum pointer register
306: maximum pointer register
308: minimum-change counter
312: maximum-change counter
314: total counter
316: middle pointer register
318: period match counters
342: direction register
344: pattern register
346: pattern period register
348: pattern location register
352: search pointer register
332: hardware unit
322: control logic
328: prefetch request queue
324: pop pointer
326: push pointer
2002: hashed physical address-to-hashed virtual address thesaurus
2102A: core A
2102B: core B
2104: highly reactive prefetch unit
2106: shared highly reactive prefetch unit

Claims (1)

VII. Claims:

1. A prefetcher, disposed in a microprocessor having a cache memory, the prefetcher being configured to receive a plurality of access requests each to an address within a memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time, the prefetcher comprising: a storage device; and control logic, coupled to the storage device, wherein as the access requests are received the control logic is configured to: maintain, in the storage device, a largest address and a smallest address of the access requests, and counts of changes to the largest and smallest addresses; maintain a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determine an access direction based on the counts; determine an access pattern based on the history; and prefetch into the cache memory, in the access direction according to the access pattern, cache lines of the memory block that the history indicates have not recently been accessed.

2. The prefetcher of claim 1, wherein the control logic is further configured to refrain from prefetching until the number of recently accessed cache lines of the memory block exceeds a predetermined value.

3. The prefetcher of claim 2, wherein the predetermined value is at least nine.

4. The prefetcher of claim 2, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block.

5. The prefetcher of claim 1, wherein, to determine the access direction based on the counts, the control logic is configured to: determine that the access direction is upward when the count of changes to the largest address exceeds the count of changes to the smallest address by more than a predetermined value; and determine that the access direction is downward when the count of changes to the smallest address exceeds the count of changes to the largest address by more than the predetermined value.

6. The prefetcher of claim 1, wherein the control logic is further configured to refrain from prefetching until the absolute value of the difference between the count of changes to the largest address and the count of changes to the smallest address exceeds a predetermined value.

7. The prefetcher of claim 1, wherein: the history comprises a bitmask indicating the recently accessed cache lines implicated by the access addresses within the memory block; and, as the access requests are received, the control logic is further configured to: compute a middle pointer of the recently accessed cache lines within the bitmask; and, for each of a plurality of distinct bit periods, increment a match counter associated with the bit period when the N bits of the bitmask to the left of the middle pointer match the N bits of the bitmask to the right of the middle pointer, where N is the number of bits of the bit period.

8. The prefetcher of claim 1, wherein, to determine the access pattern based on the bitmask, the control logic is configured to: detect that the match counter associated with one of the bit periods exceeds the match counters associated with the other bit periods by more than a predetermined value; and determine the access pattern to be the N bits of the bitmask on one side of the middle pointer, where N is the number of bits of the one bit period whose match counter exceeds the match counters of the other bit periods by more than the predetermined value.

9.
The prefetcher of claim 8, wherein, to prefetch into the cache memory, in the access direction according to the access pattern, the cache lines of the memory block that the bitmask indicates have not recently been accessed, the control logic is configured to: maintain, in the access direction, a search pointer and the access pattern at a distance of N bits from the middle pointer; and, when the bit of the access pattern at the search pointer indicates an access, prefetch the cache line associated with the corresponding bit of the bitmask at the search pointer.

10. The prefetcher of claim 9, wherein, to prefetch the cache lines, the control logic is further configured to: increment or decrement the search pointer according to the access direction; and, when the bit of the access pattern at the incremented or decremented search pointer indicates an access, prefetch the cache line associated with the corresponding bit of the bitmask at the incremented or decremented search pointer.

11. The prefetcher of claim 10, wherein the control logic is further configured to repeat the incrementing or decrementing of the search pointer and the prefetching until a condition occurs, the condition comprising: when the access direction is upward, the distance between the bit of the bitmask at the search pointer and the bit of the bitmask associated with the largest address is greater than a second predetermined value; and, when the access direction is downward, the distance between the bit of the bitmask at the search pointer and the bit of the bitmask associated with the smallest address is greater than the second predetermined value.

12. The prefetcher of claim 7, wherein the control logic is further configured to refrain from prefetching until the match counter associated with one of the distinct bit periods exceeds the match counters associated with the other distinct bit periods by more than a predetermined value.

13. The prefetcher of claim 1, wherein the bit periods are 3, 4, and 5 bits.

14. The prefetcher of claim 1, wherein the control logic is further configured to forgo prefetching a cache line that is already present in any cache memory of the microprocessor.

15. The prefetcher of claim 1, wherein the size of the memory block is 4 kilobytes.

16. The prefetcher of claim 1, further comprising: a plurality of the storage devices; wherein the control logic is configured to receive an access request whose address is within a new memory block associated with none of the storage devices, and to allocate one of the storage devices to the new memory block.

17. The prefetcher of claim 16, wherein the control logic is further configured to clear the count of changes to the largest address, the count of changes to the smallest address, and the history of the allocated one of the storage devices.

18. A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: receiving a plurality of access requests each to an address within a memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time; as the access requests are received, maintaining a largest and a smallest address within the memory block, and counting changes to the largest and smallest addresses; as the access requests are received, maintaining a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determining an access direction based on the counts; determining an access pattern based on the history; and prefetching into the cache memory, in the access direction according to the access pattern, cache lines of the memory block that the history indicates have not recently been accessed.

19. The method of claim 18, further comprising refraining from prefetching until the number of recently accessed cache lines of the memory block exceeds a predetermined value.

20. The method of claim 19, wherein the predetermined value is at least nine.

21. The method of claim 19, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block.

22.
如申請專利範圍第18項所述之資料預取方法, 其中為了上述之根據上述計數值決定上述存取方向更包 括: 當上述最大位址之變化的計數值與上述最小位址之變 化的計數值之間的差值係大於一既定值時,決定上述存取 方向係向上;以及 當上述最小位址之變化的計數值與上述最大位址之變 0608-A43067TW/fmal 64 201135460 化的計數值之間的i 取方向係向下。值係大於上述既定值時,決定上述存 23· &gt;申請專利範 更包括在上迷最大㈣弟】8項所述之資料預取方法, 變化的計數值之變化的計數值與上述 最小位址之 缓上述預取動作。俊的絕對值係、大於-既定值之前,暫 其中一請專利範圍第—資料預取方法, 上述歷史記錄 _ 出上述最近被存取之 位70遮罩,上述位元遮罩用以指 線係與上述記憶體區塊之1 止述最近被存取之快取 當已接ί上述存取時,更包括: 十算在上述位元遮罩中之上述最 線的:中間指標暫存器;以及 破存取之快取 ν二Ί&quot;4中間指標暫存器之左側的上述位元遮罩之 兀/、上速中間指標暫存器之右 :、 之Ν位元匹配時,為複數個不同的仅=兀遮罩 bitpe—中之每—者,增加上述位 U(d_Ct 匹配計數器的計數值,1中 巧摘相關之一 數。 ,、?N為上輕元週期中位元 25·如申請專利範圍第24項所述之資料 其中為了根據上述位元鮮決定上述存取樣能包括., 债測上述位元週期之者所相關的上 述位元週期之其它者所相關的上述匹配計數哭 ^上 係否係大於一既定值;以及 。3的差值 0608-A43067TW/flnal 65 201135460 決定被上述位元遮罩之上述中間指標暫存器之其中一 側的N位元所指定的上述存取樣態,其中N為上述位元週 期之一中之位元的號碼,上述位元週期之上述一者所具有 之相關匹配計數器與上述位元週期之其它者所具有之相關 匹配計數器之間的差值大於上述既定值之上述計數器關於 上述所有其他清楚位元週期之間的差。 26. 如申請專利範圍第25項所述之資料預取方法, 其中為了根據上述存取樣態並沿著上述存取方向,將上述 記憶體區塊中被上述位元遮罩標指為最近尚未被存取之快 取線預取至上述快取記憶體中,上述控制邏輯係用以: 沿著上述存取方向,分派一搜尋指標器以及距離上述 中間指標器N位元之上述存取樣態;以及 當上述搜尋指標器上之上述存取樣態中的位元指示一 存取時,預取上述搜尋指標器上之上述位元遮罩中之上述 位元所相關之快取線。 27. 如申請專利範圍第26項所述之資料預取方法, 其中為了根據上述存取樣態並沿著上述存取方向,將上述 記憶體區塊中被上述位元遮罩標指為最近尚未被存取之快 取線預取至上述快取記憶體中,更包括: 根據上述存取方向,增加/減少上述搜尋指標器的值; 以及 當增加/減少後之上述搜尋指標器上之上述存取樣態中 的位元指示一存取時,預取上述已增加/減少之上述搜尋指 標器上之上述位元遮罩中之上述位元所相關之快取線。 28. 如申請專利範圍第27項所述之資料預取方法, 0608-A43067TW/fmal 66 201135460 更包括: 作,ί = 尋指標器的值以及進行預取的動 ^狀況出現,其中上述狀況包括: 田上核取方向係向上時,上述搜尋指標器上之 上^ =罩之中的位70與在上述最大位址所相關之 :位兀遮罩之中的位元之間的距離係大 疋值;以及 當上述存取方向係向下時,上述搜尋指標器上之 ^:立ϋ遮罩之巾的位元與上述最小位址所 3元遮罩之令的位元之間的距離係大於上述第二既 疋值。 Μ.如申請專利範圍第24項 更包括在上述不同位元週期之㈣取方法 哭盥 ^ 者所相關的上述匹配計數 兀週期之其它者所相關的上述匹配計數器 之間的差值係大於-岐值之前,暫緩上述預取動作。。 3〇·如申料利範圍第18項所述之資料預取方法, ,、中上述位元週期為3、4以及5位元。 扎如申請專利範圍第18項所述 快取線已出現在上述微處理器之任—快取記 隐體%,放棄預取上述快取線。 32. 如申請專利範圍帛18項所述之資料預 其中上述記憶體區塊之大小係4千位元組。 ,, 33. τ種電腦程式產品,編喝於至少_電腦可讀取媒 上’並適用於-計算裝置,上述電腦程式產品包括. 
一::可讀程式編碼?存於上述電腦可讀取媒體 201135460 中,用以在具有一快取記憶體之一微處理器中,定義出 (specify)—預取單元,上述電腦可讀程式包括: 其中上述預取單元係用以接收對一記憶體區塊之複數 位址的複數存取要求,每一存取要求對應上述記憶體區塊 之位址中之一者,並且上述存取要求之位址係隨著時間函 數非單調性地(non-monotonically)增加或減少;; 一第一程式碼,用以定義出一儲存裝置;以及 一第二程式碼,用以定義出一控制邏輯,耦接至上述 儲存裝置,其中當接收到上述存取要求時,上述控制邏輯 則用以: 藉由上述儲存裝置維持之上述存取之一最大位址 以及一最小位址,並且計算上述最大以及最小位址之 變化的計數值; 藉由上述記憶體區塊維持上述記憶體區塊中最近 被存取之快取線的一歷史記錄; 根據上述計數值決定一存取方向; 根據上述歷史紀錄決定一存取樣態;以及 根據上述存取樣態並沿著上述存取方向,將上述 快取記憶體内尚未被上述歷史記錄指示為已存取之快 取線預取至上述記憶體區塊中。 34. 如申請專利範圍第33項所述之電腦程式產品, 其中上述至少一電腦可讀媒體係擇自於一碟片、磁帶或者 其他具磁性、光學或者電子儲存媒體以及一網路、線路、 無線或者其他通訊媒體。 35. 一種資料預取方法,用以預取資料進入一微處理 0608-A43067TW/fma] 68 201135460 , 器之一快取記憶體,上述資料預取方法包括: 接收對一記憶體區塊之一位址的一存取要求; 設定一位元遮罩中與一快取線所相關之一位元,其中 上述快取線係與上述記憶體區塊之上述位址相關; 於接收到上述存取要求之後,增加一總計數器之計數 值: 當上述位址大於一最大指標暫存器的值,用上述位址 更新上述最大指標暫存器,並且增加一最大改變計數器之 計數值; 當上述位址小於一最小指標暫存器,用上述位址更新 上述最小指標暫存器,並且增加一最小改變計數器之計數 值; 計算一中間指標暫存器,作為上述最大以及最小改變 計數器之平均值; 當上述中間指標暫存器之左側的上述位元遮罩之N位 元與上述中間指標暫存器之右側的上述位元遮罩之N位元 匹配時,為複數個不同的位元週期(distinct bit period)中之 每一者,增加上述位元週期所相關之一匹配計數器的計數 值,其中N為上述位元週期中之位元數; 決定一狀況是否出現,其中上述狀況包括: (A) 上述存取總計數器大於一第一既定值; (B) 上述最大改變計數器與最小改變計數器相減 取絕對值後的差係大於一第二既定值;以及 (C) 上述匹配計數器之一者與其它者間之計數值 間之差值的絕對值係大於一第三既定值;以及 0608-A43067TW/final 69 201135460 當上述狀況存在時: 當上述最大改變計數器大於上述最小改變計數器 時,決定上述存取方向係向上,並且當上述最大改變 計數器小於上述最小改變計數器時,決定上述存取方 向係向下; 決定被上述位元遮罩之上述中間指標暫存器之其 中一側的N位元所指定的上述存取樣態,其中N為上 述位元週期中與上述最大匹配計數器相關之一者的位 元數;以及 根據所決定之上述存取方向與上述存取樣態,將 上述記憶體區塊之複數快取線預取至上述快取記憶體 中。 36.如申請專利範圍第36項所述之資料預取方法, 其中上述根據所決定之上述存取方向與上述存取樣態,將 上述快取線預取至上述快取記憶體中的步驟包括: (1) 沿著上述存取方向,初始化一搜尋指標器以及距離 上述中間指標器N位元之上述存取樣態; (2) 決定一第二狀況是否存在,其中上述第二狀況包 括: (D)在上述搜尋指標器之上述存取樣態的位元已 •i-fL · δ又疋, (Ε)在上述搜尋指標器之上述位元遮罩的位元已 清除;以及 (F)在上述存取方向上,上述最大/最小指標器與 上述搜尋指標器之上述位元遮罩中之位元間之差距係 0608-A43067TW/fmal 70 201135460 ; 小於一第四既定值;以及 (3)當上述第二狀況存在,預取上述搜尋指標器之上述 位元遮罩中之位元所相關的上述快取線。 37. 如申請專利範圍第36項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 於上述第二狀況存在時,在決定上述第二狀況存在以 及存取之後,根據上述存取方向,增加/減少上述搜尋指標 器的值;以及 重複上述步驟(2)以及(3)。 38. 如申請專利範圍第37項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 當上述狀況(F)為真,停止上述重複步驟。 39. 
如申請專利範圍第37項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 當上述位元遮罩之所有位元都已測試完,停止上述重 複步驟。 40. —種微處理器,包括: 複數核心; 一快取記憶體,由上述核心所共享,用以接收對一記 憶體區塊之複數位址的複數存取要求,每一存取要求對應 上述記憶體區塊之位址中之一者,上述存取要求之位址係 隨著時間函數非單調性地(non-monotonically)增加或減 0608-A43067TW/final 71 201135460 少;以及 一預取單元,用以: 監視上述存取要求,並維持上述記憶體區塊中之一最 大位址以及一最小位址,以及上述最大位址以及最小位址 之變化的計數值; 根據上述計數值,決定一存取方向;以及 沿著上述存取方向,將上述記憶體區塊中未命中之快 取線預取至上述快取記憶體中。 41. 如申請專利範圍第40項所述之微處理器,其中 上述預取單元更用以: 維持上述記憶體區塊中最近被存取之快取線的一歷史 記錄,上述最近被存取之快取線係與上述存取要求之位址 相關; 根據上述歷史記錄,決定一存取樣態;以及 根據上述存取樣態並沿著上述存取方向,將上述快取 記憶體内被上述歷史記錄指示為最近尚未被存取且在上述 記憶體區塊中是未命中的複數快取線預取至上述記憶體區 塊中。 42. 一種微處理器,包括: 一第一級快取記憶體; 一第二級快取記憶體;以及 一預取單元,用以: 偵測出現在上述第二級快取記憶體中之最近存取 要求之一方向以及樣態,以及根據上述方向以及樣態,將 複數快取線預取至上述第二級快取記憶體中; 0608-A43067TW/fma! 72 201135460 : 從上述第一級快取記憶體,接收上述第一級快取 記憶體所接收之一存取要求之一位址,其中上述位址與一 快取線相關, 決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級 快取記憶體中。 43. 如申請專利範圍第42項所述之微處理器,其中: 為了偵測出現在上述第二級快取記憶體中上述最近存 取要求之上述方向以及樣態,上述預取單元係用以偵測一 記憶體區塊之上述方向以及樣態,上述記憶體區塊係可被 上述微處理器存取之記憶體範圍之一小集合; 為了決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線,上述預取單元係用以: 放置上述樣態至上述記憶體區塊,使得上述位址 位於上述樣態中;以及 沿著上述方向,由上述位址開始搜尋,直到遇到 上述樣態所指出之一快取線。 44. 如申請專利範圍第43項所述之微處理器,其中: 上述樣態包括快取線之一順序; 其中為了放置上述樣態至上述記憶體區塊,使得上述 位址位於上述樣態中,上述預取單元係用以藉由上述順序 將上述樣態轉移(shift)至上述記憶體區塊。 45. 如申請專利範圍第43項所述之微處理器,其中 出現在上述第二級快取記憶體中之上述記憶體區塊的上述 0608-A43067TW/final 73 201135460 最近存取要求之上述位址係隨著時間函數而非單調性地 (non-monotonically)增力口 以及減少。 46. 如申請專利範圍第45項所述之微處理器,其中 出現在上述第二級快取記憶體中之上述記憶體區塊的上述 最近存取要求之上述位址可為非連續的(non-sequentail)。 47. 如申請專利範圍第42項所述之微處理器,更包 括: 複數核心;其中 上述第二級快取記憶體以及預取單元係由上述核 心所共享;以及 每一上述核心包括上述第一級快取記憶體之一不 同之範4歹1Ka distinct instantation)。 48. 如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第一級快取 記憶體中,上述預取單元係用以提供上述一個或多個快取 線之位址至上述第一級快取記憶體,其中上述第一級快取 記憶體係用以從上述第二級快取記憶體中要求上述一個或 多個快取線。 49. 如申請專利範圍第48項所述之微處理器,其中 上述第一級快取記憶體包括一佇列,用以儲存從上述預取 單元所接收之上述位址。 50. 
如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第一級快取 記憶體中,上述預取單元係從上述微處理器之一匯流排介 面單元要求一個或多個快取線,並且隨後將提供上述所要 0608-A43067TW/final 74 201135460 求到之快取線提供至上述第一級快取記憶體。 、51.如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第—級快取 記憶體中,上述預取單元係用以自上述第二級快取記憶體 中要求上述一個或多個快取線。 一 52. 如申請專利範圍第51項所述之微處理器,其中 上述預取單元係用以將上述被所要求到之快取線隨後提供 至上述第一級快取線。 ’、 53. 如申明專利範圍第51項所述之微處理器,其中 上述第二級快取記憶體係用以所要求之快取線隨後提供至 上述第一級快取線。 54. 如申明專利範圍第42項所述之微處理器,其中 上述預取單元偵測上述方向以及樣態的步驟,包括: 曰當接收到上述最近存取要求時,轉—記憶體區塊之 一最大位址以及一最小位址,以及上述最大位址以及上述 最小位址之改變的計數值; 當接收到上述最近存取要求時,維持上述記憶體區塊 之上述存取位址所相關之最近存取之快取線之一歷史呓 錄;以及 σ 根據上述計數值,決定上述方向;以及 根據上述歷史記錄,決定上述樣態。 55*如申請專利範圍第54項所述之微處理器,上述 根據上述計數值決定上述方向的步驟包括: 當上述最大位址之變化之計數值與上述最小位址之變 化之計數值間之差值係大於一既定值時,決定上述方向係 0608-A43067TW/final 201135460 向上;以及 當上述最小位址之變化 化之計數值間之差值係 /、上述最大位址之變 係向下。 核定值時,決定上述方向 56&gt;如申請專利範圍第42項所、十、 上述歷史記錄包括—位元遮罩斤迷之微處理器,其中·· 區塊之上述存取位址所相 孕 用以指出上述記憶體 當接收到上述存取要快取線,· 下列步驟·· 了上述預取單元更包括進行 .計算上述位元遮罩 一令間指標暫存器;以及 砍最近存取之快取線的 當上述令間指標暫存 N位與上述中間指 胃1的上述位^遮罩之 之N位元匹配時,為複數個盗之右則的上述位元遮罩 bit period)中之每—者,姆 同的位元週期(distinct 匹配計數器的計數值,^口土述位元週期所相關之一 元數。 一為上述位元週期中之位 57.如申請專利範圍苐% 上述根據上述位元遮罩決、述之微處理器,其中 偵測上述位元週期之 Ά取樣態的步驟包括: 上述位元週期之其它者:相關的上述匹配計數器與 值是否大於—既定值;以丨、上述匹配計數器之間的差 決疋被上述位元遮罩 側的N位元所指定的:述間f標暫存器之其卜 期之一者的位元數,上 、’〜、〇中N為上述位元週 0608-A43067TW/f,nal 疋週期之上述一者所耳有之相 76 201135460 關匹崎㈣與上述位元週期之其它者所具有之相關匹配 计數益之間的差值大於上述既定值。 58. 一種資料預取方法,用以預取資料至具有一第二 級快取記憶體之―微處理以—第-級錄錢體,上述 資料預取方法包括: 偵測出現在上述第二級快取記憶體中之最近存取要求 之-方向以及樣態,以及根據上述方向以及樣態,將複數 快取線預取至上述第二級快取記憶體中; 線相關 從上述第一級快取記憶體’接收上述第一級快取呓憶 :所接收之-存取要求之—位址,其中上述位址與一快取 決疋在上述方向中所相關之快取線之後被上述樣態所 指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級快取 記憶體中。 、 59.如申請專利範圍第58項所述之資料預取 其中: 上述偵測出現在上述第二級快取記憶體中上述最近存 取要求之上述方向以及樣態的步驟,包括偵測—記憶體區 塊之上述方向以及樣態,上述記憶體區塊係可被上 理器存取之記憶體範圍之一小集合; ^ 決定在上述方向巾所相關之快取線之後被上述樣態所 指出之一個或多個快取線的步驟,包括: 放置上述樣態至上述記憶體區塊,使得上述位址 位於上述樣態中;以及 〇60S-A43067TW/flnaI 77 201135460 沿著在上述方向,由上述位址開始搜尋,直到遇 到上述樣態所指出之一快取線。 60·如申請專利範圍第59項所述之資料預取方法, 其中上述樣態包括快取線之一順序,並且放置上述樣態至 上述記憶體區塊,使得上述位址位於上述樣態中的步驟, 包括藉由上述順序將上述樣態轉移至上述記憶體區塊。 61. 
如申請專利範㈣59項所述之資料預取方法, 其中出現在上述第二級快取記憶體中之上述記憶體區塊的 上迤最近存取要求之上述位址係隨著時間函數而非單調性 地(non_monotonically)增加以及減少。 62. 如申請專利範圍第61項所述之資料預取方法, 其中出現在上述第二級快取記憶體中之上述記憶體區塊的 上逑最近存取要求之上述位址可為非連續的 (non-sequentail) 〇 .如申5月專利範圍第58項所述之資料預取方法, 述微處理器更包括複數核心,並且上述第二級快取 體以及預取單S係由上述核心所共享,並且每一上述 核心包括上述第—級快取記憶體之—不同之範例。 • b中明專利圍第58項所述之資料預取方法, 述一個或多個快取線被預取至上述第-級快取 ^、二、y驟’包括上述微處理器之一預取單元用以提供 $了個或多個快取線之位址至上述第—級快取記憶體, 體弟—級快取記憶體係用以從上述第二級快取記憶 體中要求上述-個或多個快取線。 _-M36〇L^m利範㈣%項所述之㈣預取方法, 78 201135460 其中導致上述一個或多個快取線被預取至上述第_級快取 3己憶體的步驟,包括上述微處理器之一預取單元用以提供 上述一個或多個快取線之位址至上述第一級快取記憶體 中,其中上述第一級快取記憶體自上述微處理器之—匯流 排介面單元用以要求上述一個或多個快取線’並且隨後將 上述要求之一個或多個快取線提供至上述第一級快取記憔 66.如申請專利範圍第58項所述之資料預取方法, 其中導致上述一個或多個快取線被預取至上述第一級快取 記憶體的步驟,包括上述預取單元係用以自上述第二級快 取έ己憶體中要求上述一個或多個快取線。 67.如申請專利範圍第66項所述之資料預取方法, 其中導致上述-個或多錄取線被預取至上述第—級快取 記憶體的步驟’包括上述預取單元係用以將所要求之」個 或多個快取線隨後提供至上述第—級快取線。 如申明專利範圍第66項所述之資料預取方法, 更包括上述第二級快取記憶體剌以將所要求之—個 個快取線隨後提供至上述第—㈣取線。 / 69. -種電腦程式產品’編碼於至少 體之上’並適用於一呻笞挺罢l 电腼h貝取媒 一册 、裝置,上述電腦程式產品包括: 中,“二,式編碼’館存於上述電腦可讀取媒體 中義—微處理器,上述電腦可讀程式包括: -第二::碼’用以定義一第一級快取記憶體裝置; 以及 工碼,用以定義一第二級快取記憶體裝置,· 0608-A43067TW/fJnaj 79 201135460 第二程式碼,用以定義—預取單元,使得上述預取 早元用以: 、、#測出現在上述第二級快取記憶體中之最近存取 =求之t向以及樣態’以及根據上述方向以及樣態,將 複數快取線預取至上述第二級快取記憶體中; 從上述第一級快取記憶體,接收上述第一級快取 吕己憶體所接收之7¾. g» φ ^ 仔取要求之一位址,其中上述位址與一 快取線相關; 決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級 快取記憶體中。 70. 如申請專利範㈣69項所述之電腦程式產品, 其中上述至少-電腦可讀媒體係擇自於一碟片、磁帶或者 其他具磁性、光學或者電子料雜以及-網路、線路、 無線或者其他通訊媒體。 71. 一種微處理器,包括·· 一快取記憶體;以及 一預取單元,用以: 、、偵測具有一第一記憶體區塊之複數記憶體存取要 求之樣恶,並且根據上述樣態從上述第一記憶體區 塊預取複數快取線至上述快取記憶體中; 監視一第二記憶體區塊之一新的記憶體存取 求; 決定上述第一記憶體區塊是否虛擬鄰近於上 〇608-A43067TW/fmal 义牙》 201135460 二記憶體區塊,並且當自上述第一記憶體區塊延續至 上述第二記憶體區塊時,則決定上述樣態是否預測到 上述第二記憶體區塊之新的記憶體存取要求所相關之 —快取線在上述第二記憶體區塊中;以及 根據上述樣態,響應地(responsively)從上述第二 記憶體區塊將上述絲線預取至上述快取記憶體中。 72. 如申請專利範圍第71項所述之微處理器, 其中上述第一以及第二記憶體區塊之大小對應於—每 體記憶體分頁之大小。 m 73. 
如申請專利範圍第71項所述之微處理哭, 其中上述微處理器包括一第二級快取記憶體,其;上 述新的δ己憶體存取要求包括自上述微處理器之—第— 級快取記憶體至上述第二級快取記憶體的要求,用以 分派上述第二記憶體區塊之上述快取線。 74如申請專利範圍第71項所述之微處理器, ’、中為了彳貞測上述第_記憶體區塊之上述記憶體存取 要求之上述樣態,上述預取單元係用則貞測上述記恨 體存取要求之一方向;並且 〜 —為了決疋上述第一記憶體區塊是否虛擬鄰近於上 述第-記憶體區塊’上述預取單元用以決定上述第一 記憶體區塊在上述方向中是否虛擬鄰近於 憶體區揄。 1 、、如申請專利範圍第74項所述之微處理器, 二中上述第—§己憶體區塊之上述記憶體存取要求的上 述位址係__函數非單調性 0608-A43067TW/f,nal 201135460 76·如申請專利範圍第74項所述之微處理器, 其中當自上述第一記憶體區塊延續至上述第二記憶體 區塊時,為了決定上述樣態是否預測到上述第二記憶 體區塊之上述新的記憶體存取要求所相關之上述快取 線,上述第二記憶體區塊中,上述預取單元係用以在 沿著上述方向自上述第一記憶體區塊延續至上述第二 記憶體區塊時,決定上述樣態是否預測上述第二記憶 體區塊之上逑新的記憶體存取要求所相關之上述快取 線在上述第二記憶體區塊中。 7?,如申請專利範圍第74項所述之微處理器, 其中為了根據上述樣態自上述第二記憶體區塊將上述 决取線預取至上述快取記憶體中,上述預取單元係用 以根據上述樣態且沿著上述方向,自上述第二記憶體 區塊將上述絲線預取至上述絲記憶體中。 如申π專利範圍第71項所述之微處理哭, 其中上述樣態包括上述第—記憶體區塊之複數快ς線 的:順序’其中當自上述第一記憶體區塊延續至上述 第二記憶體區塊時’為了決定上述樣態是否預測到上 述第二記憶體區塊之上述新的記憶體存取要求所相關 之上述快取線在上述第二記憶體區塊中,上述預取單 7G係用以在根據上述快取線的一順序自上述第一記憶 f區塊延續至上述第二記憶體區塊時,決定上述樣態 疋否預測到上述第二記憶體區塊之上述新的記憶體存 =要求所相關之上述快取線在上述第二記憶體區塊 0608-A43067TW/final 82 201135460 79. 如申請專利範圍第71項所述之微處理器, 其中上述快取單元更用以等待根據上述樣態從上述第 二記憶體區塊上述快取線預取至上述快取記憶體中, 直到當自上述第一記憶體區塊延續至上述第二記憶體 區塊時,決定上述樣態是否預測到在上述新的記憶體 存取要求之後有上述第二記憶體區塊之至少一既定值 之記憶體存取要求的每一者所相關之一快取線。 80. 如申請專利範圍第71項所述之微處理器, 其中上述後續的記憶體存取要求之既定數量為2。 81. 如申請專利範圍第71項所述之微處理器, 其中預取單元更用以: 維持由複數項目所構成之一項目表,其中上述項 目表之每一項目包括第一、第二以及第三欄位,其中 上述第二欄位保持(hold) —最近存取之記憶體區塊之 虛擬位址的代表值,其中上述第一欄位保持在一方向 與上述最近存取之記憶體區塊虛擬相鄰之一記憶體區 塊之虛擬位址的代表值,其中上述第三攔位保持在另 一方向與上述最近存取之記憶體區塊虛擬相鄰之一記 憶體區塊之虛擬位址的代表值。 82. 如申請專利範圍第81項所述之微處理器, 其中為了決定上述第一記憶體區塊是否虛擬相鄰於上 述第二記憶體區塊,上述預取單元係用以: 決定上述第二記憶體區塊之虛擬位址之代表值是 否匹配於上述項目表之項目之一者的上述第一欄位或 者第三攔位;以及 0608-A43067TW/final 83 201135460 =麵匹配之上述項目之上㈣二般否匹配 於上述第一記憶體區塊之虛擬位址之代表值。 83.如申請專利範圍第8ι項所述之微處理器, ,、中為了維持上述表,上述預取單it係用以: 根據-先進先出的方式,將上述項目 目表中’以回應上述微處理器之—载 Ί 生之記憶體存取要求。 载入/儲存早續產 84. 如申請專利範圍第81項所述之微處理哭, 其中上述記憶體區塊之上述虛擬位址的代表值包括 述記憶韙區塊之虛擬位址之一雜湊之位元。 上 85. 
如申請專利範圍第84項所述之微處理器, 其中上述記憶體區塊之虛擬位址之上述雜湊之位_么 根據下列演算法則之一雜湊,其中hashjj]表示第.ίτ' 雜湊之位元,以及VA[k]表示第k個上述記憶體區: 之虛擬位址的位元: hash[5]=VA[29]AVA[18]AVA[17]; hash[4]=VA[28]AVA[19]AVA[16]; hash[3]=VA[27]AVA[20]AVA[15]; hash[2]=VA[26]AVA[21]AVA[14]; hash[l]=VA[25]AVA[22]AVA[13]; hash[0]=VA[24]AVA[23]AVA[12]。 86.如申請專利範圍第71項所述之微處理器,更包 括複述核心,其中上述快取記憶體以及預取單元由上述检 心所共享。 0608-A43067TW/final 84 201135460 87' 一種資料預取方法,用以預取資料至一微處理器 之一快取記憶體,上述資料預取方法包括: °° 偵測具有一第一記憶體區塊之複數記憶體存取要求之 一樣態,並且根據上述樣態從上述第一記憶體區塊預取快 取線至上至上述快取記憶體中; 、 監視一第一記憶體區塊之一新的記憶體存取要求; 決疋上述第一 ^己憶體區塊是否虛擬鄰近於上述第二呓 憶體區塊,並且當自上述第一記憶體區塊延續至上述第二 記憶體區塊時,決定上述樣態是否預測到上述第二記憶&amp; 區塊之新的記憶體存取要求所相關之—快取線在上述第二 記憶體區塊中;以及 根據上述樣態,從上述第二記憶體區塊將複數快取線 預取至上述快取記憶财,以回應上述決定步驟。 队如申請專利範圍帛87項所述之資料預取方法, 其中上述第一以及第二記憶體區塊之大小 憶體分頁之大小。 κ此冗 89.如申請專利範圍第87項所述之資料預取方法, 其中上述微處理器包括—第二級快取記憶體,並中上述 的記憶體存取要求包括自上述微處 ^ 5 μ 第—級快取記 隐胆至上述第二級快取記憶體的要求,用以分派上 記憶體區塊之上述快取線。 一 ”請專利範圍第87項所述之資料預取方法, /、中偵測上㈣-記憶體區叙魏記㈣ 的步驟’更包括偵測上述記憶體存取存取之一方 定上述第―記憶㈣塊^麵料於上述第 201135460 二1 己憶體區塊的步驟,更包括衫上述第-記憶體區塊是 否在上述方向中虛擬鄰近於上述第二記憶體區塊。 儿#申請專利範圍第9〇項所述之資料預取方法, 二中具有記憶體區塊之上述記憶體存取存取的上 处位址倾著時間函數非單雛地增加或者減少。 :2. ”請專利範圍第9〇項所述之資料預取方法, ;中:自上述第一記憶體區塊延續至上述第二記憶體區塊 上述樣態是否預_上述第二記憶體區塊之上述 取要求所相關之上述快取線在上述第二記憶 中的步驟包括在沿著上述方向自上述第一記憶體區 =至上述第二記憶體區塊時’決定上述樣態是否預測 一體區塊之上述新的記憶财取要求所相關之 快取線在上述第二記憶體區塊中。 广”請專利範圍第90項所述之資料預取方法, :根據上述樣態自上述第二記憶體區塊將複數快取線預 、、至上述快取記憶體中的步驟包括根據上述樣態且沿著上 =方向’自上述第二記憶體區塊將上述快取線 快取記憶體中。 &amp; 4· #中專利㈣第87項所述之資料預取方法, =中上,樣態包括具有上述第一記憶體區塊之複數快取線 =順序,其中當自上述第—記憶體區塊延續至上述第二 =思體區塊時’為了決定上述樣態是否預測到上述第二記 =體區塊之上述新的記憶體存取要求所相關之上述快取線 述土記憶體區塊中’上述預取單元係用以在根據上 。取線的&quot;順序自上述第—記憶體區塊延續至上述第- 〇6〇8-A43〇67TW/finaI ^ 〜戸、土工您弗一 86 201135460 ; 記憶體區塊時,決定上述樣態是否預測上述第二記憶體區 塊之上述新的記憶體存取要求所相關之上述快取線在上述 第二記憶體區塊中。 95. 如申請專利範圍第87項所述之資料預取方法, 更包括暫缓根據上述樣態從上述第二記憶體區塊上述快取 線預取至上述快取記憶體中,直到當自上述第一記憶體區 塊延續至上述第二記憶體區塊時,決定上述樣態是否預測 到在上述新的記憶體存取要求之後有上述第二記憶體區塊 之至少一既定數量之記憶體存取要求的每一者所相關之一 快取線。 96. 如申請專利範圍第87項所述之資料預取方法, 其中上述後續的記憶體存取要求之既定數量為2。 97. 
如申請專利範圍第87項所述之資料預取方法, 更包括: 維持由複數項目所構之之一項目表,其中上述項目表 之每一項目包括第一、第二以及第三攔位,其中上述第二 欄位保持一最近存取之記憶體區塊之虛擬位址的代表值, 其中上述第一欄位保持在一方向與上述最近存取之記憶體 區塊虛擬相鄰之一記憶體區塊之虛擬位址的代表值,其中 上述第三欄位保持在另一方向與上述最近存取之記憶體區 塊虛擬相鄰之一記憶體區塊之虛擬位址的代表值。 98. 如申請專利範圍第97項所述之資料預取方法, 其中決定上述第一記憶體區塊是否虛擬相鄰於上述第二記 憶體區塊的步驟,更包括: 決定上述第二記憶體區塊之虛擬位址之代表值是否匹 0608-A43067TW/final ' 87 201135460 配於上述項目表之項目之一者的上述第一攔位或者第三攔 位;以及 決定在所匹配之上述項目之上述第二攔位是否匹配於 上述第一記憶體區塊之虛擬位址之代表值。 99. 如申請專利範圍第97項所述之資料預取方法, 其中維持上述項目表的步驟,更包括: 以先進先出的方式’將上述項目推進上述項目表中, 以便回應上述微處理器之一載入/儲存單元所產生之記憶 體存取要求。 100. 如申請專利範圍第97項所述之資料預取方法, 其中上述記憶體區塊之上述虛擬位址的代表值包括上述記 憶體區塊之虛擬位址之一雜湊之位元。 101. 如申請專利範圍第100項所述之資料預取方 法,其中上述記憶體區塊之虛擬位址之上述雜湊之位元係 根據下列演算法則之一雜湊’其中hash[j]表示第j個雜湊 之位元’以及VA[k]表示第k個上述記憶體區塊之虛擬位 址的位元: hash[5]=VA[29]AVA[18]AVA[17]; hash[4]=VA[28]AVA[19]AVA[16]; hash[3]=VA[27]AVA[20]AVA[15]; hash[2]=VA[26]AVA[21]AVA[14]; hash[l]=VA[25]AVA[22]AVA[13]; hash[0]=VA[24]AVA[23;TVA[12]。 102. 一種電腦程式產品,編碼於至少一電腦可讀取 0608-A43067TW/final 88 201135460 媒體之上,並且適用於一計算裝置,上述電腦程式產品包 括: 電腦可讀程式編碼,儲存於上述電腦可讀取媒體, 用以疋義一微處理器,上述電腦可讀程式包括: 第轾式碼,用以定義一快取記憶體裝置;以及 一第二程式碼,用以定義一預取裝置,使得上述預取 裝置用以: 偵測具有一第一記憶體區塊之存取之一樣態,並 且根據上述樣態從上述第一記憶體區塊預取進入快取 線; i視一第二記憶體區塊之一新的存取要求; 決定上述第一記憶體區塊係虛擬鄰近至上述第二 記憶體區塊以及上述樣態,當持續自上述第一記憶體 區塊至上述第二記憶體區塊,預測至與上述具有上述 第二記憶體區塊之新的要求相關之一快取線之一存 取;以及 根據上述樣響應地從上述第二記憶體區塊預取 進入上述快取記憶體之快取線。 0608-A43067TW/fmal 89201135460; VII. Patent application scope: 1.  
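Claims 85 and 101 above specify the virtual-address hash as six XOR equations (the stray "A" characters in the OCR'd text are the XOR operator "^"). The following is a minimal illustrative sketch of that hash, not part of the claimed apparatus; the function name `block_hash` and the plain-integer encoding of the virtual address are assumptions for illustration only:

```python
def block_hash(va: int) -> int:
    """Hash a memory block's virtual address down to 6 bits.

    Implements hash[5]=VA[29]^VA[18]^VA[17] ... hash[0]=VA[24]^VA[23]^VA[12]
    from claims 85/101, where VA[k] is bit k of the block's virtual address.
    """
    bit = lambda k: (va >> k) & 1
    h = [
        bit(24) ^ bit(23) ^ bit(12),  # hash[0]
        bit(25) ^ bit(22) ^ bit(13),  # hash[1]
        bit(26) ^ bit(21) ^ bit(14),  # hash[2]
        bit(27) ^ bit(20) ^ bit(15),  # hash[3]
        bit(28) ^ bit(19) ^ bit(16),  # hash[4]
        bit(29) ^ bit(18) ^ bit(17),  # hash[5]
    ]
    return sum(b << i for i, b in enumerate(h))
```

Such a hash lets the prefetch unit of claims 81/97 compare memory-block virtual addresses in its table using only six stored bits per field instead of the full address, at the cost of occasional false matches between blocks that hash alike.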
A prefetching unit is provided in a microprocessor having a cache memory, comprising: wherein the prefetching unit is configured to receive a plurality of access requests for a plurality of addresses of a memory block, each of which is stored Retrieving one of the addresses corresponding to the memory block, and the address of the access request is non-monotonically increasing or decreasing with time; a storage device; and a control logic And the storage device is coupled to the storage device, wherein when the access request is received, the control logic is configured to: maintain one of the access addresses and a minimum address of the access request in the storage device, and the maximum bit a count value of the change of the address and the minimum address; maintaining a history record of the most recently accessed cache line in the memory block, the most recently accessed cache line being associated with the address of the access request Determining an access direction according to the above count value; determining an access mode according to the historical record; and according to the access mode and along the access direction, Not cache above said history recording instruction prefetching in vivo to the above-described memory block to the cache line has been accessed. 2.  The prefetching unit of claim 1, wherein the control logic is further configured to suspend the prefetching action before the number of recently accessed cache lines in the memory block is greater than a predetermined value. . 3.  The prefetching unit described in claim 2, wherein the upper 0608-A43067TW/fmal 59 201135460 has a predetermined value of at least 9. 4.  The prefetching unit of claim 2, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block. 5.  
The prefetching unit of claim 1, wherein the control logic is configured to: change the count value of the change of the maximum address and the minimum address when the access direction is determined according to the count value. When the difference between the count values is greater than a predetermined value, the above accessor is determined. Toward the system; when the difference between the count value of the change of the minimum address and the count value of the change of the maximum address is greater than the predetermined value, it is determined that the access direction is downward. 6.  The prefetching unit of claim 1, wherein the control logic is further configured to use an absolute value of a difference between a count value of the change of the maximum address and a count value of the change of the minimum address. The above prefetch action is suspended until a predetermined value. 7.  The prefetching unit of claim 1, wherein: the history record comprises a one-dimensional mask, wherein the bit mask is used to indicate the most recently accessed cache line, and the above-mentioned recently accessed The cache line is associated with the address of the memory block; when receiving the access request, the control logic is further configured to: calculate the most recently accessed cache line in the bit mask An intermediate indicator register; and the above-mentioned bit mask on the right side of the above-mentioned intermediate indicator register, the 0608-A43067TW/finaI 60 201135460 N bit of the above-mentioned bit mask on the left side of the intermediate indicator register When the N bit matches, for each of a plurality of different bit bit periods, the count value of one of the matching counters associated with the bit period is increased, where N is in the bit period The number of bits. 8.  
The prefetching unit of claim 1, wherein the control logic is configured to: detect the matching associated with one of the bit periods, in order to determine the access mode according to the bit mask; Whether the difference between the counter and the matching counter associated with the other of the bit periods is greater than a predetermined value; and determining the N-bit of one of the intermediate indicator registers masked by the bit The specified access mode, wherein N is the number of bits of one of the bit periods, and the correlation matching counter of the one of the bit periods has a correlation match with the other of the bit periods The difference between the counters is greater than the above predetermined value. 9.  The prefetching unit of claim 8, wherein the memory block is marked by the bit mask as being not recently stored according to the access mode and along the access direction. Taking the cache line prefetched into the cache memory, the control logic is configured to: along the access direction, assign a search indicator and the access pattern from the N indicator of the intermediate indicator; And when the bit in the access mode on the search indicator indicates an access, prefetch the cache line associated with the bit in the bit mask on the search indicator. 0608-A43067TW/final 61 201135460 1〇. For example, according to the above-mentioned access mode and along the access direction, the pre-acquisition according to the ninth patent scope of the Shenqing patent is marked as being not recently accessed by the above-mentioned bit mask; In the cache memory, the above-mentioned control system: and, according to the access direction, increase/decrease the value of the search indicator; when: add/reduce the above search indexer === prefetch the above The cache line associated with the above-mentioned bits in the above search = ^疋 mask has been increased/decreased. 
, The pre-fetching unit described in the first paragraph of the application of the patent, the above-mentioned control logic is further used to: repeat the above-mentioned increase of the value of the search indicator and the work until the dog condition occurs, wherein the above situation The method includes: a prefetching movement, when the access direction is upward, a bit in the bit mask on the search indicator and a bit in the bit 70 of the bit address associated with the maximum address The distance between the elements is greater than a second predetermined value; and when the access direction is downward, the bit in the bit mask on the search indicator is related to the bit address The distance between the bits in the 7G mask is greater than the second predetermined value. 12. The prefetching unit of claim 7, wherein the control logic is further configured to use the matching counter associated with one of the different bit periods to match the other of the different bit periods. Before the difference between the counters is greater than a predetermined value, the above pre-0608-A43067TW/final · ^ 201135460 action is suspended. 13.  The prefetching unit of claim 1, wherein the above bit period is 3, 4, and 5 bits. 14.  The prefetching unit of claim 1, wherein the control logic discards the cache line when the cache line has appeared in any of the cache memories of the microprocessor. 15.  The prefetching unit of claim 1, wherein the size of the memory block is 4 kilobytes. 16.  The prefetching unit of claim 1, further comprising: a plurality of the storage devices; wherein the control logic is configured to receive an access request, and the address of the access request is not in the storage device One of the new memory blocks is associated, and one of the above storage devices is assigned to the new memory block. 17.  
The prefetching unit of claim 16, wherein the control logic is further configured to clear a count value of the change of the maximum address, a count value of the minimum address change, and one of the storage devices being dispatched. The above history. 18.  A data prefetching method for prefetching data to a cache memory of a microprocessor, the data prefetching method comprising: receiving a plurality of access requests for a plurality of addresses of a memory block, each The access request corresponds to one of the addresses of the memory block, and the address of the access request is non-monotonically increased or decreased with time; when the above storage is received When required, the 0608-A43067TW/final 63 201135460 a maximum and a minimum address in the memory block are maintained, and the count value of the change of the maximum and minimum addresses is calculated; when the access request is received, Maintaining a history record of the most recently accessed cache line in the memory block, wherein the most recently accessed cache line is associated with the address of the access request; determining an access direction according to the count value; Determining an access pattern according to the historical record; and, according to the access mode and along the access direction, indicating that the cache memory has not been indicated by the history record as The access cache line is prefetched into the above memory block. 19.  The data prefetching method as described in claim 18, further comprising suspending the prefetching action before the number of recently accessed cache lines in the memory block is greater than a predetermined value. 20.  For example, the data pre-fetching method described in claim 19, wherein the predetermined value is at least 9. twenty one.  The method for prefetching data according to claim 19, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block. twenty two.  
The data prefetching method of claim 18, wherein determining the access direction according to the foregoing counting value further comprises: calculating a change of the count value of the change of the maximum address and the minimum address When the difference between the values is greater than a predetermined value, the access direction is determined to be upward; and the count value of the change of the minimum address and the change of the maximum address is 0608-A43067TW/fmal 64 201135460 The direction between the i and the direction is downward. When the value is greater than the above-mentioned predetermined value, it is determined that the above-mentioned storage patents include the data pre-fetching method described in the above-mentioned maximum (four) brothers, and the count value of the change of the changed count value and the minimum bit The above-mentioned prefetch action is slowed down. Before the absolute value of Jun, greater than - the established value, temporarily one of the patent scope - the data prefetch method, the above historical record _ out of the above-mentioned recently accessed position 70 mask, the above-mentioned bit mask for the finger line And the cached block 1 is the most recently accessed cache. When the access is accessed, the method further includes: ten calculating the above-mentioned top line in the bit mask: the intermediate indicator register And the access speed of the ν2Ί&quot;4 intermediate indicator register on the left side of the above-mentioned bit mask 兀 /, the upper speed intermediate indicator register right:, the Ν bit match, the plural For each of the different only = 兀 mask bitpe - increase the above-mentioned bit U (d_Ct matches the count value of the counter, one of the relevant numbers in 1), and ?N is the upper half of the upper light element period. · The information mentioned in item 24 of the patent application, in order to determine the above access samples according to the above-mentioned bits. 
And determining, by the other one of the above bit periods related to the above-mentioned bit period, that the matching count is greater than a predetermined value; The difference of 3 is 0608-A43067TW/flnal 65 201135460. The above-mentioned access mode specified by one of the N-bits of the intermediate indicator register masked by the above-mentioned bit mask is determined, where N is the above-mentioned bit period. a number of a bit in the middle, wherein the difference between the correlation matching counter of the one of the bit periods and the correlation matching counter of the other of the bit periods is greater than the predetermined value The difference between all other clear bit periods. 26.  The method for prefetching data according to claim 25, wherein, in order to according to the access mode and along the access direction, the memory block is marked by the bit mask as not being recently The access cache line is prefetched into the cache memory, and the control logic is configured to: assign a search indicator along the access direction and the access mode from the N indicator of the intermediate indicator And when the bit in the access mode on the search indicator indicates an access, prefetch the cache line associated with the bit in the bit mask on the search indicator. 27.  The data prefetching method of claim 26, wherein the memory block is marked by the bit mask as being recently not yet used according to the access mode and along the access direction. The cache line of the access is pre-fetched into the cache memory, and further includes: increasing/decreasing a value of the search indicator according to the access direction; and the foregoing storing on the search indicator after increasing/decreasing The bit in the sampled state indicates an access line prefetching the cache line associated with the bit in the bit mask on the search indicator that has been incremented/decreased. 28.  
For example, the data pre-fetching method described in claim 27, 0608-A43067TW/fmal 66 201135460 further includes: ,, the value of the finder and the pre-fetching condition, wherein the above conditions include: When the check direction is upward, the distance between the bit 70 in the mask above the search index and the bit located in the mask is greater than the distance between the bits in the mask; And when the access direction is downward, the distance between the bit of the mask of the vertical mask and the bit of the third mask of the minimum address is greater than the above The second is depreciating. Hey. For example, in the scope of claim 24, the difference between the above-mentioned matching counters related to the other matching period 兀 period associated with the method of the above-mentioned different bit periods (four) is greater than -岐Previously, the above prefetch action was suspended. . 3. In the data pre-fetching method described in item 18 of the scope of application, the above-mentioned bit period is 3, 4 and 5 bits. As described in item 18 of the patent application scope, the cache line has appeared in the above-mentioned microprocessor - cached the hidden body %, and abandoned the pre-fetching of the above cache line. 32.  For example, the data mentioned in the application for patent scope 预18 is pre-existing. The size of the above memory block is 4 kilobytes. ,, 33.  τ kinds of computer program products, compiled and sold on at least _ computer readable media' and applicable to - computing devices, the above computer program products include.   One:: readable program code? 
The computer readable medium 201135460 is configured to define a prefetch unit in a microprocessor having a cache memory, wherein the computer readable program comprises: wherein the prefetch unit is And a plurality of access requests for receiving a plurality of addresses of a memory block, each access request corresponding to one of the addresses of the memory block, and the address of the access request is over time The function is non-monotonically increased or decreased; a first code for defining a storage device; and a second code for defining a control logic coupled to the storage device When the access request is received, the control logic is configured to: maintain the one of the access addresses and a minimum address maintained by the storage device, and calculate the change of the maximum and minimum addresses Counting a value; maintaining a history record of the most recently accessed cache line in the memory block by using the memory block; determining an access direction according to the count value; The historical record determines an access pattern; and pre-fetching the cache line in the cache memory that has not been indicated by the history record as being accessed according to the access mode and along the access direction In the memory block. 34.  The computer program product of claim 33, wherein the at least one computer readable medium is selected from a disc, tape or other magnetic, optical or electronic storage medium and a network, line, wireless or Other communication media. 35.  
A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: receiving an access request to an address within a memory block; setting a bit associated with a cache line in a bit mask, wherein the cache line is implicated by the address within the memory block; after receiving the access request, incrementing a count of a total access counter; when the address is greater than the value of a maximum pointer register, updating the maximum pointer register with the address and incrementing a count of a maximum-change counter; when the address is less than the value of a minimum pointer register, updating the minimum pointer register with the address and incrementing a count of a minimum-change counter; computing a middle pointer register as the average of the maximum and minimum pointer registers; for each of a plurality of distinct bit periods, when the N bits of the bit mask to the left of the middle pointer register match the N bits of the bit mask to the right of the middle pointer register, incrementing a count of a match counter associated with the bit period, where N is the number of bits in the bit period; determining whether a condition exists, wherein the condition includes: (A) the total access counter is greater than a first predetermined value; (B) the absolute value of the difference between the maximum-change counter and the minimum-change counter is greater than a second predetermined value; and (C) the absolute value of the difference between the count of one of the match counters and the count of each of the others is greater than a third predetermined value; and, when the condition exists: determining that the access direction is upward when the maximum-change counter is greater than the minimum-change counter, and that the access direction is downward when the maximum-change counter is less than the minimum-change counter; determining the access pattern to be the N bits of the bit mask to one side of the middle pointer register, where N is the number of bits of the bit period associated with the match counter having the greatest count; and prefetching, according to the determined access direction and access pattern, cache lines of the memory block into the cache memory.
36. The data prefetching method of claim 35, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern comprises: (1) initializing, in the access direction, a search pointer and the access pattern at an offset of N from the middle pointer; (2) determining whether a second condition exists, wherein the second condition includes: (D) the bit of the access pattern at the search pointer is set; (E) the bit of the bit mask at the search pointer is clear; and (F) in the access direction, the difference between the maximum/minimum pointer and the bit of the bit mask at the search pointer is less than a fourth predetermined value; and (3) when the second condition exists, prefetching the cache line associated with the bit of the bit mask at the search pointer.
37. The data prefetching method of claim 36, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: after determining whether the second condition exists, incrementing/decrementing the value of the search pointer according to the access direction; and repeating steps (2) and (3).
38. The data prefetching method of claim 37, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: stopping the repeating when condition (F) is false.
39. The data prefetching method of claim 37, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: stopping the repeating when all the bits of the bit mask have been tested.
40. A microprocessor comprising: a plurality of cores; a cache memory, shared by the cores, for receiving a plurality of access requests to addresses within a memory block, each access request addressing one of the addresses of the memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time; and a prefetch unit configured to: monitor the access requests and maintain a largest address and a smallest address within the memory block, together with counts of changes to the largest and smallest addresses; determine an access direction according to the counts; and prefetch into the cache memory, in the access direction, cache lines of the memory block that are missing from the cache memory.
41.
The microprocessor of claim 40, wherein the prefetch unit is further configured to: maintain a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determine an access pattern according to the history; and prefetch into the cache memory, according to the access pattern and in the access direction, cache lines of the memory block which the history indicates have not been recently accessed and which are missing from the cache memory.
42. A microprocessor comprising: a first-level cache memory; a second-level cache memory; and a prefetch unit configured to: detect a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetch a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receive from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determine one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and cause the one or more cache lines to be prefetched into the first-level cache memory.
43. The microprocessor of claim 42, wherein: to detect the direction and the pattern of the recent access requests presented to the second-level cache memory, the prefetch unit is configured to detect the direction and the pattern within a memory block, the memory block being a small subset of the memory range accessible by the microprocessor; and, to determine the one or more cache lines indicated by the pattern beyond the associated cache line in the direction, the prefetch unit is configured to: locate the pattern within the memory block such that the address falls within the pattern; and search in the direction, starting at the address, until a cache line indicated by the pattern is encountered.
44. The microprocessor of claim 43, wherein: the pattern comprises a sequence of cache lines; and, to locate the pattern within the memory block such that the address falls within the pattern, the prefetch unit is configured to shift the pattern through the memory block by the sequence.
45. The microprocessor of claim 43, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory are non-monotonically increasing or decreasing as a function of time.
46. The microprocessor of claim 45, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory may be non-sequential.
47. The microprocessor of claim 42, further comprising: a plurality of cores; wherein the second-level cache memory and the prefetch unit are shared by the cores; and wherein each of the cores comprises a distinct instantiation of the first-level cache memory.
48. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit is configured to provide the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory is configured to request the one or more cache lines from the second-level cache memory.
49. The microprocessor of claim 48, wherein the first-level cache memory includes a queue for storing the addresses received from the prefetch unit.
50. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit requests the one or more cache lines from a bus interface unit of the microprocessor and then provides the requested cache lines to the first-level cache memory.
51. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit requests the one or more cache lines from the second-level cache memory.
52. The microprocessor of claim 51, wherein the prefetch unit then provides the requested cache lines to the first-level cache memory.
53. The microprocessor of claim 51, wherein the second-level cache memory then provides the requested cache lines to the first-level cache memory.
54. The microprocessor of claim 42, wherein, to detect the direction and the pattern, the prefetch unit is configured to: maintain, as the recent access requests are received, a largest address and a smallest address within a memory block, together with counts of changes to the largest and smallest addresses; maintain, as the recent access requests are received, a history of recently accessed cache lines implicated by the access addresses within the memory block; determine the direction according to the counts; and determine the pattern according to the history.
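As an informal illustration (not part of the patent text), the bookkeeping recited in claims 54–55 — tracking the largest and smallest access addresses within a block, counting changes to each, and deriving the predominant direction from the counts — can be sketched as follows. All names and the threshold value here are illustrative assumptions, not terminology from the claims:

```python
class BoundingBoxTracker:
    """Sketch of the min/max address tracking described in claims 54-55."""

    def __init__(self):
        self.max_addr = None   # largest address seen in the block
        self.min_addr = None   # smallest address seen in the block
        self.max_changes = 0   # count of changes to the largest address
        self.min_changes = 0   # count of changes to the smallest address

    def access(self, addr):
        """Record one access address within the memory block."""
        if self.max_addr is None:
            self.max_addr = self.min_addr = addr
            return
        if addr > self.max_addr:
            self.max_addr = addr
            self.max_changes += 1
        elif addr < self.min_addr:
            self.min_addr = addr
            self.min_changes += 1

    def direction(self, threshold):
        """Claim 55: direction is up/down when one change count exceeds
        the other by more than a predetermined value; else undecided."""
        if self.max_changes - self.min_changes > threshold:
            return "up"
        if self.min_changes - self.max_changes > threshold:
            return "down"
        return None
```

For example, the non-monotonic access sequence 10, 12, 11, 14, 16 advances the maximum pointer three times and the minimum pointer never, so the predominant direction is upward even though the raw addresses are not strictly increasing.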
55. The microprocessor of claim 54, wherein determining the direction according to the counts comprises: when the difference between the count of changes to the largest address and the count of changes to the smallest address is greater than a predetermined value, determining that the direction is upward; and when the difference between the count of changes to the smallest address and the count of changes to the largest address is greater than the predetermined value, determining that the direction is downward.
56. The microprocessor of claim 54, wherein the history comprises a bit mask whose bits indicate the recently accessed cache lines of the memory block implicated by the access request addresses, and wherein, to determine the pattern, the prefetch unit is configured to: compute a middle pointer register of the bit mask; and, for each of a plurality of distinct bit periods, when the N bits of the bit mask to the left of the middle pointer register match the N bits of the bit mask to the right of the middle pointer register, increment a count of a distinct match counter associated with the bit period, where N is the number of bits in the bit period.
57. The microprocessor of claim 56, wherein, to determine the pattern based on the bit mask, the prefetch unit is further configured to: determine whether the absolute value of the difference between the count of the match counter associated with one of the bit periods and the count of the match counter associated with each of the other bit periods is greater than a predetermined value; and, when so, determine the pattern to be the N bits of the bit mask to one side of the middle pointer register, where N is the number of bits in the one bit period whose associated match counter differs from the match counters of the other bit periods by more than the predetermined value.
58. A data prefetching method for prefetching data into a first-level cache memory of a microprocessor having a second-level cache memory, the data prefetching method comprising: detecting a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receiving from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determining one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and causing the one or more cache lines to be prefetched into the first-level cache memory.
59.
The data prefetching method of claim 58, wherein: detecting the direction and the pattern of the recent access requests presented to the second-level cache memory comprises detecting the direction and the pattern within a memory block, the memory block being a small subset of the memory range accessible by the microprocessor; and determining the one or more cache lines indicated by the pattern beyond the associated cache line in the direction comprises: locating the pattern within the memory block such that the address falls within the pattern; and searching in the direction, starting at the address, until a cache line indicated by the pattern is encountered.
60. The data prefetching method of claim 59, wherein the pattern comprises a sequence of cache lines, and wherein locating the pattern within the memory block such that the address falls within the pattern comprises shifting the pattern through the memory block by the sequence.
61. The data prefetching method of claim 59, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory are non-monotonically increasing or decreasing as a function of time.
62. The data prefetching method of claim 61, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory may be non-sequential.
63. The data prefetching method of claim 58, wherein the microprocessor further comprises a plurality of cores, wherein the second-level cache memory and the prefetch unit are shared by the cores, and wherein each of the cores comprises a distinct instantiation of the first-level cache memory.
64. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises providing, by a prefetch unit of the microprocessor, the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory requests the one or more cache lines from the second-level cache memory.
65. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises requesting, by a prefetch unit of the microprocessor, the one or more cache lines from a bus interface unit of the microprocessor and then providing the requested cache lines to the first-level cache memory.
66. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises requesting, by the prefetch unit, the one or more cache lines from the second-level cache memory.
67. The data prefetching method of claim 66, wherein causing the one or more cache lines to be prefetched into the first-level cache memory further comprises providing, by the prefetch unit, the requested cache lines to the first-level cache memory.
68. The data prefetching method of claim 66, further comprising providing, by the second-level cache memory, the requested cache lines to the first-level cache memory.
69. A computer program product encoded on at least one computer readable medium for use with a computing device, the computer program product comprising: computer readable program code embodied in the medium for defining a microprocessor, the computer readable program code comprising: a first code for defining a first-level cache memory; a second code for defining a second-level cache memory; and a third code for defining a prefetch unit, wherein the prefetch unit is configured to: detect a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetch a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receive from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determine one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and cause the one or more cache lines to be prefetched into the first-level cache memory.
70. The computer program product of claim 69, wherein the at least one computer readable medium is selected from a disk, tape or other magnetic, optical or electronic storage medium and a network, wired, wireless or other communications medium.
71.
A microprocessor comprising: a cache memory; and a prefetch unit configured to: detect a pattern of memory access requests to a first memory block and prefetch cache lines from the first memory block into the cache memory according to the pattern; monitor a new memory access request to a second memory block; determine whether the first memory block is virtually adjacent to the second memory block and, when the pattern is continued from the first memory block into the second memory block, whether the pattern predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block; and responsively prefetch cache lines from the second memory block into the cache memory according to the pattern.
72. The microprocessor of claim 71, wherein the size of the first and second memory blocks corresponds to the size of a physical memory page.
73. The microprocessor of claim 71, wherein the microprocessor comprises a second-level cache memory, and wherein the new memory access request comprises a request from a first-level cache memory of the microprocessor to the second-level cache memory to allocate the cache line of the second memory block.
74. The microprocessor of claim 71, wherein: to detect the pattern of the memory access requests to the first memory block, the prefetch unit is configured to detect a direction of the memory access requests; and, to determine whether the first memory block is virtually adjacent to the second memory block, the prefetch unit is configured to determine whether the first memory block is virtually adjacent to the second memory block in the direction.
75. The microprocessor of claim 74, wherein the addresses of the memory access requests to the first memory block are non-monotonically increasing or decreasing as a function of time.
76. The microprocessor of claim 74, wherein, to determine whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new memory access request to the second memory block, the prefetch unit is configured to determine whether the pattern, when continued in the direction from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
77. The microprocessor of claim 74, wherein, to prefetch cache lines from the second memory block into the cache memory according to the pattern, the prefetch unit is configured to prefetch cache lines from the second memory block into the cache memory according to the pattern and in the direction.
78. The microprocessor of claim 71, wherein the pattern comprises an ordering of a plurality of cache lines of the first memory block, and wherein, to determine whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block, the prefetch unit is configured to determine, according to the ordering of the cache lines, whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
79. The microprocessor of claim 71, wherein the prefetch unit is further configured to wait to prefetch cache lines from the second memory block into the cache memory according to the pattern until determining that the pattern, when continued from the first memory block into the second memory block, predicts the cache lines implicated by at least a predetermined number of subsequent memory access requests to the second memory block.
80. The microprocessor of claim 71, wherein the predetermined number of subsequent memory access requests is two.
81. The microprocessor of claim 71, wherein the prefetch unit is further configured to: maintain a list of entries, wherein each entry of the list includes first, second and third fields, wherein the second field holds a representative value of the virtual address of a recently accessed memory block, wherein the first field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in one direction, and wherein the third field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in the other direction.
82. The microprocessor of claim 81, wherein, to determine whether the first memory block is virtually adjacent to the second memory block, the prefetch unit is configured to: determine whether the representative value of the virtual address of the second memory block matches the first or third field of one of the entries of the list; and determine whether the second field of the matched entry matches the representative value of the virtual address of the first memory block.
83.
The microprocessor of claim 81, wherein, to maintain the list, the prefetch unit is configured to push entries into the list in a first-in-first-out manner in response to memory access requests generated by a load/store unit of the microprocessor.
84. The microprocessor of claim 81, wherein the representative value of the virtual address of a memory block comprises hashed bits of the virtual address of the memory block.
85. The microprocessor of claim 84, wherein the hashed bits of the virtual address of the memory block are hashed according to the following algorithm, wherein hash[j] denotes the j-th hashed bit and VA[k] denotes bit k of the virtual address of the memory block:
hash[5]=VA[29]^VA[18]^VA[17];
hash[4]=VA[28]^VA[19]^VA[16];
hash[3]=VA[27]^VA[20]^VA[15];
hash[2]=VA[26]^VA[21]^VA[14];
hash[1]=VA[25]^VA[22]^VA[13];
hash[0]=VA[24]^VA[23]^VA[12].
86. The microprocessor of claim 71, further comprising a plurality of cores, wherein the cache memory and the prefetch unit are shared by the cores.
87. A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: detecting a pattern of memory access requests to a first memory block, and prefetching cache lines from the first memory block into the cache memory according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block and, when the pattern is continued from the first memory block into the second memory block, whether the pattern predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block; and prefetching cache lines from the second memory block into the cache memory according to the pattern, in response to the determining step.
88. The data prefetching method of claim 87, wherein the size of the first and second memory blocks corresponds to the size of a physical memory page.
89. The data prefetching method of claim 87, wherein the microprocessor comprises a second-level cache memory, and wherein the new memory access request comprises a request from a first-level cache memory of the microprocessor to the second-level cache memory to allocate the cache line of the second memory block.
90. The data prefetching method of claim 87, wherein detecting the pattern of the memory access requests to the first memory block further comprises detecting a direction of the memory access requests, and wherein determining whether the first memory block is virtually adjacent to the second memory block further comprises determining whether the first memory block is virtually adjacent to the second memory block in the direction.
91. The data prefetching method of claim 90, wherein the addresses of the memory access requests to the first memory block are non-monotonically increasing or decreasing as a function of time.
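For illustration only, the virtual-address hash recited in claim 85 (and again in claim 101) maps each memory-block virtual address to a 6-bit tag by XORing three address bits per tag bit; the `^` operator below is the XOR that the claims denote. The function name is an assumption, but the bit positions come directly from the claimed algorithm:

```python
def hash_block_va(va: int) -> int:
    """Compute the 6-bit hashed representative value of a memory-block
    virtual address, per the bit equations in claims 85/101:
    hash[j] = VA[a] ^ VA[b] ^ VA[c] for the listed (a, b, c) triples."""
    def bit(k: int) -> int:
        return (va >> k) & 1

    # (a, b, c) per hash bit, index 0 .. 5 as given in the claims.
    triples = [
        (24, 23, 12),  # hash[0]
        (25, 22, 13),  # hash[1]
        (26, 21, 14),  # hash[2]
        (27, 20, 15),  # hash[3]
        (28, 19, 16),  # hash[4]
        (29, 18, 17),  # hash[5]
    ]
    h = 0
    for j, (a, b, c) in enumerate(triples):
        h |= (bit(a) ^ bit(b) ^ bit(c)) << j
    return h
```

Because only address bits 12 through 29 participate, the hash is invariant to the page offset (bits 0 to 11), which is consistent with the representative value identifying a memory block rather than a byte address.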
The data prefetching method of claim 90, wherein determining whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new memory access request to the second memory block comprises: determining whether the pattern, when continued in the direction from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
93. The data prefetching method of claim 90, wherein prefetching cache lines from the second memory block into the cache memory according to the pattern comprises prefetching cache lines from the second memory block into the cache memory according to the pattern and in the direction.
94. The data prefetching method of claim 87, wherein the pattern comprises an ordering of a plurality of cache lines of the first memory block, and wherein determining whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block comprises determining, according to the ordering of the cache lines, whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
95. The data prefetching method of claim 87, further comprising waiting to prefetch cache lines from the second memory block into the cache memory according to the pattern until determining that the pattern, when continued from the first memory block into the second memory block, predicts the cache lines implicated by at least a predetermined number of subsequent memory access requests to the second memory block.
96. The data prefetching method of claim 87, wherein the predetermined number of subsequent memory access requests is two.
97. The data prefetching method of claim 87, further comprising: maintaining a list of entries, wherein each entry of the list includes first, second and third fields, wherein the second field holds a representative value of the virtual address of a recently accessed memory block, wherein the first field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in one direction, and wherein the third field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in the other direction.
98. The data prefetching method of claim 97, wherein determining whether the first memory block is virtually adjacent to the second memory block further comprises: determining whether the representative value of the virtual address of the second memory block matches the first or third field of one of the entries of the list; and determining whether the second field of the matched entry matches the representative value of the virtual address of the first memory block.
99. The data prefetching method of claim 97, wherein maintaining the list further comprises pushing entries into the list in a first-in-first-out manner in response to memory access requests generated by a load/store unit of the microprocessor.
100. The data prefetching method of claim 97, wherein the representative value of the virtual address of a memory block comprises hashed bits of the virtual address of the memory block.
101. The data prefetching method of claim 100, wherein the hashed bits of the virtual address of the memory block are hashed according to the following algorithm, wherein hash[j] denotes the j-th hashed bit and VA[k] denotes bit k of the virtual address of the memory block:
hash[5]=VA[29]^VA[18]^VA[17];
hash[4]=VA[28]^VA[19]^VA[16];
hash[3]=VA[27]^VA[20]^VA[15];
hash[2]=VA[26]^VA[21]^VA[14];
hash[1]=VA[25]^VA[22]^VA[13];
hash[0]=VA[24]^VA[23]^VA[12].
102.
A computer program product encoded on at least one computer readable medium for use with a computing device, the computer program product comprising: computer readable program code embodied in the medium, the computer readable program code comprising: a first code for defining a cache memory; and a second code for defining a prefetch unit, wherein the prefetch unit is configured to: detect a pattern of memory access requests to a first memory block and prefetch cache lines from the first memory block according to the pattern; monitor a new access request to a second memory block; determine whether the first memory block is virtually adjacent to the second memory block and whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new access request to the second memory block; and responsively prefetch cache lines from the second memory block into the cache memory.
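As an informal reading of the pattern-search loop recited in claims 36–38 (this is not the patent's implementation, and the function name, argument names, and fence semantics are illustrative assumptions): starting at the middle pointer, the search walks through the block in the predominant direction, replicating the N-bit pattern, and selects cache lines whose pattern bit is set (condition D) but whose block-mask bit is clear (condition E), stopping once the search pointer has run a predetermined distance past the max/min pointer (condition F):

```python
def candidate_lines(block_mask, pattern, mid, period, direction,
                    fence=None, limit=None):
    """Return cache-line indices within the block to prefetch.

    block_mask : per-cache-line bits, set if the line was recently accessed
    pattern    : the N-bit access pattern, replicated with the given period
    mid        : middle pointer (index where pattern replication is anchored)
    direction  : "up" or "down", the predominant access direction
    fence      : max/min pointer; search stops `limit` lines past it
    """
    step = 1 if direction == "up" else -1
    out = []
    ptr = mid
    while 0 <= ptr < len(block_mask):
        # Condition (F): stop once past the fence by the predetermined amount.
        if fence is not None and (ptr - fence) * step >= limit:
            break
        # Conditions (D) and (E): pattern predicts the line, mask says
        # it has not been accessed yet.
        if pattern[(ptr - mid) % period] and not block_mask[ptr]:
            out.append(ptr)
        ptr += step
    return out
```

For example, with an every-other-line pattern anchored at line 4 of an 8-line block, the upward search skips the already-accessed line 4 and selects line 6.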
TW100110731A 2010-03-29 2011-03-29 Prefetcher,method of prefetch data,computer program product and microprocessor TWI506434B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US31859410P 2010-03-29 2010-03-29
US13/033,765 US8762649B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher
US13/033,848 US8719510B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher with reduced warm-up penalty on memory block crossings
US13/033,809 US8645631B2 (en) 2010-03-29 2011-02-24 Combined L2 cache and L1D cache prefetcher

Publications (2)

Publication Number Publication Date
TW201135460A true TW201135460A (en) 2011-10-16
TWI506434B TWI506434B (en) 2015-11-01

Family

ID=44490596

Family Applications (5)

Application Number Title Priority Date Filing Date
TW100110731A TWI506434B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW103128257A TWI519955B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data and computer program product
TW104118873A TWI534621B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW104118874A TWI547803B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW105108032A TWI574155B (en) 2010-03-29 2011-03-29 Method of prefetch data, computer program product and microprocessor

Family Applications After (4)

Application Number Title Priority Date Filing Date
TW103128257A TWI519955B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data and computer program product
TW104118873A TWI534621B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW104118874A TWI547803B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW105108032A TWI574155B (en) 2010-03-29 2011-03-29 Method of prefetch data, computer program product and microprocessor

Country Status (2)

Country Link
CN (4) CN105183663B (en)
TW (5) TWI506434B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI488109B (en) * 2011-12-13 2015-06-11 Intel Corp Method and apparatus to process keccak secure hashing algorithm
TWI499975B (en) * 2011-12-07 2015-09-11 Apple Inc Next fetch predictor training with hysteresis
TWI501156B (en) * 2011-12-09 2015-09-21 Nvidia Corp Multi-channel time slice groups
TWI560547B (en) * 2014-10-20 2016-12-01 Via Tech Inc Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent
TWI596479B (en) * 2014-12-14 2017-08-21 上海兆芯集成電路有限公司 Processor with data prefetcher and method thereof
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US10324725B2 (en) 2012-12-27 2019-06-18 Nvidia Corporation Fault detection in instruction translations

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133780B (en) * 2013-05-02 2017-04-05 华为技术有限公司 A kind of cross-page forecasting method, apparatus and system
CN105653199B (en) * 2014-11-14 2018-12-14 群联电子股份有限公司 Method for reading data, memory storage apparatus and memorizer control circuit unit
US10152421B2 (en) * 2015-11-23 2018-12-11 Intel Corporation Instruction and logic for cache control operations
CN106919367B (en) * 2016-04-20 2019-05-07 上海兆芯集成电路有限公司 Detect the processor and method of modification program code
US10579522B2 (en) * 2016-09-13 2020-03-03 Andes Technology Corporation Method and device for accessing a cache memory
US10353601B2 (en) * 2016-11-28 2019-07-16 Arm Limited Data movement engine
US10725685B2 (en) 2017-01-19 2020-07-28 International Business Machines Corporation Load logical and shift guarded instruction
US10496311B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Run-time instrumentation of guarded storage event processing
US10452288B2 (en) 2017-01-19 2019-10-22 International Business Machines Corporation Identifying processor attributes based on detecting a guarded storage event
US10579377B2 (en) 2017-01-19 2020-03-03 International Business Machines Corporation Guarded storage event handling during transactional execution
US10496292B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Saving/restoring guarded storage controls in a virtualized environment
US10732858B2 (en) 2017-01-19 2020-08-04 International Business Machines Corporation Loading and storing controls regulating the operation of a guarded storage facility
CN109857786B (en) * 2018-12-19 2020-10-30 成都四方伟业软件股份有限公司 Page data filling method and device
CN111797052B (en) * 2020-07-01 2023-11-21 上海兆芯集成电路股份有限公司 System single chip and system memory acceleration access method
KR102253362B1 (en) * 2020-09-22 2021-05-20 쿠팡 주식회사 Electronic apparatus and information providing method using the same
CN112416437B (en) * 2020-12-02 2023-04-21 海光信息技术股份有限公司 Information processing method, information processing device and electronic equipment
WO2022233391A1 (en) * 2021-05-04 2022-11-10 Huawei Technologies Co., Ltd. Smart data placement on hierarchical storage
CN114116529A (en) * 2021-12-01 2022-03-01 上海兆芯集成电路有限公司 Fast loading device and data caching method

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5003471A (en) * 1988-09-01 1991-03-26 Gibson Glenn A Windowed programmable data transferring apparatus which uses a selective number of address offset registers and synchronizes memory access to buffer
SE515718C2 (en) * 1994-10-17 2001-10-01 Ericsson Telefon Ab L M Systems and methods for processing memory data and communication systems
US6484239B1 (en) * 1997-12-29 2002-11-19 Intel Corporation Prefetch queue
US6810466B2 (en) * 2001-10-23 2004-10-26 Ip-First, Llc Microprocessor and method for performing selective prefetch based on bus activity level
JP4067887B2 (en) * 2002-06-28 2008-03-26 富士通株式会社 Arithmetic processing device for performing prefetch, information processing device and control method thereof
US7310722B2 (en) * 2003-12-18 2007-12-18 Nvidia Corporation Across-thread out of order instruction dispatch in a multithreaded graphics processor
US7237065B2 (en) * 2005-05-24 2007-06-26 Texas Instruments Incorporated Configurable cache system depending on instruction type
US20070186050A1 (en) * 2006-02-03 2007-08-09 International Business Machines Corporation Self prefetching L2 cache mechanism for data lines
JP4692678B2 (en) * 2007-06-19 2011-06-01 富士通株式会社 Information processing device
US8103832B2 (en) * 2007-06-26 2012-01-24 International Business Machines Corporation Method and apparatus of prefetching streams of varying prefetch depth
CN100449481C (en) * 2007-06-29 2009-01-07 东南大学 Storage control circuit with multiple-passage instruction pre-fetching function
US8161243B1 (en) * 2007-09-28 2012-04-17 Intel Corporation Address translation caching and I/O cache performance improvement in virtualized environments
US7890702B2 (en) * 2007-11-26 2011-02-15 Advanced Micro Devices, Inc. Prefetch instruction extensions
US8140768B2 (en) * 2008-02-01 2012-03-20 International Business Machines Corporation Jump starting prefetch streams across page boundaries
JP2009230374A (en) * 2008-03-21 2009-10-08 Fujitsu Ltd Information processor, program, and instruction sequence generation method
US7958317B2 (en) * 2008-08-04 2011-06-07 International Business Machines Corporation Cache directed sequential prefetch
US8402279B2 (en) * 2008-09-09 2013-03-19 Via Technologies, Inc. Apparatus and method for updating set of limited access model specific registers in a microprocessor
US9032151B2 (en) * 2008-09-15 2015-05-12 Microsoft Technology Licensing, Llc Method and system for ensuring reliability of cache data and metadata subsequent to a reboot
CN101887360A (en) * 2009-07-10 2010-11-17 威盛电子股份有限公司 The data pre-acquisition machine of microprocessor and method
CN101667159B (en) * 2009-09-15 2012-06-27 威盛电子股份有限公司 High speed cache system and method of trb

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI499975B (en) * 2011-12-07 2015-09-11 Apple Inc Next fetch predictor training with hysteresis
TWI501156B (en) * 2011-12-09 2015-09-21 Nvidia Corp Multi-channel time slice groups
TWI488109B (en) * 2011-12-13 2015-06-11 Intel Corp Method and apparatus to process keccak secure hashing algorithm
TWI552071B (en) * 2011-12-13 2016-10-01 英特爾公司 Method and apparatus to process keccak secure hashing algorithm
US9772845B2 (en) 2011-12-13 2017-09-26 Intel Corporation Method and apparatus to process KECCAK secure hashing algorithm
US10691458B2 (en) 2011-12-13 2020-06-23 Intel Corporation Method and apparatus to process KECCAK secure hashing algorithm
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US10324725B2 (en) 2012-12-27 2019-06-18 Nvidia Corporation Fault detection in instruction translations
TWI560547B (en) * 2014-10-20 2016-12-01 Via Tech Inc Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent
TWI596479B (en) * 2014-12-14 2017-08-21 上海兆芯集成電路有限公司 Processor with data prefetcher and method thereof

Also Published As

Publication number Publication date
CN105183663B (en) 2018-11-27
TWI547803B (en) 2016-09-01
TW201624289A (en) 2016-07-01
TWI574155B (en) 2017-03-11
CN104636274B (en) 2018-01-26
TWI519955B (en) 2016-02-01
CN104615548A (en) 2015-05-13
TWI534621B (en) 2016-05-21
CN104636274A (en) 2015-05-20
CN102169429B (en) 2016-06-29
CN105183663A (en) 2015-12-23
TW201447581A (en) 2014-12-16
TW201535119A (en) 2015-09-16
CN102169429A (en) 2011-08-31
TW201535118A (en) 2015-09-16
TWI506434B (en) 2015-11-01
CN104615548B (en) 2018-08-31

Similar Documents

Publication Publication Date Title
TW201135460A (en) Prefetcher, method of prefetch data, computer program product and microprocessor
US8880807B2 (en) Bounding box prefetcher
US8645631B2 (en) Combined L2 cache and L1D cache prefetcher
US9223705B2 (en) Cache access arbitration for prefetch requests
US9015422B2 (en) Access map-pattern match based prefetch unit for a processor
US8583894B2 (en) Hybrid prefetch method and apparatus
US8719510B2 (en) Bounding box prefetcher with reduced warm-up penalty on memory block crossings
TWI307465B (en) System, method and storage medium for memory management
US20140108740A1 (en) Prefetch throttling
US9304919B2 (en) Detecting multiple stride sequences for prefetching
US20140129772A1 (en) Prefetching to a cache based on buffer fullness
US9256544B2 (en) Way preparation for accessing a cache
JP6701380B2 (en) Up/down prefetcher
US9223714B2 (en) Instruction boundary prediction for variable length instruction set
CN115964309A (en) Prefetching
US20140115257A1 (en) Prefetching using branch information from an instruction cache
US11907722B2 (en) Methods and apparatus for storing prefetch metadata
CN117897690A (en) Notifying criticality of cache policies