TW201135460A - Prefetcher, method of prefetch data, computer program product and microprocessor - Google Patents


Info

Publication number
TW201135460A
Authority
TW
Taiwan
Prior art keywords
memory
memory block
cache
access
bit
Prior art date
Application number
TW100110731A
Other languages
Chinese (zh)
Other versions
TWI506434B (en)
Inventor
Rodney E Hooker
John Michael Greer
Original Assignee
Via Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/033,765 external-priority patent/US8762649B2/en
Priority claimed from US13/033,848 external-priority patent/US8719510B2/en
Priority claimed from US13/033,809 external-priority patent/US8645631B2/en
Application filed by Via Tech Inc filed Critical Via Tech Inc
Publication of TW201135460A publication Critical patent/TW201135460A/en
Application granted granted Critical
Publication of TWI506434B publication Critical patent/TWI506434B/en

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data prefetcher in a microprocessor having a cache memory is disclosed. The data prefetcher is configured to receive a plurality of memory accesses, each to an address within a memory block, wherein the memory access addresses are non-monotonically increasing or decreasing as a function of time. The data prefetcher includes a storage element and control logic coupled to the storage element. As the memory accesses are received, the control logic maintains within the storage element a largest address and a smallest address of the accesses, counts of changes to the largest and smallest addresses, and a history of the recently accessed cache lines implicated by the access addresses within the memory block. The data prefetcher determines a predominant access direction based on the counts, determines a predominant access pattern based on the history, and prefetches into the cache memory, in the predominant access direction and according to the predominant access pattern, cache lines of the memory block that the history indicates have not been recently accessed.
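The per-block bookkeeping the abstract describes can be sketched in software. This is an illustrative model only, not the patented hardware: the 64-line block size and the threshold of 2 for declaring a direction are assumptions drawn from the embodiments described later.

```python
# Illustrative model of the per-block state the abstract describes: a bit
# mask of touched cache lines, running smallest/largest line indexes, and
# counters of how often each extreme moved.
class BlockTracker:
    LINES_PER_BLOCK = 64  # assumed: 4 KB block of 64-byte cache lines

    def __init__(self):
        self.mask = 0          # bit i set => cache line i was accessed
        self.min_idx = None    # smallest line index seen so far
        self.max_idx = None    # largest line index seen so far
        self.min_changes = 0   # times min_idx moved downward
        self.max_changes = 0   # times max_idx moved upward
        self.total = 0         # total accesses observed in this block

    def access(self, line_idx):
        self.mask |= 1 << line_idx
        self.total += 1
        if self.min_idx is None or line_idx < self.min_idx:
            self.min_idx = line_idx
            self.min_changes += 1
        if self.max_idx is None or line_idx > self.max_idx:
            self.max_idx = line_idx
            self.max_changes += 1

    def direction(self, threshold=2):
        # Predominant direction: whichever extreme moved more often wins.
        if self.max_changes - self.min_changes >= threshold:
            return "up"
        if self.min_changes - self.max_changes >= threshold:
            return "down"
        return None
```

Because only the set of touched lines and the movement of the extremes matter, the model tolerates accesses that arrive out of program order, which is the central point of the disclosure.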

Description

201135460 六、發明說明: 【發明所屬之技術領域】 本發明係有關於一般微處理器之快取記憶體,特別係 有關將資料預取至微處理器之快取記憶體。 【先前技術】 以歲近的電腦糸統而s ’在快取失敗(cache miss)時, 微處理器存取系統記憶體所需的時間,會比微處理器存取 快取記憶體(cache)多上一或兩個數量級。因此,為了提高 快取命中率(cache hit rate),微處理器整合了預取技術,用 來測試最近資料存取樣態(examine recent data access patterns) ’並且企圖預測哪一個資料為程式下一個存取的對 象,而預取的好處已是眾所皆知的範嘴。 然而,申请人注意到某些程式的存取樣態並不為習知 微處理器之預取單it所能偵測的。例如,g i圖所示為當 執行之私式包括經由記憶體進行—序列之儲存動作時,第 二級快取記憶體(L2Caehe)之存取樣態,而射所描繪者為 2間之記憶體位址。由第】圖可知,雖㈣㈣㈣著 時間而增加記憶體位址,即由往上之方向,然而在許多狀 況下’所指定之存取記憶體位址亦可較前—個時間往下, 而非總趨勢之往上,使其不同於習4 的結果。 於白知預取單元實際所預測 雖然就數1相對大的樣本而言 u 總趨勢係朝一個方向 則進,但習知預取單元在面臨小樣 ^ 不^部可能出現混亂狀 況的原因有兩個。第一個原因為程古 、依循其架構對存取 0608-A43067TW/final 4 201135460 記憶體’不論是由演算法特性或是不佳的編程(p〇〇r programming)所造成。第二個原因為非循序(〇ut_〇f_〇rder execution)微處理器核心之管線與佇列在正常功能下執行 時’常常會用不同於其所產生的程式順序來進行記憶體存 取。 因此’需要一個資料預取單元(器)能夠有效地為程式進 行資料預取,其必須考慮到在較小時窗(time windows)進行 記憶體存取指令(動作)時並不會呈現明顯之趨勢(n〇 dear trend) ’但當以較大樣本數進行審查時則會出現明顯之趨 勢0 【發明内容】 本發明揭露一種預取單元,設置於具有—快取記憶體 之一微處理器中,其中預取單元係用以接收對—記憶體區 塊之複數位址的複數存取要求,每一存取要求對應記憶體 區塊之位址中之一者,並且存取要求之位址係隨著時間函 數非單調性地(non-monotonically)增加或減少。預取單元包 括一儲存裝置以及一控制邏輯。控制邏輯,耦接至儲存= 置,其中當接收到存取要求'時,控制邏輯則用以.維持儲存 裝置中之存取要求之一*大位址以及一最小位±止,以及最 大位址以及最小位址之變化的計數值'維持記憔 雩近被存取之快取線的一歷史記錄,最近被存:之:取 係與存取要求之位址相關、根據計數值,決定一存 根據歷史記錄,決定一存取樣態,並且根據存取子:二: 著存取方向’將快取記憶體内尚未被歷史記錄指二已$ 〇608-A43067TW/final , ’ 201135460 取之快取線預取至記憶體區塊中。 本發明揭露一種資料預取方法,用以預取資 處理器之—快取記憶體,資料預取方法,包括接收對 憶體區塊之複數位址的複數存取要求,每—麵 ° 1憶體區塊之位址中之—者,並且存取要求之位址係隨= 寸門函數非單5周性地增加或減少去 收到存取要求時’維持記憶體區塊中之_最大以及—: 位址,並且計算最大以及最小位址之變化的計數值 收到存取要求時,維持記憶體區塊巾最近被麵之快^ 的歷史圮錄,最近被存取之快取線係與存取要求之、 相關;根據計數值決;t-存取方向;根據歷史紀錄決= 存取樣態;以及根據存取樣態並沿著存取方向,將快 憶體内尚未被歷史記錄指示為已存取之快取線、= 體區塊中。 王元憶 本發明揭露一種電腦程式產品,編碼於至少一 讀取媒體之上,並適用於—計算裝置,電腦程式產。= :電腦可讀程式編碼。電腦可讀程式編碼,儲存於=括 讀取媒體中’用以在具有-快取記憶體之—微處理器中可 定義出(specify)—預取單元,電腦可讀程式其中預取單元係 用以接收對-記憶體區塊之複數位址的複數存取要求,每 一存取要求對應記憶體區塊之位址中之一者,並且存取要 求之位址係隨著時間函數非單調性地(n〇n_m〇n〇t〇nically) 增加或減少。電腦可讀程式包括一第一程式碼以及一第二 程式碼。第一程式碼,用以定義出一儲存裝置。第二程式 碼,用以定義出一控制邏輯,輕接至儲存裝置,其中當接 0608-A43067TW/fmal 6 201135460 二 收到存取要求時,控制邏輯則用以藉由儲存裝置維持之存 取之一取大位址以及一最小位址,並且計算最大以及最小 ,址之變化的計數值、藉由記憶體區塊維持記憶體區塊中 最近被存取之快取線的一歷史記錄、根據計數值決定一存 
取方向、根據歷史紀錄決定一存取樣態並且根據存取樣態 並沿著存取方向,將快取記憶體内尚未被歷史記錄指示為 已存取之快取線預取至記憶體區塊中。 本發明揭露一種微處理器,包括複數核心、一快取記 憶體以及一預取單元。快取記憶體,由核心所共享,用以 接收對一記憶體區塊之複數位址的複數存取要求,每一存 取要求對應記憶體區塊之位址中之一者,存取要求之位址 系^著日才間函數非單調性地(n〇n_m〇n〇t〇njCally)增加或減 ^預取單元,用以監視存取要求,並維持記憶體區塊中 之一最大位址以及一最小位址,以及最大位址以及最小位 ^之變化的計數值、根據計數值,決定一存取方向並且沿 著存取方向,將记憶體區塊中未命中之快取線預取至快取 記憶體中。 雕本發明揭露一種微處理器,包括一第一級快取記憶 體、一第二級快取記憶體以及一預取單元。預取單元用以 债測出現在第二級快取記憶體中之最近存取要求之一方向 丄及樣心,以及根據方向以及樣態,將複數快取線預取至 第級决取a己憶體中、從第一級快取記憶體,接收第一級 快取記憶體所接收之一存取要求之一位址,其中位址與一 7取線相關、決定在方向中所相關之快取線之後被樣態所 指出之-個或多個快取線並且導致一個或多個快取線被預 7 201135460 取至第一級快取記憶體中。 本發明揭露—種¥料方法,用以預取資料至且有 -第二級快取記憶體之—微處理器之㈣ 田 貞劂出現在第二級快取記憶體中之 取要求之—方向以及樣態,以及根據方向以及樣 也,將複數快取線預取至第二級快取記憶體中;從第一級 快取記憶體,接收第-級快取記憶體所接收之—存取要求 二:其I位址與—快取線相關;決定在方向中所相 關之快取線之後被樣態所指出之—個❹鍊取線;以及 ¥致-個或多個快取線被預取至第—級快取記憶體中。 本發明揭露-種電腦程式產品,編碼於至少一電腦可 讀取媒體之上,並適用於一計算裝置,電腦程式產品包括 -電腦可讀程式編碼。一電腦可讀程式編碼,儲存於電腦 可讀取媒體中,用以定義一微處理器,電腦可讀程式包括 第私式碼、-第二程式碼以及一第三程式碼。第一程 式石馬’用以^義-第—級快取記憶體裝置。第二程式碼, 用以定義-第二級快取記憶體裝置。第三程式碼,用以定 義-預取單元’使得預取單元用以偵測出現在第二級快取 記憶體中之最近存取要求之一方向以及樣態,以及根據方 向以及樣態,將複數快取線預取至第二級快取記憶體中、 從第-級快取記憶體,接收第—級快取記憶體所接收之一 存取要求之一位址,其中位址與一快取線相關、決定在方 向中所相關之快取線之後被樣態所指出之一個或多個快取 線並且導致-個或多個快取線被預取至第一級快取記憶體 中。 〜 〇608-A43067TW/final 201135460 本發明揭露一種微處理器,包括一快取記憶.體以及一 預取單元。預取單元用以偵測具有一第一記憶體區塊之複 數記憶體存取要求之一樣態,並且根據樣態從第一記憶體 區塊預取複數快取線至快取記憶體中、監視一第二記憶體 區塊之一新的記憶體存取要求、決定第一記憶體區塊是否 虛擬鄰近於第二記憶體區塊,並且當自第一記憶體區塊延 續至第二記憶體區塊時,則決定樣態是否預測到第二記憶 體區塊之新的記憶體存取要求所相關之一快取線在第二記 憶體區塊中、並且根據樣態’從第二記憶體區塊將相映的 快取線預取至快取記憶體中。 本發明揭露一種資料預取方法,用以預取資料至一微 處理器之一快取記憶體,資料預取方法包括/[貞測具有一第 一記憶體區塊之複數記憶體存取要求之一樣態,並且根據 樣態從第一記憶體區塊預取快取線至上至快取記憶體中; 監視一第二記憶體區塊之一新的記憶體存取要求;決定第 一記憶體區塊是否虛擬鄰近於第二記憶體區塊,並且當自 第一記憶體區塊延續至第二記憶體區塊時,決定樣態是否 預測到第二記憶體區塊之新的記憶體存取要求所相關之一 快取線在第二記憶體區塊中;以及根據樣態,從第二記憶 體區塊將複數快取線預取至快取記憶體中,以回應決定步 驟。 本發明揭露一種電腦程式產品,編碼於至少一電腦可 讀取媒體之上,並且適用於一計算裝置,電腦程式產品包 括一電腦可讀程式編碼,儲存於電腦可讀取媒體,用以定 義一微處理器。電腦可讀程式包括一第一程式碼以及一第 0608-A43067TW/fina] 9 201135460 二程式碼。第一程式碼,用以定義一快取記憶體裝置。第 二程式碼,用以定義一預取裝置’使得預取裝置用以偵測 具有一第一記憶體區塊之存取之一樣態,並且根據樣態從 第一記憶體區塊預取進入快取線、監視一第二記憶體區塊 之一新的存取要求、決定第一記憶體區塊係虛擬鄰近至第 一s己憶體區塊以及樣態,當持續自第一記憶體區塊至第二 記憶體區塊,預測至與具有第二記憶體區塊之新的要求相 
關之一快取線之一存取、並且根據樣態響應地從第二記憶 體區塊預取進入快取記憶體之快取線。 【實施方式】 以下將詳細討論本發明各種實施例之製造及 法。然而值得注意的是,本發明所提供之許多可行的發明 概念可實施在各種料範㈣。這些肢實施龍用於舉 ㈣明本發明之製造及使用方法,但非用於限定本發明之 範圍。 贋之而言 _仍上返問題之解決方法可 以解釋。當—記«之所有麵(齡、動作 =一=上時,所有存取(指令、動作或要求)之 = 加的存取要求亦表示於同; 上述首張圖如第8圖所亍:°周整大小後之定界框圈起來。 或動作Η。猶塊的兩次存取(指令 示具有4KB區塊之存取的二曰二,存取之時間’ Y軸表 描緣第-次之兩個存取 =1、取線之索引。首先, I ^個絲_絲線5進行存 201135460 + 一子要求係對快取線6進行存取。 一定界框將代表存取要求的兩點圈起來。取如圖所示之 再者,第二個存取要求發 — 得代表第三個存取要求的新點可被定二圈:::帳 新的存取不斷發生,定界推必隨 】:二:: 鏟以及下缓的# ^大為向上的例子)°上述定界框上 為白上at動之歷史紀錄將用以決定存取樣態之趨勢 為向上、向下或者都不是。 除了追縱定界框之上緣以及下緣的趨勢以決定—趨勢 方向外’追縱個別的存取要求也是必要的,因為存取要求 跳過-或兩個快取線的事件時常發生。因此,為了避免跳 過所預取快取線的事件發生,〆旦偵測到一向上或向下之 趨勢’預取單元則使用額外的準則決定所要預取之快取 線。由於存取要求趨勢會被重新排列,預取單元會將這歧 暫態的重新排列存取歷史紀錄予以删除。此動作係藉由^ 記位元(marking bit)在一位元遮罩(bit mask)中完成的,每— 位元對應具有一 &amp;己憶體區塊之·/快取線,,且當位元遮罩 中對應之位元被設置時,表示特定之區塊可被存取。—曰 對記憶體區塊的存取要求已達到一充分數量,預取單元會 使用位元遮罩(其中位元遮罩不具有存取之時序的指示), 並基於如下所述之較大的存取觀點(廣義large view)去存取 整個區塊’而泮基於較小的存取觀點(狹義small view)以及 習知預取單元般僅根據存取的時間去存取之區塊。 第2圖所系為本發明之微處理态1〇〇的方塊圖。微處 理器100包祐/個具有複數階層之傳遞路徑,並且傳遞路 0608-A43067TW/final 11 201135460 徑中亦包括各種功能單元。傳遞路徑包括一指令快取記憶 體102 ’指令快取記憶體1 〇2轉接至一指令解碼器1 ;指 令解碼器104耦接至一暫存器別名表1〇6(register&gt; alias table,RAT);暫存器別名表ι〇6耦接至一保留站 108(reservation station);保留站ι〇8耦接至一執行單元 112(execution unit);最後,執行單元112耦接至一引退單 元114(retire unit)。指令解碼器1〇4可包括一指令轉譯器 (instruction translator),用以將巨集指令(例如χ86架構之 巨集指令)轉譯為微處理器1〇〇之類似精簡指令集(reduce instruction set computerRISC)之巨集指令。保留站 1〇8 產 生並且傳送指令至執行單元112,用以使執行單元112依 照程式順序(program order)執行。引退單元114包括一重新 排序緩衝器(reorder buffer) ’用以依據程式順序執行指令之 引退(Retirement)。執行單元112包括載入/儲存單元134以 及其他執行單元132(other execution unit),例如整數單元 (integer unit)、浮點數單元(floating point unit)、分支單元 (branch unit)或者單指令多重資料串流(single hstructi〇n Multiple Data ’ SIMD)單元。載入/儲存單元134用以讀取 第一級資料快取記憶體116(level 1 data cache)之資料,並 且寫入資料至第一級資料快取記憶體116。一第二級快取 記憶體118用以支持(back)第一級資料快取記憶體116以及 指令快取記憶體102。第二級快取記憶體118用以經由一 匯流排介面單元122讀取以及寫入系統記憶體,匯流排介 面單元122係微處理器100與一匯流排(例如一區域匯流排 (local bus)或是記憶體匯流排(memory bus))間之一介面。微 0608-A43067TW/fmal 12 201135460 包括—預取單元124,用以自系統記憶體預 體116。 級快取記憶體118及/或第一級資料快取記憶 鬧。圖所示為第2圖之預取單元124較詳細之方塊 
—、、單凡包括一區塊位元遮罩暫存器3〇2。區塊 罩暫存器3〇2中之每_位元對應具有一記憶體區塊 〇 深,其中記憶體區塊之區塊號碼係儲存在一區塊 號=暫,盗303内。換言之,區塊號碼暫存器3〇3儲存了 隐體區塊之上層位址位元(upper address bits)。當區境位 =罩暫存器3〇2中之一位元的數值為真㈣e她^時, 係才曰出所對應之快取線已經被存取了。初始化區塊位元遮 罩暫存益302將使得所有的位元值為假(false)。在一實 中’記憶體區塊的大小為4KB ’並且快取線之大小為料 =兀組°因此,區塊位元遮罩暫存器搬具有64位元之容 量丄在某些實施例中,記憶體區塊之大小亦可與實體記憶 體分頁(physical memory page)之大小相同。然而,快取 線之大小在其他實施例中可為其他各種不同之大小。、再 者’區塊位元遮罩暫存器3G2上所轉之記憶體區域之大 小是可改變的’並不需要對應於實體記憶體分頁的大小。 更確切的說,區塊位元遮罩暫存器3G2上所維持之記憶體 區域(或區塊)之大小可為任何大小(二的倍數最好),只要= 擁有足夠的快取線以便進行利於獅方向與樣態的債測即 可0 預取單元124亦可包括-最小指標暫存器3〇4_ pointer register)以及一最大指標暫存器3〇6(m 血 0608-A43067TW/fmal 13 201135460 °最小指標暫存器304以及最大指標暫存器3〇6分 別用以在預取單% 124開始追縱—記憶體區塊之存取後, 持續地指向此記憶體區塊中已被存取之最低以及最高之快 取線的食引(index)。預取單元124更包括一最小改變計數 益:以及—最大改變計數器312。最小改變計數器遞 以及=大改變計數器312分別用以在預取單元I24開始追 縱此記憶體區塊之存取後,計算最小指標暫存器304以及 最大才曰^暫存器3〇6改變之次數。預取單元以亦包括一 總計數器314,用,v y- 〜 用以在預取早儿124開始追蹤此記憶體區 之子取後5十算已被存取之快取線的總數。預取單元丨24 亦包括一中間指標暫存器316,用以在預取單元124開始 追縱此記憶體區塊之存取後’指向此記憶體區塊之中間 引(例如最小指標暫存器304之計數值以及 最大改變计數器312之計數值的平均)。預取單元124亦 Γ4: ^ t# ^ 342(direCti〇n register)' ~ ^ i ΐ 1 臾尋一^暫週/月暫存器346、一樣態區域暫存器州以及 一搜哥Μ票暫存器352,纟各功能如下所述。 及 預取單元U4亦包括複數週期匹配計數器318加細 match counter)。每一週期匹配計數器3 i 8 之-計數值。在—實施例中,週期為3、4'm不同週期 指中間指標暫存器316左/右之位元 。週期係 夕舛釤秸Αγη仏^ 匹配计數器318 之计數值在^的母-記憶體存取進行之後更新。 位元遮罩暫存器302指示在週期 田£鬼 配 左邊的存取與對侧標暫存器-右316 0608-A43067TW/final 14 201135460 - 時,預取單元124則接著增加與該週期相關之週期匹配計 數器318之計數值。關於週期匹配計數器318更詳細之應 用以及操作,將特別在下述之第四、五圖講述之。 預取單元124亦包括一預取要求佇列328、一提取指 標器324(pop pointer)以及一推進指標器326(push pointer)。預取要求符列328包括一循環的項目(entry)仵 列,上述項目的每一者用以儲存預取單元124之操作(特別 是關於第4、6以及7圖)所產生之預取要求。推進指標器 326指出將分派至預取要求佇列328的下一個項目(entry)。 提取指標器324指出將從預取要求佇列328移出之下一個 項目。在一實施例中,因為預取要求可能以失非循序的方 式(out of order)結束,所以預取要求仔列328係可以非循失 序的方式提取(popping)已完成的(completed)項目。在一實 施例中,預取要求佇列328的大小係由於線路流程中,所 有要求進入第二級快取記憶體118之標記之線路(tag pipeline)的線路流程而選擇的,於是使得預取要求仔列328 中項目之數目至少和第二級快取記憶體118内之管線層級 (stages)—樣多。預取要求將維持直至第二級快取記憶體 118之管線結束,在這個時間點,要求)可能是三個結果之 一,如第7圖更詳細之敘述,亦即命中(hit in)第二級快取 記憶磕118、重新執行(replay)、或者推進一全佇列管道項 目,用以從系統記憶體預取需要的資料。 0608-A43067TW/fmal 15 201135460 預取單疋124亦包括控制邏輯322,控制邏輯322控 
制預取單元124之各元件執行其功能。 雖」第3圖只顯示出一組與一主動(⑽㈣記憶體區塊 有關之硬體單元332(區塊位域罩暫存器3()2、區境號碼 暫存器則、最小指標暫存器304、最大指標暫存器遍、 最小改變計數器3〇8、最大改變計數器312、總計數器叫、 中間指標暫存n 316、樣態順序暫存器346、樣態區域暫存 器348以及搜尋指標暫存器352),但預取單幻24可包括 複數個如第3圖所示之硬體單元说,収追縱多個主動 記憶體區塊的存取。 在一實施例中,微處理器100亦包括-個或多個高 度反應式的(highly reactive)預取單元(未圖示),高度反應式 的預取單元係在非常小的暫時樣本(sample)中使用不同的 演算法來進行存取,並且與預取單元124配合動作,其說 明如下。由於此處所述之預取單元m分析較大記憶體存 取之數目(相較於高度反應式的預取單元), 更長的時間去開始預取一新的記憶體區塊, 其必趨向使用 如下所述,但 卻比高反應式_取單元更精確。因此,❹高度反應式 的預取單^與預取單元124同時動作,微處iqq可擁 有高反應式的預取單元之更快反應時間以及預取單元i24 之高精確度。另外’預取單元124可監控來自其他預取單 元之要求,並且在其預取演算法中使用這些要求。 〇608-A43067TW/fmal 201135460 如第4圖所示為第2圖之微處理器100的操作流程 圖,並且特別是第3圖之預取單元124的動作。流程開始 於步驟402。 在步驟402中,預取單元124接收一載入/儲存記憶 體存取要求,用以存取對一記憶體位址之一載入/儲存記憶 體存取要求。在一實施例中,預取單元124在判斷預取哪 些快取線時,會將出載入記憶體存取要求與儲存記憶體存 取要求加以區分。在其他實施例中,預取單元124並不會 在判斷預取哪些快取線時,辨別載入以及儲存。在一實施 例中,預取單元124接收載入/儲存單元134所輸出之記憶 體存取要求。預取單元124可接收來自不同來源之記憶體 存取要求,上述來源包括(但不限於)載入/儲存單元134、 第一級資料快取記憶體116(例如第一級資料快取記憶體 116所產生之一分派要求,於載入/儲存單元134記憶體存 取未擊中第一級資料快取記憶體116時),及/或其他來源, 例如微處理器100之用以執行與預取單元124不同預取演 算法以預取資料之其他預取單元(未圖示)。流程進入步驟 404。 在步驟404中,控制邏輯322拫據比較記憶體存取 位址與每一區塊號碼暫存器303之數值,判斷是否對一主 動區塊之記憶體進行存取。也就是,控制邏輯322判斷第 3圖所示之硬體單元332是否已被分派給記憶體存取要求 0608-A43067TW/final 17 201135460 所指定之記憶體位址所相關的記憶體區塊。若是,則進入 步驟406。 在步驟406中,控制邏輯322分派第3圖所示之硬 體單兀332給相關之記憶體區塊。在一實施例中,控制邏 輯322以—輪替(round-robin)的方式分派硬體單元332。在 其他實施例,控制邏輯322為硬體單元332維持最久未用 到的頁取代法(least_recently_used)之資訊,並且以一最久未 用到的頁取代法(least_recently_used)之基礎進行分派。另 外’控制邏輯322 #初始化所分派之硬體單元332。特別 是’控制邏輯322會清除區塊位域罩暫存器逝之所有 位儿’將記憶體存取位址之上層位元填充(pGpulate)至區塊 號碼暫存器3G3 ’並且清除最小指標暫存!| 3G4、最大指標 暫存器3〇6、最小改變計數器308、最大改變計數器312、 總計數器3M以及匹配計數器318為g。流程進入到 步驟408。 在步驟4〇8中’控制邏輯322根據記憶體存取位址 更新硬體單it 332 ’如第5圖所述。流程進入步驟412。 在步驟412中,石争辦留-, 更體早7〇 332測試(examine)總計數 β 314用以判斷程式是否已對記憶體區塊進行足夠之存取 要求讀偵’則存取樣態。在—實施例中,控制邏輯功 判斷總計數器314之朴盤伯θ ^ 冲數值疋否大於一既定值。在—實施 例中,此既定值為1〇,妙二&quot; 、 …、、而此既定值有很多種本發明不 0608-A43067TW/final 201135460 流程進行至步驟414 ; 於此。若已執行足夠之存取要求, 否則流程結束。 ^判斷在區塊位元 暫存器302中所指定的存取要求是 ’… 勒u B 疋否具有一個明顯的趨 勢。也就疋說,控制邏輯322判斷存取要求有明顯向上的 趨勢(存取位址增加)或是向下_勢(存取位址減少)。在— 實施例中,控制邏輯322根據最小改變計數器3〇8以及最 
大改變计數器312兩者的差值(difference)是否大於一既定 值來決定存取要求是否有明顯的趨勢。在一實施例中,既 定值為2’而在其他實施例中既定值可為其他數值。當最 小改變計數器308之計數值大於最大改變計數器312之叶 數值一既定值’則有明顯向下的趨勢;反之,當最大改變 5十數器312之計數值大於最小改變計數器308之計數值— 既定值’則有明顯向上的趨勢。當有一明顯的趨勢已產生, 則進入步驟416,否則結束流程。 在步驟416中,控制邏輯322判斷在區塊位元遮罩 暫存器302所指定的存取要求中是否為具有一明顯的樣態 週期贏家(pattern period winner)。在一實施例中,控制邏輯 322根據週期匹配計數器318之一者與其他週期匹配計數 器318計數值之差值是否大於一既定值來決定是否有_明 顯的樣態週期贏家。在一實施例中,既定值為2,而在其 他貫施例中既定值可為其他數值。週期匹配計數器3 1 8之 〇608-A43067TW/final 19 201135460 更新動作將於第5圖加以詳述。當有_ ,'肩的樣態週期贏 家產生’流程進行到步驟418 ;否則,流程纟士束 在步驟418中’控制邏輯322填充方向暫存器342 以指出步驟414所判斷之明顯的方向趨執 ° 另。另外,控制邏 輯322用在步驟416偵測之清楚赢家樣態週期⑽肛 winning pattern period)N填充樣態順序智左w 只存盗346。最後, 控制邏輯322將㈣416削貞測到之明―㈣· t 以’控制邏輯322用區 塊位元遮罩暫存器302之N位元至中間指標暫存器之 右側或者左側(根據第5圖步驟518所诂二π 所地而匹配)來填充樣 態暫存器344。流程進行到步驟422。 在步驟422中,控制邏輯322椒μ α 科以根據所偵測到之方向 以及樣態開始對記憶體區塊中尚未被預取之快取線 (non-fetchedcacheiine)進行預取(如第6圖中所述)。流程在 步驟422結束。 第5圖所示為第3圖所示之預取單元124執行第4 圖所示之步驟的操作流程。流程開始於步驟逝。 在步驟502中,控制邏輯 得J22增加總計數器314之 計數值。流程進行到步驟5〇4。 在步驟504中,控制邏輯 铒22判斷目前的記憶體存 取位址(特別是指’最近記情體左 μ體存取位址所相關之快取線之 器306之值 記憶體區塊的索引值)是否大於最大指標暫存 0608-A43067TW/final 20 201135460 若是机輊進行到步驟506 ;若否則流程進行至步驟5〇8。 在步驟506中,控制邏輯322用最近記憶體存取位 止所相關之快取線之記憶體區塊的索引值來更新最大指標 暫存益306,並增加最大改變計數器312之計數值。流程 進行到步驟514。 在步驟508中,控制邏輯322判斷被最近記憶體存 取位址所相關之快取線之記憶體區塊的索引值是否小於最 J才曰私暫存态304之值。若是,流程進行至步驟512 ;若 否’則流程進行至步驟514。 在步驟512中,控制邏輯322帛最近記憶體存取位 址所相關之快取線之記憶體區塊的索;丨值來更新最小指標 暫存器304,並增加最小改變計數器3〇8之計數值。流程 進行到步驟514。 在步驟5丨4中,控制邏輯322計算最小指標暫存器 3〇4與最大指標暫存器3〇6之平均值,並且用所算之出平 均值更新中間指標暫存n 316。流程進行到步驟516。 在步驟5!6中,控制邏輯322檢查區塊位元遮罩暫 存器302,並且以中間指標暫存器316為中心,切割成左 側與右側各N位元,其中N為與每—週期匹配計數器训 有關之每—者之位元的位元數。流程進行到步驟518。 。。在步驟5财,控制邏輯322決定在令間指標暫存 态316之左側的N位元是否與中間指標暫存器幻6之右| 〇608-A43067TW/finai ^ 201135460 的N位元相匹配。若是,流程進行到步驟522 ;若否,則 流程結束。 在步驟522中,控制邏輯322增加具有一 N週期之 週期匹配計數器318之計數值。流程結束於步驟522。 第6圖所示為第3圖之預取單元124執行第4圖之 步驟422的操作流程圖。流程開始於步驟602。 在步驟602中,控制邏輯322初始化會在離開偵測 方向之中間指標暫存器316的樣態噸序暫存器346中,對 搜尋指標暫存器3M以及樣態區域暫存器(patt?n locat1〇n)348進行初始化。也就是說,控制邏輯322會將搜 尋指標暫存器352以及樣態區域暫存器348初始化成中間 指標暫存器316與所彳貞測到之週期(N)兩者之間相加/相減 後的值。例如,當中間指標暫存器3 i 6之值為! 
6,n為$ 並且方向暫存器342所示之趨勢為向上時,控制邏輯3: 則將搜尋指標暫存器352以及樣態區域暫存器撕初始 為2卜因此,在本例中,為了比較之目的(如下所述卜 態暫存器344之5位元可設置於區塊位元遮罩暫存器3( 之位元21到25。流程進行到步驟604。 在步驟604中,控制邏輯322測試區塊位元遮罩 2中在方向暫存器342所指之位元以及樣能暫存 344中之對應位元(該位元係位於樣態區域 用以對應區塊位元遮罩暫存哭I ^ 22 201135460 體區鬼中之對應快取線。流程進行到步驟606。 之快取^步^ _中’控制邏輯322預測是否需要所測試 線。當樣態暫存器⑽之位元為真㈣,控制邏輯 ^-彳此㈣物的,咖㈣將會^^ =:若快靖需要的,流程進行綱614 , 曰否已^驟6〇8中,控制邏輯322根據方向暫存器342 ^區塊中^區塊位元遮罩暫存器3〇2之盡頭,判斷在記憶 :π否有其他未測試之快取線。若已無未測試之快 則·結束;否則,流程進行至步驟612。 342之^驟612中,控制邏輯322增加/減少方向暫存器 ^ 。另外,若方向暫存器342已超過樣態暫存器344 1ΓΓ位元時,控制邏輯322將用方向暫存器如之新 新樣態區域暫存器州,例如將樣態暫存器344轉 移(Shlft)至方向暫存器342之位置。流程進行到步驟604。 θ在步驟614中,控制邏輯322決定所需要之快取線 是否已被預取。當區塊位元遮罩暫存器302之位元為直, 控制邏輯奶則判斷所需要之快取線已被預取。若所需要 之快取線已被預取’流程進行到步驟608 ;否則,流程進 行到步驟616。 在判斷步驟616中,若方向暫存器342為向下,控 制邏輯322決定判斷列入參考之快取線是否自最小指標暫 0608-A43067TW/final 201135460 存器304多於一既定值(既定值在一實施例中為16);,或 者若方向暫存器342為向上,控制邏輯322將判斷決定列 入參考之快取線是否自最大指標暫存器306多於一既定 值。右控制邏輯322決定列入參考之多於上述的判斷為真 既定值,則流程結束;否則,流程進行到判斷步驟618。 值得注意的是,若快取線大幅多於(遠離)最小指標暫存器 304/最大指標暫存器306則流程結束,但這樣並不代表預 取單元124將不接著預取記憶體區塊之其它快取線,因為 根據第4圖之步驟’對記憶體區塊之快取線的後續存取亦 會再觸發更多的預取動作。 在步驟618中,控制邏輯322判斷預取要求仔列328 是否滿了。若是預取要求佇列328滿了’則流程進行到步 驟622,否則流程進行到步驟624。 在步驟622中,控制邏輯322暫停(stall)直到預取要 求仔列328不滿(n〇n-full)為土。流程進行到步驟624。 在步驟624中,控制邏輯322推進一項目(entry)至 預取要求佇列328,以預取快取線。流程進行到步驟608。 如第7圖所示為第3圖t預取要求佇列328的操作 流程圖。流程開始於步驟702。 在步驟702中,在步驊624中被推進到預取要求佇 列328中之一預取要求被允許進行存取(其中此預取要求用 以對第二級快取記憶體118進行存取)’並繼續進行至第二 0608-A43067TW/final 201135460 - 級快取記憶體118的管道。流程進行到步驟704。 在步驟704中,第二級快取記憶體118判斷快取線 位址是否命中第二級快取記憶體118。若快取線位址命中 第二級快取記憶體118,則流程進行到步驟706 ;否則,流 程進行到判斷步驟708。 在步驟706中,因為快取線已經在第二級快取記憶 體118中準備好,故不需要預取快取線,流程結束。 在步驟708中,控制邏輯322判斷第二級快取記憶 體118之回應是否為此預取要求必須被重新執行。若是, 則流程進行至步驟712 ;否則,流程進行至步驟714。 在步驟712中,預取快取線之預取要求係重新推進 (re-pushed)至預取要求仔列328中。流程結束於步驟712。 在步驟714中,第二級快取記憶體118推進一要求 至微處理器100之一全佇列(fill queue)(未圖示)中,用以要 求匯流排介面單元122將快取線讀取至微處理器100中。 流程結束於步驟714。 如第9圖所示為第2圖之微處理器100的操作範 例。如第9圖所示為對一記憶體區塊進行十次存取後,區 塊位元遮罩暫存器3 0 2 (在一位元位置上之星號表示對所對 應之快取線進行存取)、最小改變計數器308、最大改變計 數器312、以及總計數器314在第一、第二以及第十存取 之内容。在第9圖中,最小改變計數器308稱 0608-A43067TW/final 25 201135460 
為”cntr一min_change”,最大改變計數器312稱 為 ’’cntr一max_change”,以及總計數器 3! 4 稱為”cntr_t〇tal”。 中間指標暫存器316之位置在第9圖中則以”μ”所指示。 由於對位址0x4dced300所進行的第一次存取(如第4 圖之步驟402)係在記憶體區塊中位於索引12上的快取線 上進行,因此控制邏輯322將設定區塊位元遮罩暫存器3〇2 之位元12 (第4圖之步驟408),如圖所示。另外,控制邏 輯322將更新最小改變計數器3〇8、最大改變計數器312 以及總計數器314(第5圖之步驟502、506以及512)。 由於對位址0x4Ced260之第二次存取係在記憶體區 塊中位於索引9上的快取線進行,控制邏輯322根據將設 定區塊位元遮罩暫存器302之位元9,如圖所示。另外, 控制邂輯322將更新最小改變計數器308以及總計數器314 之計數值。 在第三到第十次存取中(第三到第九次存取之位址 未予圖示,第十次的存取位址為0x4dced6c0),控制邏輯 322根據會對區塊位元遮罩暫存器進行適當元之設 置,如圖所示。另外,控制邏輯322對應於每一次存取更 新最小改變計數11通、最大改變計數器312以及總計數 器314之計數值。 “ 在每個執行十次的記憶 522後的週期匹配計數 第9圖底部為控制邏輯322 體的存取中,當執行完步驟514到 〇608-A43067TW/final 26 201135460 —318之内容。在第9圖中,週期匹配計數器3Ϊ8稱 為”她―peri〇d_N_matches”,其中 ν為卜2、3、4 或者 5。 如第9圖所示之範例,雖然符合步驟412的準則(總 。十數器314至少為十)以及符合步驟416的準則(週期$之 週/月匹配汁數益318較其他所有之週期匹配計數器318至 少大於2),但不符合步驟414的準則(最小改變計數器 Τ及區塊位元遮罩暫存器3〇2之間的差少於2)。因此,此 時將不會在此記憶體區塊内執行預取操作。 如第9圖底部亦顯示在週期3、4以及5中,從週期 3 4以及5至中間指標暫存器316之右側與左側的樣態。 如第10圖所示為第2圖之微處理器延續第9圖 所示之範例的操作流程圖。第1〇圖描繪相似於第9圖之資 訊,但不同處於在對記憶體區塊之進行第十一次以及第十 人的存取(第十一次存取之位址為〇x4dced76〇)。如圖所 不,其符合步驟412的準則(總計數器314至少為十)、步 驟414的準則(最小改變計數器3〇8以及區塊位元遮罩暫存 302之間的差至少為2)以及步驟416的準則(週期5之週 期匹配計數器318在週期5之計數較其他所有之週期匹配 計數器318至少大於2)。因此,根據第4圖之步驟418, 控制邏輯322填充(populate)方向暫存器342(用以指出方向 趨勢為向上)、樣態順序暫存器346 (填入數值5)、樣態暫 存器344(用樣態,,**,,或者,,〇1〇1〇,,)。控制邏輯322亦根據 〇608-A43067TW/fina! 
201135460 第4圖之步驟422與第6圖,為記憶體區塊執行預取預測, 如第圖所示。第10圖亦顯示控制邏輯322在第6圖之 步驟602操作中’方向暫存器342在位元21之位置。 如第11圖所示為第2圖之微處理器1〇〇延續第9以 及10圖之範例的操作流程圖。» U ®經由範例中描繪十 不同範例之每—者(表標示成G到11)經過第6圖之步驟 604到父驟616直到記憶體區塊之快取線被預取單元 預測發現需要被預取之記憶體區塊之的操作。如圖所示, 在每範例中’方向暫存器342的值是根據第6圖步驟612 而曰加如第11圖所示,在範例5以及1〇中,樣態區域 暫存器348會根據第6圖之倾612被更新。如範例〇、2、 4、5、7以及1〇所示’由於在方向暫存器342之位元為假 (false),樣態指出在方向暫存器⑷上之快取線將不被需 要。圖中更顯示,在範例卜3、6以及8中,由於在方向 暫存器342中樣態暫存器344之位元為真㈣,樣態暫存 器344指出在方向暫存器342上的快取線將被需要,然而 快取線已經準備被取时etehed),如區塊位元遮罩暫存器 302之位元為真(ture)之指示。最後如圖所示,在範例u中, 由於在方向暫存器342中樣態暫存器344之位元為直 (㈣’所以樣態暫存器344指出在方向暫存器⑷上之快 取線將被需要’但是因區塊位元遮罩暫存器302之位元為 饭(false) ’所以此快取線尚未被取出㈣加幻。因此,控制 0608-A43067TW/final 28 201135460 邏輯322根據第6圖之步驟624推進—預取要求至預取要 求fr歹〗328中,用以預取在位址⑻之快取線,其 對應於在區塊位域罩暫存器搬之位元32。 在貝施例中,所描述之一或多個既定值係可藉由 操作系統(例如經由-樣態特定暫存器(mQdei specie 叫▲,廳R))或者經由微處理器1〇〇之溶絲(fuses)來編 程,其中炫絲可在微處理器刚的生產過程中溶斷。 在一貝施例中,區塊位元遮罩暫存器302之大小可 為了節省電源(power)以與及裸片晶片大小機板(die㈣ estate)而減小。也就是言兒’在每一區塊位元遮罩暫存器如 中的位元數’將少於在-記憶體區塊中快取線之數量。例 如,在-實施例中,每-區塊位元遮罩暫存器3〇2之位元 數僅為記憶體區塊所包含之快取線之數量的—半。區塊位 元遮罩暫存器搬僅追縱對上半區塊或者下半區翻存位 取,端看記憶體區塊的那-半先被存取,而—額外之位元 用以指出記憶體區塊之下半部或者上半部是否先被存取Γ 在一實施例中,控制邏輯322並不如步驟516⑽ 所述地測試中間指標暫存器316上下N位元,而是包括— 序列引擎(senaUng㈣,-次—個或兩個位元地掃:區塊 位元遮罩暫存H 3G2,Μ尋找大於—最大週期之樣 態(如前所述為5位元)。 在-實施例中,若在步驟414沒有偵測出明顯的方 0608-A43067TW/fmal 29 201135460 向趨勢、或者在步驟416並未偵測出明顯的樣態週期、以 及總計數器314之計數值到達一既定臨界值(用以指出在記 憶體區塊令之大部份的快取線已被存取)時,控制邏輯 則繼續執行以及預取在記憶體區塊中剩下的快取線。上述 既定臨界值係為記憶體區塊之快取記憶體數量之一相對高 的百分比值,例如區塊位元遮罩暫存器3〇2之位元的值。 第-絲取記憶快取記憧體夕炉中 οσ 一 单元 近代的微處理器包括具有一階層結構之快取記憶 體。典型地’-微處理器包括一又小又快的第一級資料快 取》己It肢以及較大但較慢之第二級快取記憶體,分別如 ^ 2圖之第-級資料快取記憶體116以及第二級快取記憶 把118。具有-階層結構之快取記憶體有利於預取資料至 快取記憶體’以改善快取記憶體之命中率速度⑽咖)。由 於第一級資料快取記憶體116之速度較快,故較佳的狀況 為預取資料至第-級資料快取記憶體ιΐ6。然而,由於第 級貝料快取記憶體116之記憶體容量較小,快取記憶體 命中之速度率可能實際上較差變慢,由於如果預取單元不 正確預取資料進第一級資料快取記憶體U6使得最後資料 W不而要的’便需要而替代以其他需要的資料做替代。 因此貝料被載人第—級資料快取記憶體116或者第二級 〇608-A43067TW/final 201135460 Γ取記憶體118的結果,軸取單^是否能正確預測資料 疋否被需要的函數(funeti°n)。因為第—級資料快取記憶體 ⑴被要求較小的尺寸,第一級資料快取記憶體】關向 較小之容量以及因此具有較差的準確性;反之,由於第二 級快取記憶體標籤以及資料陣列之大小使得第—級快取記 憶體預取單元之大小顯得很小,所以—第二級快取記憶體 預取單元可為較大之容量因此具有較佳之準確性。 本發明實施例所述微處理器2⑽的優勢,在於一載 
入/儲存單幻34用以作為第二級快取記憶體ιΐ8以及第一 級資料快取記憶體]16之預取需要之基礎。本發明之實施 例提升載入/儲存單元134(第二級快取記憶體ιΐ8)之準確 陡用以應用在解決上述預取進入第一級資料快取記憶體 116之問題。再者,實施例中也完成了運用單體邏輯(single body 〇fIogic)來處理第一級資料快取記憶體⑴以及第二 級快取記憶體118之預取操作的目標。 如第12圖所示為根據本發明各實施例之微處理器 100。第12圖之微處理器100相似於第2圖之微處理器_ 並具有如下所述之額外的特性。 第一級資料快取記憶體116提供第一級資料記憶體 位址196至預取單&amp; 124。第一級資料記憶體位址196係 藉由載人/儲存單元134對第-級資料快取記憶體116進行 載入/儲存存取的實體位址。也就是說,預取單元124會隨 0608-A43067TW/fmal , 201135460 著載入/儲存單元134存取第—級資料快取記憶體m時進 仃竊聽(eavesdr〇ps)。預取單元124提供一樣態預測快取線 位址194至第一級資料快取記憶冑116之-仔列198,樣 態賴快取線位址194為快取線之位址,其中之快取線係 預,單疋124根據第一級資料記憶體位址196預測載入/儲 子單7G 134即將對第-級資料快取記憶體出所提出之要 求。第-級資料快取記憶體116提供—快取線配置要求⑼ 至預取f元124 ’用以從第二級快取記憶體ιι8要求快取 線而k些快取線之位址係儲存於传列⑽中。最後,第 二級快取記憶體m提供所要求之快取線資料m至第一 級資料快取記憶體116。 預取單元124,亦包括第一級資料搜尋指標器]72以 及第級&quot;貝料樣悲位址178,如第12圖所示。第一級資料 搜尋才曰U 172以及第-級資料樣態位址178之用途與第 4圖相關且如下所述。 。如第13圖所示為第12圖之預取單元124的操作流 程圖。流程開始於步驟13〇2。 在步驟1302中,預取單元124從第一級資料快取記 隐體116接收第12圖之第一級資料記憶體位土止196。流程 進行到步驟1304。 在步驟1304中,由於預取單元124已事先偵測到一 存取樣態並已開始從系統記憶體預取快取線進入第二級快 °6〇8-A43〇67TW/final ” 201135460 : 取記憶體118,故預取單元124偵測屬於一記憶體區塊(例 如分頁(page))之第一級資料記憶體位址196,如第】至n 圖中相關處所述。仔細而言,由於存取樣態已被偵測,故 預取單元124用以維持(maintain)—區塊號碼暫存器303, 其指定記憶體區塊之基本位址。預取單元124藉由偵測區 塊號碼暫存器303之位元是否匹配第一級資料記憶體位址 196之對應位元,來偵測第一級資料記憶體位址196是否 落在記憶體區塊中。流程進行到步驟13〇6。 在步驟1306中,從第一級資料記憶體位址1開 始,預取單元124在記憶體區塊中所偵測到之存取方向 (detected access direction)上尋找下兩個快取線,這兩個快 取線與先前所偵測的存取方向有關。步驟13〇6更詳細之執 行操作將於後續的第14圖中加以說明。流程進行到步驟 1308。 在步驟1308中,預取單元124提供在步驟13〇6找 到之下兩個快取線之實體位址至第一級資料快取記憶體 116,作為樣態預測快取線位址194。在其他實施例中,預 取單元丨24所提供之快取線位址的數量可多於或少於2。 流程進行到步驟1312。 在步驟1312中,第一級資料快取記憶體116把在步 驟1308中所提供之位址推進至佇列198中。流程進行到步 驟 1314 。 0608-A43067TW/fmal 201135460 在步驟1314中,無論何時只要佇列198為非空 (non-empty),第一級資料快取記憶體116將下一個位址取 出佇列198,並發出一快取線配置要求192至第二級快取 s己憶體118,以便取得在該位址之快取線。然而,若在佇 列198之一位址已出現於第一級資料快取記憶體116,第 一級負料快取記憶體116將拋棄(dumps)該位址以及放棄自 第二級快取記憶體118要求其快取線。第二級快取記憶體 118接著提供所要求之快取線資料188至第一級資料快取 記憶體116。流程結束於步驟1314。 如第14圖所示為第12圖所示之預取單元124根據 第13圖之步驟1306的操作流程圖。第14圖所敘述之操作 係在第3圖所偵測到樣態方向為向上(叫^肛句的狀泥下'。 然而,右所偵測到之樣態方向為向下,預取單元亦可 用以執行同樣的功能。步驟!術到剛之操作係用以將 第3圖中之樣態暫存器344放置在記憶體區塊中適當的位 置’使得預取單A 
124藉由從第一級資料記憶體位址⑼ 上開始的樣態暫存器344之樣態搜尋下兩個快取線中進行 搜尋,並只要有需求時在該記憶體區塊上複製該樣態暫存 器344之樣態344即可。、流程開始於步驟14〇2。 在步驟中,預取單元124咖以於第6圖在步 驟6〇2初始化搜尋指標暫存器352以及樣態區域暫存器⑽ 之方式’用第3圖之樣態順序暫存器以及中間指。 0608-A43067TW/final 日标 $ 34 201135460 ; 存器316的總和,來初始化第12圖之第一級資料搜尋指標 器172以及第一級資料樣態位址178。例如,若中間指標 暫存器316之值為16以及樣態順序暫存器346為5,並且 方向暫存器342之方向為往上,預取單元124初始化第一 級資料搜尋指標器Π2以及第一級資料樣態位址178至 21。流程進行到步驟1414。 在步驟14014中,預取單元124決定第一級資料記 憶體位址196是否落入在具有目前所指定位置之樣態暫存 器344之樣態中,樣態的目前位置開始係根據步驟1402所 決定的,並可根據步驟1406進行更新。也就是說,預取單 元124決定第一級資料記憶體位址196之適當位元(relevant bits)的值(即除了去確認記憶體區塊的位元,以及具有快取 線中用來之指定位元組補償偏移(byte offset)的位元外),是 否大於或者等於第一級資料搜尋指標器172之值,以及是 否小於或者等於第一級資料搜尋指標器172之值與樣態順 序暫存器346之值兩者所相加之總合。若第一級資料記憶 體位址196落入(fall within)樣態暫存器344之樣態中,流 程進行到步驟1408 ;否則流程進行到步驟1406。 在步驟1406中,預取單元124根據樣態順序暫存器 346增加第一級資料搜尋指標器172以及第一級資料樣態 位址178。根據步驟1406(與後續之步驟1418)所述之操作, 若第一級資料搜尋指標器172已達到記憶體區塊之終點則 0608-A43067TW/fmal 35 201135460 結束搜尋。流程回到步驟1404。 在步驟1408中,預取單元124將第一級資料搜尋指 標器172之值設置(set)為第一級資料記憶體位址196所相 關之快取線之記憶體頁的偏移量(offset)。流程進行到步驟 1412。 在步驟1412中,預取單元124在第一級資料搜尋指 標器172中測試樣態暫存器344中之位元。流程進行到步 驟 1414 。 在步驟1414中,預取單元124決定步驟1412所測 試之位元是否設置好了。如果在步驟1412所測試之位元設 置好了,流程進行到步驟1416;否則流程進行到步驟1418。 在步驟1416中,預取單元124將步驟1414被樣態 暫存器344所預測之快取線標記為已準備好傳送實體位址 至第一級資料快取記憶體116,以作為一樣態預測快取線 位址194。流程結束於步驟1416。 在步驟1418中,預取單元124增加第一級資料搜尋 指標器172之值。另外,若第一級資料搜尋指標器172已 超過上述樣態暫存器344之最後一個位元,預取單元124 則用第一級資料搜尋指標器172之新的數值更新第一級資 料搜尋指標器Π2之值,亦即轉換(shift)樣態暫存器344至 新的第一級資料搜尋指標器172的位置。步驟1412到1418 之操作係反覆執行,直到兩快取線(或者快取線之其他既定 0608-A43067TW/final 36 201135460 值)被找到為止。流程結束於步驟1418。 第13圖中預取快取線至第-級資料快取記憶體116 的好處係第-級資料快取記憶體ιι6以及第二級快取記憶 體118所需要之改變較小。然而,在其他實施例中,預取 單元124亦可不提供樣態預測快取線位址194至第-級資 料快取記憶體116。例如’在—實施例中,預取單元124 直接要求匯流排介面單元122自記憶體獲擷取快取線,然 後將所接收之寫人絲線寫人至第—級㈣快取記憶體 在另-實施例中,預取單元124自用以提供資料至預 取早70 124的第二級快取記憶冑118要求並取得快取線(如 果為命中失敗(missing)職記憶體取得快取線),並將收到 之快取線寫入至第-級資料快取記憶體116。在其他實施 例中,預取單元124自第二級快取記憶體118要求快取線 (如果為命巾失敗(missin_從記憶體取得快取線),其直接 將快取線寫入第一級資料快取記憶體116。 如上所述’本#明之各實施例的好處在於具有單一 的預取單元m總計數器314,作為第二級快取記憶體】18 以及第-級資料快取記憶體116兩者之預取f要之基礎。 雖然第2、n以及15圖所示(如下討論之内容)為名明不同 之區塊,預取單元m在㈣安排上可佔據鄰近於第二級 快取記憶體! 
i 8之標籤㈣以及資料列(_虹㈣之位置 並且概念上包括第二級快取記憶體118,如第2】圖所示 〇608-A43067TW/finaI 37 201135460 各實施例允許載人/儲存單元134具大空間之安排來提升之: 其精確度與其大空間之需求,以應用一單體邏輯來處理第 一級資料快取記憶體116以及第二級快取記憶體⑴之預 取操作’以解決習知技術中只能預取進人資料給容量較小 的第一級資料快取記憶體116之問題。 warm-up penaltvH^i ^ 取單元 本發韻叙難料124在—雜縣塊(例如, -貫體記憶體頁)上偵測較複雜之存取樣態(例如,一實體 記憶體頁)’其不同於習知一般預取單元之積測。舉例而 έ ’預取單元124可以根據—樣態制正在進行存取一記 憶體區塊之程式,即使微處理器1〇〇之非循失序執行 (out-of-onier execution)管線(pipeline)會不以程式命令的順 序而重新排序(re-order)記憶體存取,這可能會造成習知一 般預取單元不去偵測記憶體存取樣態以及而導致沒有預取 動作。这疋由於預取單元124只考慮對記憶體區塊之進行 有效地存取’而時間順序(time order)並非其考量點。 然而,為了滿足辨識更複雜之存取樣態及/或重新排 序存取樣態之能力,相較於習知的預取單元,本發明之預 取單元124可能需要一較長之時間去偵測存取樣態,如下 所述之’’暖機時間(warm-up time)”。因此需要一減少預取單 0608-A43067TW/final 38 201135460 ; 元124暖機時間之方法。 預取單元124 “預測—個之前藉由—存取樣態正 在存卜記憶體區塊之程式,是否已經跨到(⑽心術)實 際上與售的記憶體區塊相鄰之一新記憶體區塊,以及預測 此程式是否會根據相同之樣態繼續存取這個新的記憶體區 a應於此’預取單元】24使用來自舊的記憶體區塊之 ' 方向以及其他相關資訊,以加快在新的記憶體區塊 谓測存取樣態的速度,即減少暖機時間。 如第15圖所示為具有一預取單元124之微處理器 100的方塊圖。第15圖之微處理器^⑻相似於第2以及丄2 圖之微處理器100,並且具有如下所述之其它特性。 如第3圖中之相關敘述,預取單元124包括複數硬 體單元332。每一硬體單元332相較於第3圖所述更包括 5己憶體區塊虛擬雜湊虛擬位址攔(hashed virtual address ofmemory,HVAMB)354 以及一狀態攔(status)356。在第 4 圖所述之步驟406初始化已分派之硬體單元332的過程 中,預取單元124取出區塊號碼暫存器303中之實體區塊 碼(physical block number),並在將實體區塊碼轉譯成一虛 擬位址後,根據後續第17圖所述之步驟1704所執行之相 同雜湊法則(the same hashing algorithm)將實體區塊碼轉譯 成一虛擬位址(雜湊(hash)此之虛擬位址),並將其雜湊演算 之結果儲存至記憶體區塊虛擬雜湊位址欄354。狀態欄356 0608-A43067TW/fmal 39 201135460201135460 VI. Description of the Invention: [Technical Field] The present invention relates to a cache memory of a general microprocessor, and more particularly to a cache memory for prefetching data to a microprocessor. [Prior Art] With a computer system that is close to the old one, when the cache miss occurs, the time required for the microprocessor to access the system memory is higher than that of the microprocessor accessing the cache (cache). ) One or two orders of magnitude more. 
Therefore, to increase the cache hit rate, microprocessors integrate prefetch techniques that examine recent data access patterns and attempt to predict which data the program will access next; the benefits of prefetching are well known. However, the Applicant has noticed that the access patterns of some programs are not detectable by the prefetch units of conventional microprocessors. For example, Figure 1 shows the pattern of accesses to the second-level cache (L2 cache) while a program performs a sequence of stores through memory, with time on one axis and memory address on the other. As the figure shows, although the memory addresses generally increase over time (i.e., trend upward), in many cases an individual access address is lower than the previous one rather than following the overall upward trend, which differs from what a conventional prefetch unit would predict. Although over a relatively large sample of accesses the general trend is in one direction, there are two reasons a conventional prefetch unit can be confused by small samples. The first is the way the program accesses memory according to its design, whether caused by the nature of its algorithm or by poor programming. The second is that the pipelines and queues of an out-of-order-execution microprocessor core, operating normally, often perform memory accesses in an order different from the program order that generated them. Therefore, a data prefetch unit is needed that can effectively prefetch data for programs whose memory accesses exhibit no clear trend when examined over small time windows, but which exhibit a clear trend when examined over a larger number of samples.
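The distinction drawn here — no clear trend in a small window, a clear trend over a larger sample — can be shown with a toy trace. The line indexes below are invented for illustration, in the spirit of Figure 1:

```python
# A hypothetical access trace: line indexes trend upward over the whole
# window but frequently step backward, so consecutive deltas give a
# simple stride detector nothing consistent to lock onto.
trace = [12, 9, 13, 15, 14, 16, 18, 17, 20, 19]

deltas = [b - a for a, b in zip(trace, trace[1:])]
monotonic = all(d > 0 for d in deltas)   # the small-window (stride) view
net_trend = trace[-1] - trace[0]         # the large-sample view

print(monotonic, net_trend)  # False 7
```

A stride-based prefetcher sees the negative deltas and finds no stable stride, while the large-sample view still shows a net upward movement of several cache lines.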
SUMMARY OF THE INVENTION

The present invention discloses a prefetch unit in a microprocessor having a cache memory. The prefetch unit is configured to receive a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The prefetch unit includes a storage device and control logic coupled to the storage device. As the access requests are received, the control logic maintains within the storage device a largest address and a smallest address of the access requests, together with counts of changes to the largest and smallest addresses, and maintains a history of recently accessed cache lines implicated by the access request addresses within the memory block. The control logic determines an access direction based on the counts, determines an access pattern based on the history, and, in the access direction and according to the access pattern, prefetches into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.

The present invention also discloses a data prefetch method for prefetching data into a cache memory of a microprocessor. The method includes receiving a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time; maintaining, as the access requests are received, a largest address and a smallest address within the memory block and computing counts of changes to the largest and smallest addresses; and maintaining, as the access requests are received, a history of the cache lines of the memory block that have recently been accessed.
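One way to read the final step — prefetch, in the detected direction and according to the detected pattern, the lines the history has not yet seen — is sketched below. The encoding of the pattern as one bit per line, repeated with period N across the block, is an assumption made for illustration:

```python
# Sketch: replicate a detected period-N pattern across the block in the
# upward direction and queue any line the pattern predicts that the
# history mask has not already recorded as accessed.
def lines_to_prefetch(history_mask, pattern, period, start, nlines=64):
    out = []
    for idx in range(start, nlines):
        wanted = (pattern >> ((idx - start) % period)) & 1
        seen = (history_mask >> idx) & 1
        if wanted and not seen:
            out.append(idx)
    return out
```

Lines the history already marks as accessed are skipped, so only the not-yet-touched lines the pattern predicts are requested.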
The recently accessed cache lines are those implicated by the access request addresses. The method further includes determining an access direction based on the counts; determining an access pattern based on the history; and, in the access direction and according to the access pattern, prefetching into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.

The present invention also discloses a computer program product encoded on at least one computer-readable medium for use with a computing device. The computer program product includes computer-readable program code, stored in the computer-readable medium, for specifying a prefetch unit in a microprocessor having a cache memory. The prefetch unit is configured to receive a plurality of access requests to a plurality of addresses within a memory block, each access request to one of the addresses of the memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The computer-readable program code includes first program code for specifying a storage device and second program code for specifying control logic coupled to the storage device. As the access requests are received, the control logic maintains within the storage device a largest address and a smallest address of the access requests, computes counts of changes to the largest and smallest addresses, maintains a history of recently accessed cache lines of the memory block, determines an access direction based on the counts, determines an access pattern based on the history, and, in the access direction and according to the access pattern, prefetches into the cache memory cache lines of the memory block that the history does not indicate have been recently accessed.
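The "determine an access pattern based on the history" step is elaborated in the detailed embodiments as comparing, for several candidate periods, the history bits on either side of a middle pointer. A minimal sketch under those assumptions (the cap of five candidate periods is taken from the embodiment; the rest is illustrative):

```python
# Sketch: for each candidate period N, compare the N history bits just
# below the middle pointer with the N bits just above it; a match is a
# vote for that period (the embodiment keeps a counter per period and
# updates it on every access).
def period_matches(mask, mid, max_period=5):
    votes = {}
    for n in range(1, min(mid, max_period) + 1):
        below = (mask >> (mid - n)) & ((1 << n) - 1)
        above = (mask >> mid) & ((1 << n) - 1)
        votes[n] = below == above
    return votes
```

Over many accesses, the period whose counter clearly leads the others (by a threshold such as 2 in the embodiment) is taken as the pattern period.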
The present invention further discloses a microprocessor including a plurality of cores, a cache memory, and a prefetch unit. The cache memory, shared by the cores, receives a plurality of access requests, each to one of a plurality of addresses within a memory block, wherein the access request addresses increase or decrease non-monotonically as a function of time. The prefetch unit monitors the access requests and maintains a largest address and a smallest address within the memory block, together with counts of changes to the largest and smallest addresses; determines an access direction based on the counts; and, in the access direction, prefetches missing cache lines of the memory block into the cache memory.

The present invention also discloses a microprocessor including a first-level cache memory, a second-level cache memory, and a prefetch unit. The prefetch unit detects a direction and a pattern of recent access requests appearing in the second-level cache memory and, based on the direction and pattern, prefetches a plurality of cache lines into the second-level cache memory. The prefetch unit also receives from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determines one or more cache lines indicated by the pattern as following, in the direction, the associated cache line; and causes the one or more cache lines to be prefetched into the first-level cache memory.
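For the two-level claim, the step of determining the cache lines the pattern indicates as following the accessed line might look like this in software. The choice of two lines ahead follows the embodiment described later; the one-bit-per-line pattern encoding and the parameter names are assumptions for illustration:

```python
# Sketch: starting just past the line the L1 access touched, walk in the
# upward direction and collect the next `count` line indexes the repeating
# pattern marks as wanted; their addresses would then be handed to the
# first-level cache's request queue.
def next_predicted(pattern, period, pattern_start, from_idx, count=2, nlines=64):
    out = []
    idx = from_idx + 1
    while idx < nlines and len(out) < count:
        if (pattern >> ((idx - pattern_start) % period)) & 1:
            out.append(idx)
        idx += 1
    return out
```

The second-level cache then supplies any of those lines it holds (fetching the rest from memory), so the smaller first-level cache benefits from the larger unit's more accurate prediction.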
The invention discloses a data prefetch method for prefetching data into a microprocessor having a first-level cache memory and a second-level cache memory. The method comprises: detecting a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receiving from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address implicates a cache line; determining one or more cache lines indicated by the pattern beyond the implicated cache line in the direction; and causing the one or more cache lines to be prefetched into the first-level cache memory.

The invention also discloses a computer program product encoded on at least one computer readable medium and suitable for use with a computing device, the computer program product comprising computer readable program code stored in the medium for defining a microprocessor. The computer readable program code comprises a first code, a second code and a third code. The first code defines a first-level cache memory. The second code defines a second-level cache memory. The third code defines a prefetch unit that detects a direction and a pattern of recent access requests presented to the second-level cache memory, prefetches a plurality of cache lines into the second-level cache memory according to the direction and the pattern, receives from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address implicates a cache line, determines one or more cache lines indicated by the pattern beyond the implicated cache line in the direction, and causes the one or more cache lines to be prefetched into the first-level cache memory.
The invention discloses a microprocessor comprising a cache memory and a prefetch unit. The prefetch unit detects a pattern of memory access requests to a first memory block and prefetches a plurality of cache lines from the first memory block into the cache memory according to the pattern. The prefetch unit also monitors a new memory access request to a second memory block, determines whether the first memory block is virtually adjacent to the second memory block, determines whether the pattern, when continued from the first memory block into the second memory block, predicts that the new memory access request implicates a cache line within the second memory block, and prefetches cache lines from the second memory block into the cache memory according to the pattern.

The invention also discloses a data prefetch method for prefetching data into a cache memory of a microprocessor. The data prefetch method comprises: detecting a pattern of memory access requests to a first memory block, and prefetching cache lines from the first memory block into the cache memory according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block, and whether the pattern, when continued from the first memory block into the second memory block, predicts that the new memory access request implicates a cache line within the second memory block; and, in response to the determining steps, prefetching a plurality of cache lines from the second memory block into the cache memory according to the pattern.
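The summary above does not specify the hardware mechanism used to decide that two memory blocks are virtually adjacent. As a minimal illustration only (the function name and the representation of blocks by their block numbers are assumptions, not part of the patent), two blocks can be treated as adjacent when their block numbers differ by one in the detected direction:

```python
def virtually_adjacent(block_num_a, block_num_b, direction):
    """Illustrative adjacency test: block_b immediately follows block_a
    in the given direction (+1 for an upward trend, -1 for downward).
    The text above does not specify how adjacency is determined in hardware."""
    return block_num_b == block_num_a + direction
```

Under this simplification, a pattern detected in block A would only be considered for continuation into block B when the test returns true for the detected direction.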
The invention also discloses a computer program product encoded on at least one computer readable medium and suitable for use with a computing device, the computer program product comprising computer readable program code stored on the medium for defining a microprocessor. The computer readable program code comprises a first code and a second code. The first code defines a cache memory. The second code defines a prefetch unit that detects a pattern of access requests to a first memory block, prefetches cache lines from the first memory block into the cache memory according to the pattern, monitors a new access request to a second memory block, determines whether the first memory block is virtually adjacent to the second memory block and whether the pattern, when continued from the first memory block into the second memory block, predicts that the new access request implicates a cache line within the second memory block, and responsively prefetches cache lines from the second memory block into the cache memory according to the pattern.

[Embodiment] The manufacture and use of various embodiments of the present invention are discussed in detail below. It should be noted, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts; the specific embodiments discussed merely illustrate ways to make and use the invention and are not intended to limit its scope.

The problem addressed by the prefetch unit described herein may be explained as follows. When all accesses to a memory block (whether loads, stores, or prefetch requests) are plotted on a graph as they occur, a bounding box can be drawn around the resulting points, and the box must be updated to enclose each new access request as it arrives, as illustrated in the first figure.
The first figure graphs accesses to a 4 KB memory block: the X axis denotes the time of each access, and the Y axis denotes the index of the cache line within the block that is accessed. The first access is to cache line 5 and the second access is to cache line 6, and the bounding box circumscribes the two points. When a third access request arrives, the box grows to enclose the new point, and as new access requests continue to occur the bounding box must keep growing accordingly (upward, in this example). The upper and lower edges of the bounding box thus form the history used to determine whether the access pattern is trending upward, downward, or neither.

In addition to tracking the upper and lower edges of the bounding box to determine a trend direction, it is also necessary to track the individual access requests, because accesses that skip one or two cache lines occur frequently. Therefore, to avoid prefetching cache lines that will be skipped over, once an upward or downward trend has been detected the prefetch unit uses additional criteria to determine which cache lines to prefetch. Because the access requests tend to arrive reordered in time, the prefetch unit discards the transient ordering history. It does so by marking bits in a bit mask, in which each bit corresponds to one cache line of the memory block; a set bit indicates that the corresponding cache line has been accessed. Once the number of access requests to the memory block is sufficient, the prefetch unit uses the bit mask, which carries no indication of the temporal ordering of the accesses, to make prefetch decisions for the entire block based on this larger view of the access history, rather than on the smaller temporal view of individual accesses on which conventional prefetch units rely.
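The tracking state described above can be sketched in software. The following Python sketch is illustrative only (the class name `BlockTracker` is not from the patent, and the initialization of the pointers on the first access is simplified relative to the flowcharts described later): it maintains the per-line bit mask, the minimum and maximum pointers, and the counters of changes to each, for one 64-line memory block.

```python
class BlockTracker:
    """Tracks accesses to one memory block of 64 cache lines (illustrative)."""

    def __init__(self):
        self.bitmask = [False] * 64   # one bit per cache line; True = accessed
        self.min_ptr = None           # lowest cache-line index accessed so far
        self.max_ptr = None           # highest cache-line index accessed so far
        self.min_changes = 0          # times the lower edge of the box moved
        self.max_changes = 0          # times the upper edge of the box moved
        self.total = 0                # total accesses tracked

    def access(self, line_index):
        """Record one access to the cache line at `line_index` in the block."""
        self.total += 1
        self.bitmask[line_index] = True   # temporal ordering is discarded
        if self.min_ptr is None:          # first access initializes both edges
            self.min_ptr = self.max_ptr = line_index
            return
        if line_index > self.max_ptr:
            self.max_ptr = line_index
            self.max_changes += 1
        elif line_index < self.min_ptr:
            self.min_ptr = line_index
            self.min_changes += 1

    @property
    def middle(self):
        """Middle index between the lower and upper edges."""
        return (self.min_ptr + self.max_ptr) // 2
```

A markedly larger `max_changes` than `min_changes` then corresponds to an upward-growing bounding box, and vice versa, while the bit mask records which lines were touched regardless of order.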
Figure 2 is a block diagram of a microprocessor 100 according to the present invention. The microprocessor 100 has a pipeline comprising multiple stages and various functional units. The pipeline includes an instruction cache 102 coupled to an instruction decoder 104; the instruction decoder 104 is coupled to a register alias table 106 (RAT); the register alias table 106 is coupled to reservation stations 108; the reservation stations 108 are coupled to execution units 112; finally, the execution units 112 are coupled to a retirement unit 114. The instruction decoder 104 may include an instruction translator for translating macroinstructions (e.g., macroinstructions of the x86 architecture) into microinstructions of a RISC-like microinstruction set of the microprocessor 100. The reservation stations 108 issue instructions to the execution units 112 for execution out of program order. The retirement unit 114 includes a reorder buffer that enforces retirement of instructions in program order. The execution units 112 include a load/store unit 134 and other execution units 132, such as an integer unit, a floating-point unit, a branch unit, or a Single Instruction Multiple Data (SIMD) unit. The load/store unit 134 reads data from, and writes data to, a first-level data cache 116. A second-level cache memory 118 backs the first-level data cache 116 and the instruction cache 102. The second-level cache memory 118 reads and writes system memory via a bus interface unit 122.
The bus interface unit 122 interfaces the microprocessor 100 to a bus (e.g., a local bus or a memory bus). The microprocessor 100 also includes a prefetch unit 124, which prefetches data from system memory into the second-level cache memory 118 and/or the first-level data cache 116.

Figure 3 shows a more detailed block diagram of the prefetch unit 124 of Figure 2. The prefetch unit 124 includes a block bit mask register 302. Each bit of the block bit mask register 302 corresponds to one cache line of a memory block whose block number is stored in a block number register 303; that is, the block number register 303 stores the upper address bits of the memory block. A bit of the block bit mask register 302 whose value is true indicates that the corresponding cache line has been accessed; the block bit mask register 302 is initialized with all bits false. In one embodiment the size of a memory block is 4 KB and the size of a cache line is 64 bytes; the block bit mask register 302 therefore holds 64 bits. In some embodiments the size of a memory block may equal the size of a physical memory page, and the cache line size may be other sizes in other embodiments. Furthermore, the size of the memory region over which the block bit mask register 302 is maintained is variable and need not correspond to the size of a physical memory page; rather, the region (or block) tracked by the block bit mask register 302 may be of any size (a power of two is preferable), as long as it contains enough cache lines to permit detection of a direction and a pattern.

The prefetch unit 124 also includes a minimum pointer register 304 and a maximum pointer register 306. The minimum pointer register 304 and the maximum pointer register 306 continuously point to the indices of the lowest and highest cache lines that have been accessed within the memory block since the prefetch unit 124 began tracking accesses to the block. The prefetch unit 124 further includes a minimum change counter 308 and a maximum change counter 312, which respectively count the number of times the minimum pointer register 304 and the maximum pointer register 306 have changed since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also includes a total counter 314, which counts the total number of cache lines accessed since the prefetch unit 124 began tracking accesses to the memory block, and a middle pointer register 316, which points to the middle index between the minimum pointer register 304 and the maximum pointer register 306 (i.e., their average) since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also includes a direction register 342, a pattern register 344, a pattern period register 346, a pattern location register 348 and a search pointer register 352, whose functions are described below.

The prefetch unit 124 also includes a plurality of period match counters 318. Each period match counter 318 maintains a count for a different period. In one embodiment the periods are 3, 4 and 5, where the period is the number of bits to the left/right of the middle pointer register 316 that are compared.
The period match counters 318 are updated after each memory access to the block. If the block bit mask register 302 indicates that the accesses to the left of the middle pointer register 316 over the associated period match the accesses to the right of the middle pointer register 316, the prefetch unit 124 increments the period match counter 318 associated with that period. The application and operation of the period match counters 318 are described in more detail below, particularly with respect to Figures 4 and 5.

The prefetch unit 124 also includes a prefetch request queue 328, a pop pointer 324 and a push pointer 326. The prefetch request queue 328 comprises a circular queue of entries, each of which stores a prefetch request generated by the operation of the prefetch unit 124 (described particularly with respect to Figures 4, 6 and 7). The push pointer 326 indicates the next entry to be allocated in the prefetch request queue 328; the pop pointer 324 indicates the next entry to be removed from the prefetch request queue 328. In one embodiment, because prefetch requests may complete out of order, the prefetch request queue 328 is capable of popping completed entries out of order. In one embodiment, the prefetch request queue 328 is sized such that all of its entries may flow down the tag pipeline of the second-level cache memory 118, i.e., the number of entries in the prefetch request queue 328 is at least as large as the number of pipeline stages in the second-level cache memory 118. A prefetch request is maintained until it reaches the end of the pipeline of the second-level cache memory 118, at which point the request has one of three outcomes, described in more detail with respect to Figure 7: a hit in the second-level cache memory 118, a replay, or the pushing of a request into a fill queue to prefetch the required data from system memory.
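The period-matching step just described can be sketched as follows. For each candidate period N, the N mask bits immediately to the left of the middle pointer are compared with the N bits immediately to the right; a match increments that period's counter. This is a simplified software model (the candidate periods follow the embodiment above, but the exact boundary handling at the edges of the mask is an assumption):

```python
CANDIDATE_PERIODS = (3, 4, 5)  # periods tracked in one embodiment

def update_period_matches(bitmask, middle, counters):
    """Increment counters[N] when the N bits left of `middle` in the
    block bit mask equal the N bits at and right of `middle`."""
    for n in CANDIDATE_PERIODS:
        left = bitmask[max(middle - n, 0):middle]
        right = bitmask[middle:middle + n]
        # only compare when both windows fit entirely inside the block
        if len(left) == n and len(right) == n and left == right:
            counters[n] += 1
```

Calling this after every access accumulates, per period, how often the access history repeats with that period around the middle of the bounding box.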
The prefetch unit 124 also includes control logic 322, which controls the various components of the prefetch unit 124 to perform their functions.

Although Figure 3 shows only one set of hardware 332 associated with one active memory block (the block bit mask register 302, block number register 303, minimum pointer register 304, maximum pointer register 306, minimum change counter 308, maximum change counter 312, total counter 314, middle pointer register 316, pattern period register 346, pattern location register 348 and search pointer register 352), the prefetch unit 124 may include a plurality of the hardware sets 332 shown in Figure 3 in order to track accesses to multiple active memory blocks.

In one embodiment, the microprocessor 100 also includes one or more highly reactive prefetch units (not shown). The highly reactive prefetch units use different algorithms that prefetch based on a very small temporal sample of accesses, and they operate in conjunction with the prefetch unit 124, as described below. Because the prefetch unit 124 described herein analyzes a larger number of memory accesses than a highly reactive prefetch unit, it necessarily tends to take longer to begin prefetching from a new memory block, as described below, but it is more accurate than the highly reactive prefetch unit. Therefore, by operating the highly reactive prefetch unit and the prefetch unit 124 simultaneously, the microprocessor 100 enjoys both the faster reaction time of the highly reactive prefetch unit and the higher accuracy of the prefetch unit 124. In addition, the prefetch unit 124 may monitor requests from the other prefetch units and use those requests in its own prefetch algorithm.

Figure 4 is a flowchart of the operation of the microprocessor 100 of Figure 2, and in particular of the prefetch unit 124 of Figure 3. The flow begins at step 402.
In step 402, the prefetch unit 124 receives a load/store memory access request to a memory address. In one embodiment, the prefetch unit 124 distinguishes between load and store memory access requests when determining which cache lines to prefetch; in other embodiments, it does not distinguish loads from stores when determining which cache lines to prefetch. In one embodiment, the prefetch unit 124 receives the memory access requests output by the load/store unit 134. The prefetch unit 124 may receive memory access requests from various sources, including but not limited to the load/store unit 134, the first-level data cache 116 (e.g., allocation requests generated when a load/store unit 134 access misses in the first-level data cache 116), and/or other sources, such as other prefetch units (not shown) of the microprocessor 100 that employ prefetch algorithms different from that of the prefetch unit 124. The flow proceeds to step 404.

In step 404, the control logic 322 compares the memory access address with the value of each block number register 303 to determine whether the access falls within an active memory block, i.e., whether a hardware set 332 of Figure 3 has already been allocated to the memory block implicated by the address specified by the memory access request. If not, the flow proceeds to step 406.

In step 406, the control logic 322 allocates a hardware set 332 of Figure 3 to the implicated memory block. In one embodiment, the control logic 322 allocates the hardware sets 332 in a round-robin manner; in other embodiments, the control logic 322 maintains least-recently-used information for the hardware sets 332 and allocates on a least-recently-used basis. In addition, the control logic 322 initializes the allocated hardware set 332. In particular, the control logic 322 clears all bits of the block bit mask register 302, populates the block number register 303 with the upper bits of the memory access address, and clears the minimum pointer register 304, the maximum pointer register 306, the minimum change counter 308, the maximum change counter 312, the total counter 314 and the period match counters 318. The flow proceeds to step 408.

In step 408, the control logic 322 updates the hardware set 332 based on the memory access address, as described with respect to Figure 5. The flow proceeds to step 412.

In step 412, the control logic 322 examines the total counter 314 to determine whether the program has made enough accesses to the memory block. In one embodiment, the control logic 322 determines whether the value of the total counter 314 is greater than a predetermined amount, which in one embodiment is ten, although other values are contemplated and the invention is not limited in this respect. If enough accesses have been made, the flow proceeds to step 414; otherwise, the flow ends.

In step 414, the control logic 322 determines whether the access requests recorded in the block bit mask register 302 exhibit a clear trend, i.e., whether the accesses trend clearly upward (increasing access addresses) or downward (decreasing access addresses). In one embodiment, the control logic 322 determines whether a clear trend exists based on whether the difference between the minimum change counter 308 and the maximum change counter 312 is greater than a predetermined amount.
In one embodiment the predetermined amount is two, although other embodiments may use other values. When the count of the minimum change counter 308 exceeds the count of the maximum change counter 312 by the predetermined amount, there is a clear downward trend; conversely, when the count of the maximum change counter 312 exceeds the count of the minimum change counter 308 by the predetermined amount, there is a clear upward trend. If a clear trend has emerged, the flow proceeds to step 416; otherwise, the flow ends.

In step 416, the control logic 322 determines whether there is a clear pattern period winner among the access requests recorded in the block bit mask register 302. In one embodiment, the control logic 322 determines whether a clear pattern period winner exists based on whether the difference between the count of one of the period match counters 318 and the counts of all the other period match counters 318 is greater than a predetermined amount; in one embodiment the predetermined amount is two, although other embodiments may use other values. The updating of the period match counters 318 is described in detail with respect to Figure 5. If a clear pattern period winner has emerged, the flow proceeds to step 418; otherwise, the flow ends.

In step 418, the control logic 322 populates the direction register 342 to indicate the clear trend direction determined in step 414. In addition, the control logic 322 populates the pattern period register 346 with the clear winning pattern period (N) detected in step 416. Finally, the control logic 322 populates the pattern register 344 with the N bits of the block bit mask register 302 to the right or left of the middle pointer register 316 that matched in step 518 of Figure 5. The flow proceeds to step 422.
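The two screening tests of steps 414 and 416 reduce to simple threshold comparisons, with a threshold of two in the embodiment described. A minimal sketch (function names are illustrative, and the use of an inclusive comparison against the threshold is an assumption):

```python
def clear_direction(min_changes, max_changes, threshold=2):
    """Step 414: return 'up', 'down', or None if no clear trend exists."""
    if max_changes - min_changes >= threshold:
        return "up"      # the upper edge of the bounding box moved far more often
    if min_changes - max_changes >= threshold:
        return "down"    # the lower edge moved far more often
    return None

def pattern_period_winner(counters, threshold=2):
    """Step 416: return the period whose match counter exceeds every
    other period's counter by at least `threshold`, else None."""
    for n, count in counters.items():
        if all(count - other >= threshold
               for m, other in counters.items() if m != n):
            return n
    return None
```

Only when both tests return a non-None result does the flow populate the direction, pattern period, and pattern registers and proceed to prefetching.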
In step 422, the control logic 322 prefetches the not-yet-fetched cache lines of the memory block according to the detected direction and pattern, as described with respect to Figure 6. The flow ends at step 422.

Figure 5 is a flowchart of the operation of the prefetch unit 124 of Figure 3 in performing step 408 of Figure 4. The flow begins at step 502.

In step 502, the control logic 322 increments the total counter 314. The flow proceeds to step 504.

In step 504, the control logic 322 determines whether the index, within the memory block, of the cache line implicated by the current memory access address is greater than the maximum pointer register 306. If so, the flow proceeds to step 506; otherwise, the flow proceeds to step 508.

In step 506, the control logic 322 updates the maximum pointer register 306 with the index, within the memory block, of the cache line implicated by the current memory access address, and increments the maximum change counter 312. The flow proceeds to step 514.

In step 508, the control logic 322 determines whether the index, within the memory block, of the cache line implicated by the current memory access address is less than the minimum pointer register 304. If so, the flow proceeds to step 512; otherwise, the flow proceeds to step 514.

In step 512, the control logic 322 updates the minimum pointer register 304 with the index, within the memory block, of the cache line implicated by the current memory access address, and increments the minimum change counter 308. The flow proceeds to step 514.

In step 514, the control logic 322 computes the average of the minimum pointer register 304 and the maximum pointer register 306, and updates the middle pointer register 316 with the computed average.
The flow proceeds to step 516.

In step 516, the control logic 322 examines the block bit mask register 302, centered on the middle pointer register 316, over the N bits to the left and the N bits to the right of the middle pointer register 316, for each period N associated with a period match counter 318. The flow proceeds to step 518.

In step 518, the control logic 322 determines whether the N bits to the left of the middle pointer register 316 match the N bits to the right of the middle pointer register 316. If so, the flow proceeds to step 522; otherwise, the flow ends.

In step 522, the control logic 322 increments the count of the period match counter 318 associated with period N. The flow ends at step 522.

Figure 6 is a flowchart of the operation of the prefetch unit 124 of Figure 3 in performing step 422 of Figure 4. The flow begins at step 602.

In step 602, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to one pattern period away from the middle pointer register 316 in the detected direction. That is, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to the value of the middle pointer register 316 plus or minus the detected period (N). For example, when the value of the middle pointer register 316 is 16, N is 5 and the trend indicated by the direction register 342 is upward, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to 21. Thus, in this case, for comparison purposes the 5 bits of the pattern register 344 are aligned against bits 21 through 25 of the block bit mask register 302, as described below. The flow proceeds to step 604.
In step 604, the control logic 322 examines the bit of the block bit mask register 302 at the search pointer register 352, together with the corresponding bit of the pattern register 344 (which is aligned against the block bit mask register 302 at the location held in the pattern location register 348), in order to predict whether the corresponding cache line in the memory block is needed. The flow proceeds to step 606.

In step 606, the control logic 322 predicts whether the examined cache line is needed. The control logic 322 predicts that the cache line is needed when the bit of the pattern register 344 is true. If the cache line is predicted to be needed, the flow proceeds to step 614; otherwise, the flow proceeds to step 608.

In step 608, the control logic 322 determines, according to the direction register 342, whether there are any more unexamined cache lines in the memory block, i.e., whether the end of the block bit mask register 302 has been reached. If there are no unexamined cache lines, the flow ends; otherwise, the flow proceeds to step 612.

In step 612, the control logic 322 increments/decrements the search pointer register 352 according to the direction register 342. In addition, if the search pointer register 352 has passed beyond the last bit of the pattern register 344, the control logic 322 updates the pattern location register 348 with the new value of the search pointer register 352, i.e., it shifts the pattern register 344 to the new location of the search pointer register 352. The flow proceeds to step 604.

In step 614, the control logic 322 determines whether the needed cache line has already been fetched. The control logic 322 determines that the needed cache line has already been fetched when the corresponding bit of the block bit mask register 302 is true. If the needed cache line has already been fetched, the flow proceeds to step 608; otherwise, the flow proceeds to step 616.
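The prediction walk of steps 602 through 616 can be modelled in software. The following Python sketch is illustrative only: the queue handling of steps 618 through 624 is reduced to collecting line indices, the distance test of step 616 is simplified to an absolute cutoff, and the re-anchoring of the pattern in step 612 is modelled by repeating the pattern every period.

```python
def predict_prefetches(bitmask, pattern, middle, direction, max_ptr, min_ptr,
                       distance_limit=16):
    """Return the cache-line indices predicted to need prefetching.
    direction is +1 (upward trend) or -1 (downward trend)."""
    period = len(pattern)
    start = middle + direction * period          # step 602 initialization
    out = []
    search = start
    while 0 <= search < len(bitmask):            # step 608: stay inside the block
        # the pattern repeats every `period` lines (step 612 re-anchoring)
        pat_bit = pattern[((search - start) * direction) % period]
        if pat_bit and not bitmask[search]:      # steps 606/614: needed, not fetched
            # step 616: stop once too far beyond the accessed edge of the block
            edge = max_ptr if direction > 0 else min_ptr
            if abs(search - edge) > distance_limit:
                break
            out.append(search)
        search += direction
    return out
```

With the running example (middle pointer 16, period 5, upward direction, pattern bits aligned starting at bit 21), lines already marked in the bit mask are skipped and the walk stops once it runs the distance limit past the maximum pointer.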
In decision step 616, if the direction register 342 indicates downward, the control logic 322 determines whether the implicated cache line is more than a predetermined amount (sixteen, in one embodiment) away from the minimum pointer register 304; if the direction register 342 indicates upward, the control logic 322 determines whether the implicated cache line is more than the predetermined amount away from the maximum pointer register 306. If the control logic 322 determines that the relevant condition is true, the flow ends; otherwise, the flow proceeds to decision step 618. It is worth noting that although the flow ends when the cache line is significantly beyond (away from) the minimum pointer register 304/maximum pointer register 306, this does not mean that the prefetch unit 124 will not prefetch other cache lines of the memory block: subsequent accesses to cache lines of the memory block will, per Figure 4, trigger further prefetch operations.

In step 618, the control logic 322 determines whether the prefetch request queue 328 is full. If the prefetch request queue 328 is full, the flow proceeds to step 622; otherwise, the flow proceeds to step 624.

In step 622, the control logic 322 stalls until the prefetch request queue 328 is non-full. The flow proceeds to step 624.

In step 624, the control logic 322 pushes an entry into the prefetch request queue 328 to prefetch the cache line. The flow proceeds to step 608.

Figure 7 is a flowchart of the operation of the prefetch request queue 328 of Figure 3. The flow begins at step 702.

In step 702, a prefetch request that was pushed into the prefetch request queue 328 at step 624 arbitrates for access to the second-level cache memory 118, is granted access, and proceeds down the pipeline of the second-level cache memory 118.
The flow proceeds to step 704.

In step 704, the second-level cache memory 118 determines whether the cache line address hits in the second-level cache memory 118. If the cache line address hits in the second-level cache memory 118, the flow proceeds to step 706; otherwise, the flow proceeds to decision step 708.

In step 706, because the cache line is already present in the second-level cache memory 118, there is no need to prefetch it, and the flow ends.

In step 708, the control logic 322 determines whether the response of the second-level cache memory 118 is that the prefetch request must be replayed. If so, the flow proceeds to step 712; otherwise, the flow proceeds to step 714.

In step 712, the prefetch request to prefetch the cache line is re-pushed into the prefetch request queue 328. The flow ends at step 712.

In step 714, the second-level cache memory 118 pushes a request into a fill queue (not shown) of the microprocessor 100, requesting the bus interface unit 122 to read the cache line into the microprocessor 100. The flow ends at step 714.

An example of the operation of the microprocessor 100 of Figure 2 is shown in Figure 9. Figure 9 shows, after the first, second and tenth of ten accesses to a memory block, the contents of the block bit mask register 302 (an asterisk at a bit position indicates an access to the corresponding cache line), the minimum change counter 308, the maximum change counter 312 and the total counter 314. In Figure 9, the minimum change counter 308 is denoted "cntr_min_change", the maximum change counter 312 is denoted "cntr_max_change", and the total counter 314 is denoted "cntr_total". The position of the middle pointer register 316 is indicated by an "M" in Figure 9.
Since the first access, to address 0x4dced300 (step 402 of Figure 4), falls on the cache line at index 12 within the memory block, the control logic 322 sets bit 12 of the block bitmask register 302 (step 408 of Figure 4), as shown. In addition, the control logic 322 updates the minimum change counter 308, the maximum change counter 312, and the total counter 314 (steps 502, 506, and 512 of Figure 5).

Since the second access, to address 0x4dced260, falls on the cache line at index 9 within the memory block, the control logic 322 sets bit 9 of the block bitmask register 302, as shown. In addition, the control logic 322 updates the count values of the minimum change counter 308 and the total counter 314.

On the third through tenth accesses (the third through ninth access addresses are not shown; the tenth access address is 0x4dced6c0), the control logic 322 sets the appropriate bits of the block bitmask register 302, as shown. In addition, the control logic 322 updates the minimum change counter 308, the maximum change counter 312, and the total counter 314 as appropriate for each access.

The bottom of Figure 9 shows the contents of the period match counters 318 after the control logic 322 has performed steps 514 and 522 in response to the tenth memory access. In Figure 9, the period match counter 318 for period N is labeled "cntr_period_N_matches", where N is 2, 3, 4, or 5.
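The per-access bookkeeping just described can be modeled in a few lines of software. The following sketch is our reconstruction for illustration only (the register names and the 4KB-block/64-byte-line geometry are taken from the text; the actual mechanism is hardware, not code):

```python
# Minimal model of the block bitmask 302 and the min/max pointers and
# change counters, updated per access (steps 408 and 502-512).

class BlockTracker:
    def __init__(self):
        self.bitmask = 0          # block bitmask register 302
        self.min_ptr = None       # min pointer register 304
        self.max_ptr = None       # max pointer register 306
        self.min_change = 0       # cntr_min_change 308
        self.max_change = 0       # cntr_max_change 312
        self.total = 0            # total counter 314

    def access(self, address):
        index = (address & 0xFFF) >> 6      # cache-line index within a 4KB block
        self.bitmask |= 1 << index          # step 408: mark the line accessed
        if self.min_ptr is None:            # first access to the block
            self.min_ptr = self.max_ptr = index
        elif index < self.min_ptr:
            self.min_ptr = index            # new minimum seen
            self.min_change += 1
        elif index > self.max_ptr:
            self.max_ptr = index            # new maximum seen
            self.max_change += 1
        self.total += 1                     # step 512

    def middle(self):                       # middle pointer register 316
        return (self.min_ptr + self.max_ptr) // 2

t = BlockTracker()
t.access(0x4dced300)   # first access of the example: cache-line index 12
t.access(0x4dced260)   # second access: index 9, a new minimum
assert (t.min_ptr, t.max_ptr, t.min_change, t.total) == (9, 12, 1, 2)
```

The two assertions reproduce the first two accesses of the Figure 9 example.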
As shown in Figure 9, although the criterion of step 412 is met (the total counter 314 is at least ten) and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least 2), the criterion of step 414 is not met (the difference between the minimum change counter 308 and the maximum change counter 312 is less than 2). Therefore, no prefetching is performed within this memory block at this time.

The bottom of Figure 9 also shows, for periods 3, 4, and 5, the bits to the right and to the left of the middle pointer register 316 that are compared for a period match.

Figure 10 continues, for the microprocessor 100 of Figure 2, the example begun in Figure 9. Figure 10 depicts information similar to Figure 9, but after the eleventh and twelfth accesses to the memory block (the address of the eleventh access is 0x4dced760). As shown, the criterion of step 412 is met (the total counter 314 is at least ten), the criterion of step 414 is met (the difference between the minimum change counter 308 and the maximum change counter 312 is at least 2), and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least 2). Therefore, according to step 418 of Figure 4, the control logic 322 populates the direction register 342 (to indicate that the direction trend is upward), the pattern period register 346 (with the value 5), and the pattern register 344 (with the pattern "* *", i.e., "01010"). According to step 422 of Figure 4 and the flow of Figure 6, the control logic 322 also performs prefetch prediction for the memory block, as shown. Figure 10 also shows the search pointer 352 at bit 21 as a result of the operation of step 602 of Figure 6.
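The three criteria of steps 412, 414, and 416, together with the period-match test around the middle pointer, can be sketched as follows. This is a software model under our interpretation of the text (thresholds of 10 and 2 are taken from the examples; it is not the patent's actual logic):

```python
# Sketch of the pattern-detection decision: enough accesses (step 412),
# a clear direction (step 414), and one period whose match count clearly
# wins (step 416). The period test compares the N bits on each side of
# the middle pointer, as the text describes.

def bits(mask, lo, n):
    return [(mask >> (lo + i)) & 1 for i in range(n)]

def period_matches(mask, middle, period):
    left = bits(mask, middle - period, period)   # N bits left of middle
    right = bits(mask, middle, period)           # N bits right of middle
    return left == right

def detect(total, min_change, max_change, period_counters,
           min_total=10, clear_margin=2):
    if total < min_total:                               # step 412
        return None
    if abs(max_change - min_change) < clear_margin:     # step 414
        return None
    best = max(period_counters, key=period_counters.get)
    others = [v for p, v in period_counters.items() if p != best]
    if others and period_counters[best] - max(others) < clear_margin:
        return None                                     # step 416
    direction = "up" if max_change > min_change else "down"
    return direction, best

# A bitmask whose set bits repeat with period 5:
mask = 0b01010_01010_01010_01010
assert period_matches(mask, 10, 5) is True
assert detect(11, 0, 5, {3: 1, 4: 1, 5: 4}) == ("up", 5)
```

As in Figure 9, a high enough total count alone is not sufficient: without a clear direction or a clearly winning period, `detect` returns nothing and no prefetch is predicted.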
As shown, in each of the examples the search pointer 352 is incremented according to step 612 of Figure 6, and in examples 5 and 10 the pattern location register 348 is updated according to step 612 of Figure 6. In examples 0, 2, 4, 5, 7, 9, and 10, because the bit of the pattern register 344 at the search pointer 352 is false, the pattern indicates that the cache line at the search pointer 352 will not be needed. In examples 1, 3, 6, and 8, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed; however, that cache line has already been fetched, as indicated by the corresponding bit of the block bitmask register 302 being true. Finally, as shown in example 11, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed, but because the corresponding bit of the block bitmask register 302 is false, this cache line has not yet been fetched. Therefore, according to step 624 of Figure 6, the control logic 322 pushes a prefetch request into the prefetch request queue 328 to prefetch the cache line at address 0x4dced800, which corresponds to bit 32 of the block bitmask register 302.

In one embodiment, one or more of the predetermined values described herein may be programmable, either by the operating system (for example, via a model specific register (MSR)) or via fuses of the microprocessor 100 that may be blown during manufacture of the microprocessor.

In one embodiment, the size of the block bitmask register 302 may be reduced in order to save power and die area.
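The Figure 6 walk illustrated by these twelve examples amounts to the following sketch. The distance bound of 16 and the register names come from the text; the upward-only loop and the flat pattern replication are our simplifications, not the patent's RTL:

```python
# Condensed model of the prefetch-prediction walk: replicate the detected
# pattern in the detected direction and queue any line the pattern
# predicts but the bitmask shows as not yet fetched.

def plan_prefetches(bitmask, middle, period, pattern, max_ptr,
                    max_distance=16, lines_per_block=64):
    start = middle + period              # step 602 (upward direction)
    queued = []
    for idx in range(start, lines_per_block):
        if idx - max_ptr > max_distance: # step 616: too far ahead, stop
            break
        predicted = (pattern >> ((idx - start) % period)) & 1
        already = (bitmask >> idx) & 1
        if predicted and not already:    # needed by the pattern, not fetched
            queued.append(idx)           # step 624: push a prefetch request
    return queued

# Pattern "01010" (bits 1 and 3 set), middle 16, period 5 -> walk from 21;
# with an empty bitmask every pattern-predicted line gets queued.
assert plan_prefetches(0, 16, 5, 0b01010, max_ptr=20)[:4] == [22, 24, 27, 29]
```

With the Figure 10 state (middle pointer 16, period 5), the walk starts at index 21 and, as in example 11, index 32 is among the lines predicted once the already-fetched ones are masked out.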
That is to say, the number of bits in each block bitmask would be less than the number of cache lines in the memory block. For example, in one embodiment, the number of bits of each block bitmask register 302 is only half the number of cache lines in the memory block. The block bitmask register 302 then tracks accesses to only the upper half or the lower half of the memory block, depending on which half of the memory block is accessed first, and an additional bit indicates whether the lower half or the upper half of the memory block was accessed first.

In one embodiment, rather than testing the N bits above and below the middle pointer register 316 as described in connection with step 516, the control logic 322 includes a serial engine that scans the block bitmask register 302 one or two bits at a time, looking for patterns with a period larger than the maximum period otherwise considered (5, as described above).

In one embodiment, if no clear direction is detected at step 414, or no clear period is detected at step 416, and the count value of the total counter 314 reaches a predetermined threshold (indicating that most of the cache lines of the memory block have been accessed), the control logic 322 proceeds to prefetch the remaining cache lines of the memory block. The predetermined threshold is a relatively high percentage of the number of cache lines of the memory block, as indicated by the number of set bits of the block bitmask register 302.

Prefetching to both the first-level data cache and the second-level cache

Modern microprocessors include a hierarchy of cache memories. Typically, a microprocessor includes a small and fast first-level data cache and a larger and slower second-level cache, such as the first-level data cache 116 and the second-level cache 118 of Figure 2.
A hierarchical cache memory arrangement raises the question of which cache level to prefetch data into in order to improve the cache hit rate. Since the first-level data cache 116 is faster, it would seem preferable to prefetch data into the first-level data cache 116. However, since the capacity of the first-level data cache 116 is small, the cache hit rate may actually become worse if the prefetch unit does not prefetch data into the first-level data cache 116 accurately, because data that turns out not to be needed displaces other data that is needed.

Therefore, whether data should be prefetched into the first-level data cache 116 or into the second-level cache 118 is a function of whether the prefetcher can correctly predict which data will be needed. Because a prefetch unit attached to the first-level data cache 116 must be small, it tends to have poor accuracy; in contrast, a prefetch unit associated with the second-level cache 118 may, like the second-level cache's tag and data arrays, be larger, and can therefore achieve better accuracy.

An advantage of the microprocessor 100 according to the embodiments described here is that the prefetch unit 124 bases its prefetching for both the second-level cache 118 and the first-level data cache 116 on the accesses of the load/store unit 134. The embodiments exploit the higher accuracy attainable at the second-level cache 118 to address the problem, described above, of prefetching into the first-level data cache 116.
Furthermore, the embodiments employ a single body of logic to handle prefetching into both the first-level data cache 116 and the second-level cache 118.

Figure 12 shows a microprocessor 100 in accordance with various embodiments of the invention. The microprocessor 100 of Figure 12 is similar to the microprocessor 100 of Figure 2 and has additional features described below.

The first-level data cache 116 provides a first-level data memory address 196 to the prefetch unit 124. The first-level data memory address 196 is the physical address of a load/store access of the load/store unit 134 to the first-level data cache 116. That is, the prefetch unit 124 eavesdrops as the load/store unit 134 accesses the first-level data cache 116. The prefetch unit 124 provides pattern-predicted cache line addresses 194 to a queue 198 of the first-level data cache 116; each pattern-predicted cache line address 194 is the address of a cache line that the prefetch unit 124, based on the first-level data memory addresses 196, predicts the load/store unit 134 is about to request from the first-level data cache 116. The first-level data cache 116 issues cache line allocation requests 192 to the prefetch unit 124 to request from the second-level cache 118 the cache lines whose addresses are stored in the queue 198. Finally, the second-level cache 118 provides the requested cache line data 188 to the first-level data cache 116.

The prefetch unit 124 also includes a first-level data search pointer 172 and a first-level data pattern address 178, as shown in Figure 12.
The use of the first-level data search pointer 172 and the first-level data pattern address 178 is described below in connection with Figure 14.

Figure 13 is a flowchart of the operation of the prefetch unit 124 of Figure 12. The flow begins at step 1302.

In step 1302, the prefetch unit 124 receives the first-level data memory address 196 of Figure 12 from the first-level data cache 116. The flow proceeds to step 1304.

In step 1304, the prefetch unit 124 detects that the first-level data memory address 196 falls within a memory block (for example, a page) for which it has previously detected an access pattern and has begun prefetching cache lines from system memory into the second-level cache 118, as described above in connection with Figures 1 through 11. More specifically, because the access pattern has been detected, the prefetch unit 124 maintains the block number register 303, which specifies the base address of the memory block. The prefetch unit 124 detects whether the relevant bits of the block number register 303 match the corresponding bits of the first-level data memory address 196, in order to determine whether the first-level data memory address 196 falls within the memory block. The flow proceeds to step 1306.

In step 1306, starting from the first-level data memory address 196, the prefetch unit 124 searches in the previously detected access direction for the next two cache lines within the memory block that are implicated by the previously detected access pattern. The operation of step 1306 is described in more detail below in connection with Figure 14. The flow proceeds to step 1308.

In step 1308, the prefetch unit 124 provides the physical addresses of the two cache lines found in step 1306 to the first-level data cache 116 as pattern-predicted cache line addresses 194.
In other embodiments, the number of cache line addresses provided by the prefetch unit 124 may be more or fewer than two. The flow proceeds to step 1312.

In step 1312, the first-level data cache 116 pushes the addresses provided in step 1308 into the queue 198. The flow proceeds to step 1314.

In step 1314, whenever the queue 198 is non-empty, the first-level data cache 116 takes the next address out of the queue 198 and issues a cache line allocation request 192 to the second-level cache 118 in order to obtain the cache line at that address. However, if an address taken from the queue 198 is already present in the first-level data cache 116, the first-level data cache 116 discards the address and forgoes requesting its cache line from the second-level cache 118. The second-level cache 118 then provides the requested cache line data 188 to the first-level data cache 116. The flow ends at step 1314.

Figure 14 is a flowchart of the operation of the prefetch unit 124 of Figure 12 according to step 1306 of Figure 13. The description of Figure 14 assumes that the detected direction is upward (and similarly below); however, if the detected direction is downward, the prefetch unit 124 is also configured to perform the corresponding function. Broadly, the operation places the pattern register 344 of Figure 3 at the appropriate location within the memory block so that the prefetch unit 124 can search, starting from the first-level data memory address 196, for the next two cache lines indicated by the pattern register 344, replicating the pattern register 344 across the memory block as needed. The flow begins at step 1402.

In step 1402, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 of Figure 12 in a manner similar to the initialization of the search pointer 352 and the pattern location register 348 at step 602 of Fig.
6, namely to the sum of the middle pointer register 316 and the pattern period register 346. For example, if the middle pointer register 316 has the value 16, the pattern period register 346 has the value 5, and the direction register 342 indicates the upward direction, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 to 21. The flow proceeds to step 1404.

In step 1404, the prefetch unit 124 determines whether the first-level data memory address 196 falls within the pattern register 344 at its currently specified position; the current position of the pattern is determined initially according to step 1402 and may be updated according to step 1406. That is, the prefetch unit 124 determines whether the value of the appropriate bits of the first-level data memory address 196 (i.e., excluding the bits that identify the memory block and the bits that specify the byte offset within the cache line) is greater than or equal to the value of the first-level data search pointer 172 and less than or equal to the sum of the value of the first-level data search pointer 172 and the value of the pattern period register 346. If the first-level data memory address 196 falls within the pattern register 344, the flow proceeds to step 1408; otherwise the flow proceeds to step 1406.

In step 1406, the prefetch unit 124 increments the first-level data search pointer 172 and the first-level data pattern address 178 by the value of the pattern period register 346. Under the operation described at step 1406 (and at the subsequent step 1418), the search ends if the first-level data search pointer 172 reaches the end of the memory block. The flow returns to step 1404.
In step 1408, the prefetch unit 124 sets the value of the first-level data search pointer 172 to the offset, within the memory page, of the cache line implicated by the first-level data memory address 196. The flow proceeds to step 1412.

In step 1412, the prefetch unit 124 tests the bit of the pattern register 344 at the position indicated by the first-level data search pointer 172. The flow proceeds to step 1414.

In step 1414, the prefetch unit 124 determines whether the bit tested in step 1412 is set. If the bit tested in step 1412 is set, the flow proceeds to step 1416; otherwise the flow proceeds to step 1418.

In step 1416, the prefetch unit 124 marks the cache line predicted at step 1414 as ready to have its physical address sent to the first-level data cache 116 as a pattern-predicted cache line address 194. The flow ends at step 1416.

In step 1418, the prefetch unit 124 increments the value of the first-level data search pointer 172. In addition, if the first-level data search pointer 172 has passed beyond the last bit of the pattern register 344, the prefetch unit 124 updates the first-level data pattern address 178 with the new value of the first-level data search pointer 172, i.e., shifts the pattern register 344 to the position of the new first-level data search pointer 172. The operations of steps 1412 through 1418 repeat until the two cache lines (or some other predetermined number of cache lines) are found. The flow ends at step 1418.

A benefit of prefetching cache lines into the first-level data cache 116 according to Figure 13 is that it requires relatively few changes to the first-level data cache 116 and the second-level cache 118. However, in other embodiments, the prefetch unit 124 may instead not provide the pattern-predicted cache line addresses 194 to the first-level data cache 116.
For example, in one embodiment, the prefetch unit 124 directly requests the bus interface unit 122 to obtain the cache lines from memory and then writes the received cache lines into the first-level data cache 116. In another embodiment, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains a missing cache line from memory), the second-level cache 118 provides the data to the prefetch unit 124, and the prefetch unit 124 writes the received cache lines into the first-level data cache 116. In yet other embodiments, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains a missing cache line from memory), and the second-level cache 118 writes the cache lines directly into the first-level data cache 116.

An advantage of the various embodiments described above is that a single prefetch unit 124 serves as the basis for prefetching into both the second-level cache 118 and the first-level data cache 116. Although Figures 2, 12, and 15 (discussed below) show the prefetch unit 124 as a distinct block, the prefetch unit 124 may physically occupy a position adjacent to the tag and data arrays of the second-level cache 118 and may conceptually be included within the second-level cache 118, as shown in Figure 2. These embodiments allow the prefetch unit 124 the large space arrangement needed to achieve its accuracy, while applying a single body of logic to handle prefetching for both the first-level data cache 116 and the second-level cache 118, thereby addressing the prior-art problem that accurate prefetching could not be directed at the smaller-capacity first-level data cache 116.
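The Figure 14 search that produces the pattern-predicted addresses for the first-level data cache can be approximated as follows. This is a simplified software model under our reading of steps 1402 through 1418 (the step 1408 handling is condensed into anchoring the pattern window), not the actual hardware:

```python
# Sketch: after an L1 data access, slide the detected pattern up the
# memory block until it covers the accessed line, then return the next
# `wanted` cache-line indices the pattern predicts.

def next_predicted_lines(pattern, period, middle, l1d_index,
                         wanted=2, lines_per_block=64):
    search = middle + period                 # step 1402 initialization
    # Steps 1404/1406: advance the pattern window by whole periods until
    # it covers the accessed line's index (or the block ends).
    while not (search <= l1d_index <= search + period):
        search += period
        if search >= lines_per_block:
            return []
    pattern_base = search                    # window anchored here
    found = []
    idx = l1d_index + 1
    while idx < lines_per_block and len(found) < wanted:
        if (pattern >> ((idx - pattern_base) % period)) & 1:  # steps 1412-1414
            found.append(idx)                # step 1416: mark for L1D prefetch
        idx += 1                             # step 1418
    return found

# Middle pointer 16, period 5, pattern "01010", access at line 23:
# the window covers lines 21..26 and the next predicted lines are 24 and 27.
assert next_predicted_lines(0b01010, 5, 16, 23) == [24, 27]
```

The two returned indices correspond to the two pattern-predicted cache line addresses 194 handed to the queue 198 at step 1308.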
Prefetch unit with reduced warm-up penalty

The prefetch unit 124 described here detects relatively complex access patterns within a memory block (for example, a physical memory page), which a conventional prefetch unit would not detect. For example, the prefetch unit 124 can detect that a program is accessing a memory block according to a pattern even though the out-of-order execution pipeline of the microprocessor 100 may re-order the memory accesses out of program order, a condition that may cause a conventional prefetch unit to fail to detect the access pattern and consequently not prefetch. This is because the prefetch unit 124 considers the effective accesses to a memory block without regard to their temporal order.

However, in exchange for the ability to recognize more complex and/or re-ordered access patterns, the prefetch unit 124 may take longer than a conventional prefetch unit to detect an access pattern, a time referred to below as the "warm-up time". It is therefore desirable to reduce the warm-up time of the prefetch unit 124.

To do so, the prefetch unit 124 predicts whether a program that was previously accessing one memory block according to a pattern has crossed over into a new memory block that is virtually adjacent to the old memory block, and whether the program will continue to access the new memory block according to the same pattern. If so, the prefetch unit 124 uses the pattern, direction, and other relevant information from the old memory block to speed up detection of the access pattern in the new memory block, that is, to reduce the warm-up time.

Figure 15 shows a block diagram of a microprocessor 100 having a prefetch unit 124. The microprocessor 100 of Figure 15 is similar to the microprocessors 100 of Figures 2 and 12, and has additional features described below.
As described in connection with Figure 3, the prefetch unit 124 includes a plurality of hardware units 332. Compared with Figure 3, each hardware unit 332 of Figure 15 further includes a hashed virtual address of the memory block (HVAMB) field 354 and a status field 356. When initializing an allocated hardware unit 332 at step 406 of Figure 4, the prefetch unit 124 takes the physical block number stored into the block number register 303, translates the physical block number into a virtual address, hashes the virtual address according to the same hash algorithm performed at step 1704 of Figure 17 (described below), and stores the result into the hashed virtual address of the memory block field 354. The status field 356

has three possible values: inactive, active, or probationary, as described below. The prefetch unit 124 also includes a virtual hash table (VHT) 162; the organization and operation of the virtual hash table 162 are described in detail below in connection with Figures 16 through 19.

Figure 16 shows the virtual hash table 162 of Figure 15. The virtual hash table 162 includes a plurality of entries, preferably organized as a queue. Each entry includes a valid bit (not shown) and three fields: a minus-one hashed virtual address 1602 (HVAM1), an unmodified hashed virtual address 1604 (HVAUN), and a plus-one hashed virtual address 1606 (HVAP1). Generation of the values used to populate these fields is described below in connection with Figure 17.

Figure 17 is a flowchart of the operation of the microprocessor 100 of Figure 15. The flow begins at step 1702.

In step 1702, the first-level data cache 116 receives a load/store request from the load/store unit 134; the load/store request includes a virtual address. The flow proceeds to step 1704.

In step 1704, the first-level data cache 116 performs a hash function on selected bits of the virtual address received in step 1702 to generate an unmodified hashed virtual address 1604 (HVAUN). In addition, the first-level data cache 116 adds the memory block size (MBS) to the selected bits of the virtual address received in step 1702 to generate a sum and performs the hash function on the sum to generate a plus-one hashed virtual address 1606 (HVAP1). In addition, the first-level data cache 116 subtracts the memory block size from the selected bits of the virtual address received in step 1702 to generate a difference and performs the hash function on the difference to generate a minus-one hashed virtual address 1602 (HVAM1). In one embodiment, the memory block size is 4KB. In one embodiment, the virtual address is 40 bits, and bits 39:30 and 11:0 of the virtual address are ignored by the hash function. The remaining 18 virtual address bits are "dealt", like cards, across the hash bit positions. The idea is that the lower bits of the virtual address have the highest entropy and the higher bits the lowest entropy; dealing them in this way keeps the entropy level relatively uniform across the hashed bits. In one embodiment, the remaining 18 virtual address bits are hashed down to 6 bits according to the logic of Table 1 below. However, other embodiments may employ different hash algorithms; furthermore, in embodiments where performance dominates space and power-consumption design concerns, no hash may be employed at all. The flow proceeds to step 1706.

assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];

Table 1

In step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, the plus-one hashed virtual address (HVAP1) 1606, and the minus-one hashed virtual address (HVAM1) 1602 generated in step 1704 to the prefetch unit 124. The flow proceeds to step 1708.

In step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the unmodified hashed virtual address (HVAUN) 1604, the plus-one hashed virtual address (HVAP1) 1606, and the minus-one hashed virtual address (HVAM1) 1602 received in step 1706. That is, if the virtual hash table 162 already includes an entry with these HVAUN, HVAP1, and HVAM1 values, the prefetch unit 124 forgoes updating the virtual hash table 162; otherwise, the prefetch unit 124 pushes the unmodified hashed virtual address 1604 (HVAUN), the plus-one hashed virtual address 1606 (HVAP1), and the minus-one hashed virtual address 1602 (HVAM1) into the top entry of the virtual hash table 162 in a first-in-first-out fashion and marks the pushed entry valid. The flow ends at step 1708.

Figure 18 shows the contents of the virtual hash table 162 of Figure 16 after the prefetch unit 124 and the load/store unit 134 have operated as described in connection with Figure 17, where, in response to execution of the program, the load/store unit 134 has proceeded in an upward direction through two memory blocks (denoted A and
A+MBS) and has entered a third memory block (denoted A+2*MBS), so that the prefetch unit 124 has populated the virtual hash table 162 accordingly. Specifically, the virtual hash table 162 entry two from the tail includes the hash of A-MBS in its minus-one hashed virtual address (HVAM1) 1602, the hash of A in its unmodified hashed virtual address (HVAUN) 1604, and the hash of A+MBS in its plus-one hashed virtual address (HVAP1) 1606; the entry one from the tail includes the hash of A in HVAM1 1602, the hash of A+MBS in HVAUN 1604, and the hash of A+2*MBS in HVAP1 1606; and the tail entry (i.e., the most recently pushed entry) includes the hash of A+MBS in HVAM1 1602, the hash of A+2*MBS in HVAUN 1604, and the hash of A+3*MBS in HVAP1 1606.

Figure 19 (comprising Figures 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Figure 15. The flow begins at step 1902.

In step 1902, the first-level data cache 116 sends a new allocation request (AR) to the second-level cache 118. The allocation request implicates a new memory block; that is, the prefetch unit 124 determines that the memory block implicated by the allocation request is new, meaning that no hardware unit 332 has yet been allocated to the memory block implicated by the allocation request, because the prefetch unit 124 has only recently encountered the new memory block. In one embodiment, the allocation request is generated in response to a load/store that misses in the first-level data cache 116 and consequently requests the cache line from the second-level cache 118. The allocation request specifies a physical address; associated with the physical address is a virtual address from which the physical address was translated. The first-level data cache 116 performs the same hash function as at step 1704 of Figure 17 on the virtual address associated with the physical address of the allocation request, in order to generate a hashed virtual address of the allocation request (HVAAR), and provides the hashed virtual address of the allocation request to the prefetch unit 124. The flow proceeds to step 1903.

In step 1903, the prefetch unit 124 allocates a new hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates an inactive hardware unit 332 to the new memory block; otherwise, it allocates the least-recently-used hardware unit 332. In one embodiment, once the prefetch unit 124 has prefetched all of the cache lines of a memory block indicated by the pattern, the prefetch unit 124 inactivates the associated hardware unit 332. In one embodiment, the prefetch unit 124 can pin a hardware unit 332 so that it is ineligible to be reset even if it becomes the least-recently-used hardware unit 332. For example, if the prefetch unit 124 detects that a predetermined number of prefetches have been performed within the memory block according to the pattern, but the prefetch unit 124 has not yet completed all of the pattern's prefetches for the entire memory block, the prefetch unit 124 may pin the hardware unit 332 associated with the memory block so that it remains ineligible to be reset even if it becomes the least-recently-used hardware unit 332. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (from its original allocation), and when the age reaches a predetermined age threshold, the prefetch unit 124 inactivates the hardware unit 332. In another embodiment, if the prefetch unit 124 finds a virtually adjacent memory block via subsequent steps 1904 through 1926 and has completed the prefetching from the virtually adjacent memory block, the prefetch unit 124 may selectively re-use the hardware unit 332 of the virtually adjacent memory block rather than allocating a new hardware unit 332. In this embodiment, the prefetch unit 124 selectively initializes the various storage elements of the re-used hardware unit 332 (for example, the direction register 342, the pattern register 344, and the pattern location register 348) so as to retain the usable information stored therein. The flow proceeds to step 1904.

In step 1904, the prefetch unit 124 compares the hashed virtual address of the allocation request (HVAAR) generated in step 1902 with the minus-one hashed virtual address 1602 (HVAM1) and the plus-one hashed virtual address 1606 (HVAP1) of each entry of the virtual hash table 162. The prefetch unit 124 performs steps 1904 through 1922 to determine whether an active memory block is virtually adjacent to the new memory block, and performs steps 1924 through 1928 to predict whether the memory accesses will, according to the previously detected access
pattern and direction, continue from the virtually adjacent active memory block into the new memory block; if so, the warm-up time of the prefetch unit 124 is reduced, so that the prefetch unit 124 can begin prefetching the new memory block sooner. The flow proceeds to step 1906.

In step 1906, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address of the allocation request (HVAAR) matches the plus-one hashed virtual address 1606 (HVAP1) of any entry of the virtual hash table 162. If so, the flow proceeds to step 1908; otherwise, the flow proceeds to step 1912.

In step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. The flow proceeds to step 1916.

In step 1912, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address of the allocation request (HVAAR) matches the minus-one hashed virtual address 1602 (HVAM1) of any entry of the virtual hash table 162. If so, the flow proceeds to step 1914; otherwise, the flow ends.

In step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. The flow proceeds to step 1916.

In step 1916, the prefetch unit 124 sets a candidate_hva register (not shown) to the value of the unmodified hashed virtual address 1604 (HVAUN) of the virtual hash table 162 entry determined at step 1906 or step 1912. The flow proceeds to step 1918.

In step 1918, the prefetch unit 124 compares the candidate_hva with the hashed virtual address of the memory block (HVAMB) field 354 of each active memory block in the prefetch unit 124. The flow proceeds to step 1922.

In step 1922, the prefetch unit 124 determines, based on the comparison performed in step 1918, whether the candidate_hva matches any hashed virtual address of the memory block (HVAMB) field 354. If the candidate_hva matches an HVAMB field 354, the flow proceeds to step 1924; otherwise, the flow ends.

In step 1924, the prefetch unit 124 has determined that the matching active memory block found in step 1922 is indeed virtually adjacent to the new memory block. Therefore, the prefetch unit 124 compares the candidate direction (determined at step 1908 or step 1914) with the direction register 342 of the matching active memory block, in order to predict, from the previously detected access pattern and direction, whether the memory accesses will continue from the virtually adjacent active memory block into the new memory block. Specifically, if the candidate direction differs from the direction register 342 of the virtually adjacent memory block, the memory accesses are unlikely to continue from the virtually adjacent memory block into the new memory block according to the previously detected pattern.
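The hashing machinery used throughout this section (the Table 1 XOR fold and the three hashed addresses of step 1704) can be modeled directly in software. This is our reconstruction of logic the patent implements in gates; the example addresses are made up for illustration:

```python
# The Table 1 hash: bits 39:30 and 11:0 of the 40-bit virtual address are
# ignored, and the remaining 18 bits (29..12) are XOR-folded to 6 bits.

MBS = 4096  # memory block size: 4KB in the described embodiment

def hash6(va):
    b = lambda i: (va >> i) & 1
    return (
        (b(29) ^ b(18) ^ b(17)) << 5 |
        (b(28) ^ b(19) ^ b(16)) << 4 |
        (b(27) ^ b(20) ^ b(15)) << 3 |
        (b(26) ^ b(21) ^ b(14)) << 2 |
        (b(25) ^ b(22) ^ b(13)) << 1 |
        (b(24) ^ b(23) ^ b(12))
    )

def hashed_triple(va):
    """HVAM1, HVAUN, HVAP1 for one load/store virtual address (step 1704)."""
    return hash6(va - MBS), hash6(va), hash6(va + MBS)

# Two addresses within the same 4KB block hash identically, so at step
# 1708 a single virtual hash table entry suffices per block:
assert hashed_triple(0x12345040) == hashed_triple(0x12345FC0)
```

Because the low 12 bits are ignored, every access within a block yields the same triple, which is why the duplicate check at step 1708 keeps the table small.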
17 is a flow chart showing the operation of the microprocessor 1 of Fig. 15. The flow begins in step 1702. In step 1702, the first level data cache 116 receives the manned/storage request from the load/store unit 134, the manned/storage request including a virtual address. The flow proceeds to step 17〇4. In step 1704, the first level data cache memory 116 performs a hash function (function) on the bit selected by the material address received in the step to generate an unmodified hash virtual address 1604 (HVAUN). ). In addition, the 'first-level data cache record has been added to the memory block size (MBS) of the U-body 116 to the bit selected in the hash address received in step 17〇2, for generating-plus The total value 'executes &amp; 0608-A43067TW/fmal ', ', month b for the summed value to produce a 201135460 positive 1 hash virtual address 1606 (HVAP1). In addition, the first level data cache δ mnemonic 116 subtracts the size of the s hexamed block from the bit selected by the hash address received at step 1702 to generate a difference, and The difference performs a hash function to produce a negative 1 hash virtual address 1602 (HVAjVH). In one embodiment, the memory block size is 4 ΚΒ. In one embodiment, the virtual address is 40 bits, and the virtual address bits 39:30 and 11: are ignored by the hash function. The remaining 18 virtual address bits are ''dealt''. If the information is already owned, it is processed by the hash bit position. The idea is that the lower bits of the virtual address have the highest degree of chaos. (entropy) and higher bits have the lowest degree of chaos. This method is used to ensure that the entropy level is a more consistent cross-heavy bit. In one embodiment, the remaining virtual address is 18 bits. It is hashed to 6 bits according to the method in Table 1. 
assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];

Table 1

However, other embodiments may employ different hash algorithms; furthermore, embodiments in which performance dominates space and power-consumption design considerations may forgo hashing altogether. Flow proceeds to step 1706.

In step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, the positive-one hashed virtual address (HVAP1) 1606, and the negative-one hashed virtual address (HVAM1) 1602 generated in step 1704 to the prefetch unit 124. Flow proceeds to step 1708.

In step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the HVAUN 1604, HVAP1 1606, and HVAM1 1602 received in step 1706. That is, if the virtual hash table 162 already includes an entry containing the HVAUN 1604, HVAP1 1606, and HVAM1 1602, the prefetch unit 124 forgoes updating the virtual hash table 162. Otherwise, the prefetch unit 124 pushes the HVAUN 1604, HVAP1 1606, and HVAM1 1602 into the top entry of the virtual hash table 162 in first-in-first-out (FIFO) fashion and marks the pushed entry valid. Flow ends at step 1708.

Fig. 18 shows the contents of the virtual hash table 162 of Fig. 16 after the prefetch unit 124 has operated on accesses by the load/store unit 134 according to the description of Fig. 17, in which the load/store unit 134 executes a program that proceeds in an upward direction through two memory blocks (denoted A and A+MBS) and into a third memory block (denoted A+2*MBS), and the prefetch unit 124 has filled the virtual hash table 162 in response. Specifically, the entry second from the end of the virtual hash table 162 includes the hash of A in its negative-one hashed virtual address (HVAM1) 1602 field, the hash of A+MBS in its unmodified hashed virtual address (HVAUN) 1604 field, and the hash of A+2*MBS in its positive-one hashed virtual address (HVAP1) 1606 field; the entry at the end (that is, the most recently pushed entry) includes the hash of A+MBS in its negative-one hashed virtual address (HVAM1) 1602 field, the hash of A+2*MBS in its unmodified hashed virtual address (HVAUN) 1604 field, and the hash of A+3*MBS in its positive-one hashed virtual address (HVAP1) 1606 field.

Fig. 19 (comprising Figs. 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Fig. 15. Flow begins at step 1902. In step 1902, the first-level data cache 116 transmits a new allocation request (AR) to the second-level cache 118. The new allocation request is to a new memory block; that is, the prefetch unit 124 determines that it has not already allocated a hardware unit 332 to the memory block implicated by the allocation request.
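The selective FIFO update of step 1708 can be sketched in software as follows; `VirtualHashTable`, `toy_hash`, and the table depth are illustrative stand-ins for the hardware structure, not its actual implementation:

```python
from collections import deque

MBS = 4096
toy_hash = lambda a: (a >> 12) & 0x3F   # stand-in for the Table 1 hash


class VirtualHashTable:
    """FIFO of (HVAM1, HVAUN, HVAP1) triples, modeling the
    selective update of step 1708 (the depth is illustrative)."""

    def __init__(self, num_entries: int = 8):
        self.entries = deque(maxlen=num_entries)  # oldest entry falls off

    def update(self, hvam1: int, hvaun: int, hvap1: int) -> None:
        triple = (hvam1, hvaun, hvap1)
        if triple in self.entries:       # already present: forgo the update
            return
        self.entries.appendleft(triple)  # push as the newest (top) entry


vht = VirtualHashTable()
for va in (0xA000, 0xA000, 0xB000):      # the repeated access updates once
    vht.update(toy_hash(va - MBS), toy_hash(va), toy_hash(va + MBS))
```

The dedup check before the push is what keeps one block's stream of hits from flooding the table with identical triples while still recording block-to-block transitions in recency order.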
That is, the prefetch unit 124 has encountered a memory access to a new memory block. The allocation request is generated as a result of a load/store request missing in the first-level data cache 116 and, in one embodiment, requests allocation of a cache line. The allocation request specifies not the virtual address but only the physical address, which the first-level data cache 116 translated from the virtual address. The first-level data cache 116 also performs the same hash function as in step 1704 of Fig. 17 on the virtual address from which the allocation-request physical address was translated, to generate a hashed virtual address for the allocation request (HVAAR), and provides the hashed virtual address to the prefetch unit 124. Flow proceeds to step 1903.

In step 1903, the prefetch unit 124 allocates a hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates it; otherwise, the prefetch unit 124 allocates the least recently used hardware unit 332. In one embodiment, the prefetch unit 124 inactivates a hardware unit 332 once it has prefetched all the cache lines of the hardware unit's memory block. In one embodiment, the prefetch unit 124 has the ability to pin a hardware unit 332 so that, even if it becomes the least recently used hardware unit 332, it will not be reallocated. For example, if the prefetch unit 124 detects that the program is accessing the memory block according to the detected pattern but the prefetch unit 124 has not yet prefetched the entire memory block according to the pattern, the prefetch unit 124 may pin the hardware unit 332 allocated to that memory block so that it is ineligible for reallocation even if it is the least recently used hardware unit 332. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (from its original allocation), and when the age reaches a predetermined age threshold, the prefetch unit 124 unpins the hardware unit 332. In another embodiment, if, per steps 1904 through 1926, the prefetch unit 124 finds a virtually adjacent memory block for which prefetching has already been completed, the prefetch unit 124 may selectively reuse that virtually adjacent memory block's hardware unit 332 rather than allocating a new hardware unit 332. In this case, the prefetch unit 124 selectively refrains from initializing various storage elements of the reused hardware unit 332 (for example, the direction register 342, the pattern register 344, and the pattern location register 348) in order to preserve the useful information stored therein. Flow proceeds to step 1904.

In step 1904, the prefetch unit 124 compares the hashed virtual address (HVAAR) generated in step 1902 with the negative-one hashed virtual address (HVAM1) 1602 and the positive-one hashed virtual address (HVAP1) 1606 of each entry of the virtual hash table 162.
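A minimal software model of the allocation policy of step 1903 above (prefer an inactive hardware unit 332, otherwise evict the least recently used unpinned one) might look like this; the class, helper names, and two-unit configuration are hypothetical, not the hardware design:

```python
import itertools


class HardwareUnit:
    """Toy model of one hardware unit 332's allocation state."""

    def __init__(self):
        self.active = False
        self.pinned = False
        self.last_use = 0
        self.block = None


def allocate_unit(units, block, clock=itertools.count()):
    """Prefer an inactive unit; otherwise evict the least recently
    used unit that is not pinned (pinned units are ineligible)."""
    free = [u for u in units if not u.active]
    victim = free[0] if free else min(
        (u for u in units if not u.pinned), key=lambda u: u.last_use)
    victim.active, victim.pinned, victim.block = True, False, block
    victim.last_use = next(clock)
    return victim
```

Pinning simply removes a unit from the eviction candidate set, which is how a block that is still mid-pattern can survive LRU pressure from newer blocks.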
The prefetch unit 124 operates according to steps 1904 through 1922 to determine whether any active memory block is virtually adjacent to the new memory block, and according to steps 1924 through 1928 to predict, based on the previously detected access pattern and direction, whether memory accesses will continue from the virtually adjacent active memory block into the new memory block, so as to reduce the warm-up time of the prefetch unit 124 and enable it to begin prefetching the new memory block sooner. Flow proceeds to step 1906.

In step 1906, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address (HVAAR) matches the positive-one hashed virtual address (HVAP1) 1606 of any entry of the virtual hash table 162. If so, flow proceeds to step 1908; otherwise, flow proceeds to step 1912.

In step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. Flow proceeds to step 1916.

In step 1912, the prefetch unit 124 determines, based on the comparison performed in step 1904, whether the hashed virtual address (HVAAR) matches the negative-one hashed virtual address (HVAM1) 1602 of any entry of the virtual hash table 162. If so, flow proceeds to step 1914; otherwise, flow ends.

In step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. Flow proceeds to step 1916.

In step 1916, the prefetch unit 124 sets a candidate hash register (not shown) to the value of the unmodified hashed virtual address (HVAUN) 1604 of the virtual hash table 162 entry determined in step 1906 or 1912. Flow proceeds to step 1918.
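The matching logic of steps 1906 through 1916 can be sketched as a scan of the virtual hash table; `find_adjacency` and the triple layout are assumptions made for illustration:

```python
def find_adjacency(vht_entries, hvaar):
    """Scan virtual-hash-table triples (HVAM1, HVAUN, HVAP1).

    A hit on an entry's HVAP1 means the new block lies just above
    that entry's block (candidate direction up); a hit on HVAM1
    means it lies just below (candidate direction down).  On a hit,
    the entry's HVAUN becomes the candidate hash; otherwise None.
    """
    for hvam1, hvaun, hvap1 in vht_entries:
        if hvaar == hvap1:
            return ("up", hvaun)
        if hvaar == hvam1:
            return ("down", hvaun)
    return None
```

The returned candidate hash is what a later stage would compare against each active block's HVAMB field to confirm the virtual adjacency.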
In step 1918, the prefetch unit 124 compares the candidate hash (candidate_hva) with the memory block virtual hash address field (HVAMB) 354 of each active memory block within the prefetch unit 124. Flow proceeds to step 1922.

In step 1922, the prefetch unit 124 determines, based on the comparison performed in step 1918, whether the candidate hash matches any memory block virtual hash address field (HVAMB) 354. If the candidate hash matches a memory block virtual hash address field (HVAMB) 354, flow proceeds to step 1924; otherwise, flow ends.

In step 1924, the prefetch unit 124 has determined that the matching active memory block found in step 1922 is indeed virtually adjacent to the new memory block. The prefetch unit 124 therefore compares the candidate direction (determined in step 1908 or 1914) with the direction register 342 of the matching active memory block, in order to predict, based on the previously detected access pattern and direction, whether memory accesses will continue from the virtually adjacent active memory block into the new memory block. Specifically, if the candidate direction differs from the direction register 342 of the virtually adjacent active memory block, memory accesses are unlikely to continue from the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction. Flow proceeds to step 1926.

In step 1926, the prefetch unit 124 determines, based on the comparison performed in step 1924, whether the candidate direction matches the direction register 342 of the matching active memory block. If so, flow proceeds to step 1928; otherwise, flow ends.

In step 1928, the prefetch unit 124 determines whether the new allocation request received in step 1902 targets a cache line predicted by the pattern register 344 of the matching virtually adjacent active memory block detected in step 1926. In one embodiment, to perform the determination of step 1928, the prefetch unit 124 effectively shifts the pattern register 344 of the matching virtually adjacent active memory block by its pattern period register 346, extending the pattern from its location in the virtually adjacent memory block (per the pattern location register 348) into the new memory block, thereby maintaining the continuity of the pattern 334. If the new allocation request targets a cache line predicted by the pattern register 344 of the matching active memory block, flow proceeds to step 1934; otherwise, flow proceeds to step 1932.

In step 1932, the prefetch unit 124 initializes and populates the new hardware unit 332 (allocated in step 1903) according to steps 406 and 408 of Fig. 4, in the hope that it will eventually detect a new pattern of accesses to the new memory block according to the methods described above with respect to Figs. 4 through 6, which will require warm-up time. Flow ends at step 1932.

In step 1934, the prefetch unit 124 predicts that access requests will continue into the new memory block according to the pattern register 344 and direction register 342 of the matching virtually adjacent active memory block. The prefetch unit 124 therefore populates the new hardware unit 332 in a manner similar to step 1932, but with some differences. Specifically, the prefetch unit 124 populates the direction register 342, the pattern register 344, and the pattern period register 346 with the corresponding values from the hardware unit 332 of the virtually adjacent memory block. Additionally, the new value of the pattern location register 348 is determined by continuing to shift the pattern location by the value of the pattern period register 346 until it crosses into the new memory block, so that the pattern register 344 carries over continuously into the new memory block, as described with respect to step 1928. Furthermore, the status field 356 of the new hardware unit 332 is set to mark the new hardware unit 332 as probationary. Finally, the search pointer register 352 is initialized to begin searching from the beginning of the memory block. Flow proceeds to step 1936.

In step 1936, the prefetch unit 124 continues to monitor access requests to the new memory block. If the prefetch unit 124 detects that at least a predetermined number of subsequent access requests to the memory block are to cache lines predicted by the pattern register 344, the prefetch unit 124 promotes the status field 356 of the hardware unit 332 from probationary to active, and then begins prefetching from the new memory block as described with respect to Fig. 6. In one embodiment, the predetermined number of access requests is two, although other embodiments may contemplate other predetermined numbers. Flow ends at step 1936.

Fig. 20 shows a hashed physical address-to-hashed virtual address thesaurus 2002 for use in the prefetch unit 124 of Fig. 15. The thesaurus 2002 comprises an array of entries, each including a physical address (PA) 2004 and a corresponding hashed virtual address (HVA) 2006. The corresponding hashed virtual address 2006 is the result of hashing the virtual address from which the physical address 2004 was translated. The prefetch unit 124 maintains the most recent pairs in the thesaurus 2002 by snooping the pipeline of the load/store unit 134. In another embodiment, in step 1902 of Fig. 19, the first-level data cache 116 does not provide the hashed virtual address (HVAAR) to the prefetch unit 124, but provides only the physical address associated with the allocation request. The prefetch unit 124 looks up the physical address in the thesaurus 2002 to find a matching physical address (PA) 2004 and obtain the associated hashed virtual address (HVA) 2006, which then serves as the hashed virtual address (HVAAR) in the remainder of Fig. 19. Including the thesaurus 2002 in the prefetch unit 124 relieves the first-level data cache 116 of the need to provide the hashed virtual address required by the allocation request, thereby simplifying the interface between the first-level data cache 116 and the prefetch unit 124.

In one embodiment, each entry of the hashed physical address-to-hashed virtual address thesaurus
2002 includes a hashed physical address rather than a physical address 2004, and the prefetch unit 124 hashes the allocation-request physical address received from the first-level data cache 116 into a hashed physical address, which it looks up in the thesaurus 2002 to obtain the appropriate corresponding hashed virtual address (HVA) 2006. This embodiment allows a smaller thesaurus 2002, at the cost of the additional time required to hash the physical address.

Fig. 21 is a block diagram of a multi-core microprocessor 100 according to an embodiment of the invention. The multi-core microprocessor 100 includes two cores (denoted core A 2102A and core B 2102B), referred to collectively as cores 2102 or individually as a core 2102. Each core has elements similar to those of the single-core microprocessor 100 shown in Fig. 2. Additionally, each core 2102 has a highly reactive prefetch unit 2104. The two cores share the second-level cache 118 and the prefetch unit 124. In particular, each core's first-level data cache 116, load/store unit 134, and highly reactive prefetch unit 2104 are coupled to the shared second-level cache 118 and prefetch unit 124. Additionally, a shared highly reactive prefetch unit 2106 is coupled to the second-level cache 118 and the prefetch unit 124. In one embodiment, the highly reactive prefetch unit 2104 and the shared highly reactive prefetch unit 2106 prefetch only the next sequential cache line after the cache line implicated by a memory access.

In addition to monitoring the memory accesses of the load/store units 134 and the first-level data caches 116, the prefetch unit 124 may also monitor the memory accesses generated by the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 in making its prefetch decisions. The prefetch unit 124 may monitor memory accesses from different combinations of memory-access sources to perform the various functions described herein. For example, the prefetch unit 124 may monitor a first combination of memory accesses to perform the functions described with respect to Figs. 2 through 11, a second combination of memory accesses to perform the functions described with respect to Figs. 12 through 14, and a third combination of memory accesses to perform the functions described with respect to Figs. 15 through 19. In one embodiment, the shared prefetch unit 124 is unable, for timing reasons, to monitor the behavior of the load/store unit 134 of each core 2102 directly. The shared prefetch unit 124 therefore monitors the behavior of the load/store units 134 indirectly, via the traffic generated by the first-level data caches 116 as a result of their load/store misses.

Various embodiments of the invention have been described herein, but those of ordinary skill in the art should understand that these embodiments serve only as examples and are not limiting. Those skilled in the art may make various changes in form and detail without departing from the spirit of the invention. For example, software can enable the function, fabrication, modeling, simulation, description, and/or testing of the apparatus and methods described in the embodiments of the invention, through general programming languages (C, C++), hardware description languages (HDL) (including

Verilog HDL, VHDL, and so on), or other available programming languages. Such software can be disposed on any known computer-usable medium, such as magnetic tape, semiconductor memory, magnetic disk, or optical disc (for example, CD-ROM, DVD-ROM, and the like), or in a transmission medium such as the Internet or a wired, wireless, or other communication medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (embodied in HDL), and transformed into hardware in the production of integrated circuit products. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Accordingly, the invention should not be limited to the disclosed embodiments, but is instead defined by the appended claims and their equivalent implementations. In particular, the invention may be implemented within a microprocessor device used in a general-purpose computer. Finally, although the invention is disclosed above by way of preferred embodiments, they are not intended to limit its scope; those of ordinary skill in the art may make modifications and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 shows the pattern of accesses observed at a second-level cache when executing a program that includes a sequence of store operations through memory.

Fig. 2 is a block diagram of a microprocessor according to the invention.

Fig. 3 is a more detailed block diagram of the prefetch unit of Fig. 2.

Fig. 4 is a flowchart of the operation of the microprocessor of Fig. 2, and in particular of the prefetch unit of Fig. 3.

Fig. 5 is a flowchart of the operation of the prefetch unit of Fig. 3 with respect to the steps of Fig. 4.

Fig. 6 is a flowchart of the operation of the prefetch unit of Fig. 3 with respect to the steps of Fig. 4.

Fig. 7 is a flowchart of the operation of the prefetch request queue of Fig. 3.
Fig. 8 shows two access points within a memory block, used to illustrate the bounding-box prefetch unit of the invention.

Fig. 9 is a block diagram of an example of operation of the microprocessor of Fig. 2.

Fig. 10 is a block diagram of an example of operation of the microprocessor of Fig. 2, continuing the example of Fig. 9.

Fig. 11 is a block diagram of an example of operation of the microprocessor of Fig. 2, continuing the examples of Figs. 9 and 10.

Fig. 12 is a block diagram of a microprocessor according to another embodiment of the invention.

Fig. 13 is a flowchart of the operation of the prefetch unit of Fig. 12.

Fig. 14 is a flowchart of the operation of the prefetch unit of Fig. 12 according to the steps of Fig. 13.

Fig. 15 is a block diagram of a microprocessor having a bounding-box prefetch unit according to another embodiment of the invention.

Fig. 16 is a block diagram of the virtual hash table of Fig. 15.

Fig. 17 is a flowchart of the operation of the microprocessor of Fig. 15.

Fig. 18 shows the contents of the virtual hash table of Fig. 16 after operation of the prefetch unit according to the example described with respect to Fig. 17.

Fig. 19 (comprising Figs. 19A and 19B) is a flowchart of the operation of the prefetch unit of Fig. 15.

Fig. 20 is a block diagram of a hashed physical address-to-hashed virtual address thesaurus used in the prefetch unit of Fig. 15 according to another embodiment of the invention.

Fig. 21 is a block diagram of a multi-core microprocessor of the invention.
DESCRIPTION OF REFERENCE NUMERALS

100: microprocessor
102: instruction cache
104: instruction decoder
106: register alias table
108: reservation stations
112: execution units
132: other execution units
134: load/store unit
124: prefetch unit
114: retire unit
116: first-level data cache
118: second-level cache
122: bus interface unit
162: virtual hash table
198: queue
172: first-level data search pointer
178: first-level data pattern address
196: first-level data memory address
194: pattern-predicted cache line address
192: cache line allocation request
188: cache line data
354: memory block virtual hash address field

356: status field
302: block bitmask register
303: block number register
304: minimum pointer register
306: maximum pointer register
308: minimum-change counter
312: maximum-change counter
314: total counter
316: middle pointer register
318: period match counters
342: direction register
344: pattern register
346: pattern period register
348: pattern location register
352: search pointer register
332: hardware unit
322: control logic
328: prefetch request queue
324: pop pointer
326: push pointer
2002: hashed physical address-to-hashed virtual address thesaurus
2102A: core A
2102B: core B
2104: highly reactive prefetch unit
2106: shared highly reactive prefetch unit

Claims (1)

VII. Claims:

1. A prefetcher, disposed in a microprocessor having a cache memory, the prefetcher being configured to receive a plurality of access requests each to an address within a memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time, the prefetcher comprising: a storage device; and control logic, coupled to the storage device, wherein as the access requests are received the control logic is configured to: maintain, in the storage device, a largest address and a smallest address of the access requests, and counts of changes to the largest and smallest addresses; maintain a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determine an access direction based on the counts; determine an access pattern based on the history; and prefetch into the cache memory, in the access direction according to the access pattern, cache lines of the memory block that the history indicates have not recently been accessed.

2. The prefetcher of claim 1, wherein the control logic is further configured to refrain from prefetching until the number of recently accessed cache lines of the memory block exceeds a predetermined value.

3. The prefetcher of claim 2, wherein the predetermined value is at least nine.

4. The prefetcher of claim 2, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block.

5. The prefetcher of claim 1, wherein, to determine the access direction based on the counts, the control logic is configured to: determine that the access direction is upward when the count of changes to the largest address exceeds the count of changes to the smallest address by more than a predetermined value; and determine that the access direction is downward when the count of changes to the smallest address exceeds the count of changes to the largest address by more than the predetermined value.

6. The prefetcher of claim 1, wherein the control logic is further configured to refrain from prefetching until the absolute value of the difference between the count of changes to the largest address and the count of changes to the smallest address exceeds a predetermined value.

7. The prefetcher of claim 1, wherein: the history comprises a bitmask indicating the recently accessed cache lines implicated by the access addresses within the memory block; and, as the access requests are received, the control logic is further configured to: compute a middle pointer of the recently accessed cache lines within the bitmask; and, for each of a plurality of distinct bit periods, increment a match counter associated with the bit period when the N bits of the bitmask to the left of the middle pointer match the N bits of the bitmask to the right of the middle pointer, where N is the number of bits of the bit period.

8. The prefetcher of claim 1, wherein, to determine the access pattern based on the bitmask, the control logic is configured to: detect that the match counter associated with one of the bit periods exceeds the match counters associated with the other bit periods by more than a predetermined value; and determine the access pattern to be the N bits of the bitmask on one side of the middle pointer, where N is the number of bits of the one bit period whose match counter exceeds the match counters of the other bit periods by more than the predetermined value.

9.
The prefetcher of claim 8, wherein, to prefetch into the cache memory, in the access direction according to the access pattern, the cache lines of the memory block that the bitmask indicates have not recently been accessed, the control logic is configured to: maintain, in the access direction, a search pointer and the access pattern at a distance of N bits from the middle pointer; and, when the bit of the access pattern at the search pointer indicates an access, prefetch the cache line associated with the corresponding bit of the bitmask at the search pointer.

10. The prefetcher of claim 9, wherein, to prefetch the cache lines, the control logic is further configured to: increment or decrement the search pointer according to the access direction; and, when the bit of the access pattern at the incremented or decremented search pointer indicates an access, prefetch the cache line associated with the corresponding bit of the bitmask at the incremented or decremented search pointer.

11. The prefetcher of claim 10, wherein the control logic is further configured to repeat the incrementing or decrementing of the search pointer and the prefetching until a condition occurs, the condition comprising: when the access direction is upward, the distance between the bit of the bitmask at the search pointer and the bit of the bitmask associated with the largest address is greater than a second predetermined value; and, when the access direction is downward, the distance between the bit of the bitmask at the search pointer and the bit of the bitmask associated with the smallest address is greater than the second predetermined value.

12. The prefetcher of claim 7, wherein the control logic is further configured to refrain from prefetching until the match counter associated with one of the distinct bit periods exceeds the match counters associated with the other distinct bit periods by more than a predetermined value.

13. The prefetcher of claim 1, wherein the bit periods are 3, 4, and 5 bits.

14. The prefetcher of claim 1, wherein the control logic is further configured to forgo prefetching a cache line that is already present in any cache memory of the microprocessor.

15. The prefetcher of claim 1, wherein the size of the memory block is 4 kilobytes.

16. The prefetcher of claim 1, further comprising: a plurality of the storage devices; wherein the control logic is configured to receive an access request whose address is within a new memory block associated with none of the storage devices, and to allocate one of the storage devices to the new memory block.

17. The prefetcher of claim 16, wherein the control logic is further configured to clear the count of changes to the largest address, the count of changes to the smallest address, and the history of the allocated one of the storage devices.

18. A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: receiving a plurality of access requests each to an address within a memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time; as the access requests are received, maintaining a largest and a smallest address within the memory block, and counting changes to the largest and smallest addresses; as the access requests are received, maintaining a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determining an access direction based on the counts; determining an access pattern based on the history; and prefetching into the cache memory, in the access direction according to the access pattern, cache lines of the memory block that the history indicates have not recently been accessed.

19. The method of claim 18, further comprising refraining from prefetching until the number of recently accessed cache lines of the memory block exceeds a predetermined value.

20. The method of claim 19, wherein the predetermined value is at least nine.

21. The method of claim 19, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block.

22.
如申請專利範圍第18項所述之資料預取方法, 其中為了上述之根據上述計數值決定上述存取方向更包 括: 當上述最大位址之變化的計數值與上述最小位址之變 化的計數值之間的差值係大於一既定值時,決定上述存取 方向係向上;以及 當上述最小位址之變化的計數值與上述最大位址之變 0608-A43067TW/fmal 64 201135460 化的計數值之間的i 取方向係向下。值係大於上述既定值時,決定上述存 23· &gt;申請專利範 更包括在上迷最大㈣弟】8項所述之資料預取方法, 變化的計數值之變化的計數值與上述 最小位址之 缓上述預取動作。俊的絕對值係、大於-既定值之前,暫 其中一請專利範圍第—資料預取方法, 上述歷史記錄 _ 出上述最近被存取之 位70遮罩,上述位元遮罩用以指 線係與上述記憶體區塊之1 止述最近被存取之快取 當已接ί上述存取時,更包括: 十算在上述位元遮罩中之上述最 線的:中間指標暫存器;以及 破存取之快取 ν二Ί&quot;4中間指標暫存器之左側的上述位元遮罩之 兀/、上速中間指標暫存器之右 :、 之Ν位元匹配時,為複數個不同的仅=兀遮罩 bitpe—中之每—者,增加上述位 U(d_Ct 匹配計數器的計數值,1中 巧摘相關之一 數。 ,、?N為上輕元週期中位元 25·如申請專利範圍第24項所述之資料 其中為了根據上述位元鮮決定上述存取樣能包括., 债測上述位元週期之者所相關的上 述位元週期之其它者所相關的上述匹配計數哭 ^上 係否係大於一既定值;以及 。3的差值 0608-A43067TW/flnal 65 201135460 決定被上述位元遮罩之上述中間指標暫存器之其中一 側的N位元所指定的上述存取樣態,其中N為上述位元週 期之一中之位元的號碼,上述位元週期之上述一者所具有 之相關匹配計數器與上述位元週期之其它者所具有之相關 匹配計數器之間的差值大於上述既定值之上述計數器關於 上述所有其他清楚位元週期之間的差。 26. 如申請專利範圍第25項所述之資料預取方法, 其中為了根據上述存取樣態並沿著上述存取方向,將上述 記憶體區塊中被上述位元遮罩標指為最近尚未被存取之快 取線預取至上述快取記憶體中,上述控制邏輯係用以: 沿著上述存取方向,分派一搜尋指標器以及距離上述 中間指標器N位元之上述存取樣態;以及 當上述搜尋指標器上之上述存取樣態中的位元指示一 存取時,預取上述搜尋指標器上之上述位元遮罩中之上述 位元所相關之快取線。 27. 如申請專利範圍第26項所述之資料預取方法, 其中為了根據上述存取樣態並沿著上述存取方向,將上述 記憶體區塊中被上述位元遮罩標指為最近尚未被存取之快 取線預取至上述快取記憶體中,更包括: 根據上述存取方向,增加/減少上述搜尋指標器的值; 以及 當增加/減少後之上述搜尋指標器上之上述存取樣態中 的位元指示一存取時,預取上述已增加/減少之上述搜尋指 標器上之上述位元遮罩中之上述位元所相關之快取線。 28. 如申請專利範圍第27項所述之資料預取方法, 0608-A43067TW/fmal 66 201135460 更包括: 作,ί = 尋指標器的值以及進行預取的動 ^狀況出現,其中上述狀況包括: 田上核取方向係向上時,上述搜尋指標器上之 上^ =罩之中的位70與在上述最大位址所相關之 :位兀遮罩之中的位元之間的距離係大 疋值;以及 當上述存取方向係向下時,上述搜尋指標器上之 ^:立ϋ遮罩之巾的位元與上述最小位址所 3元遮罩之令的位元之間的距離係大於上述第二既 疋值。 Μ.如申請專利範圍第24項 更包括在上述不同位元週期之㈣取方法 哭盥 ^ 者所相關的上述匹配計數 兀週期之其它者所相關的上述匹配計數器 之間的差值係大於-岐值之前,暫緩上述預取動作。。 3〇·如申料利範圍第18項所述之資料預取方法, ,、中上述位元週期為3、4以及5位元。 扎如申請專利範圍第18項所述 快取線已出現在上述微處理器之任—快取記 隐體%,放棄預取上述快取線。 32. 如申請專利範圍帛18項所述之資料預 其中上述記憶體區塊之大小係4千位元組。 ,, 33. τ種電腦程式產品,編喝於至少_電腦可讀取媒 上’並適用於-計算裝置,上述電腦程式產品包括. 
一::可讀程式編碼?存於上述電腦可讀取媒體 201135460 中,用以在具有一快取記憶體之一微處理器中,定義出 (specify)—預取單元,上述電腦可讀程式包括: 其中上述預取單元係用以接收對一記憶體區塊之複數 位址的複數存取要求,每一存取要求對應上述記憶體區塊 之位址中之一者,並且上述存取要求之位址係隨著時間函 數非單調性地(non-monotonically)增加或減少;; 一第一程式碼,用以定義出一儲存裝置;以及 一第二程式碼,用以定義出一控制邏輯,耦接至上述 儲存裝置,其中當接收到上述存取要求時,上述控制邏輯 則用以: 藉由上述儲存裝置維持之上述存取之一最大位址 以及一最小位址,並且計算上述最大以及最小位址之 變化的計數值; 藉由上述記憶體區塊維持上述記憶體區塊中最近 被存取之快取線的一歷史記錄; 根據上述計數值決定一存取方向; 根據上述歷史紀錄決定一存取樣態;以及 根據上述存取樣態並沿著上述存取方向,將上述 快取記憶體内尚未被上述歷史記錄指示為已存取之快 取線預取至上述記憶體區塊中。 34. 如申請專利範圍第33項所述之電腦程式產品, 其中上述至少一電腦可讀媒體係擇自於一碟片、磁帶或者 其他具磁性、光學或者電子儲存媒體以及一網路、線路、 無線或者其他通訊媒體。 35. 一種資料預取方法,用以預取資料進入一微處理 0608-A43067TW/fma] 68 201135460 , 器之一快取記憶體,上述資料預取方法包括: 接收對一記憶體區塊之一位址的一存取要求; 設定一位元遮罩中與一快取線所相關之一位元,其中 上述快取線係與上述記憶體區塊之上述位址相關; 於接收到上述存取要求之後,增加一總計數器之計數 值: 當上述位址大於一最大指標暫存器的值,用上述位址 更新上述最大指標暫存器,並且增加一最大改變計數器之 計數值; 當上述位址小於一最小指標暫存器,用上述位址更新 上述最小指標暫存器,並且增加一最小改變計數器之計數 值; 計算一中間指標暫存器,作為上述最大以及最小改變 計數器之平均值; 當上述中間指標暫存器之左側的上述位元遮罩之N位 元與上述中間指標暫存器之右側的上述位元遮罩之N位元 匹配時,為複數個不同的位元週期(distinct bit period)中之 每一者,增加上述位元週期所相關之一匹配計數器的計數 值,其中N為上述位元週期中之位元數; 決定一狀況是否出現,其中上述狀況包括: (A) 上述存取總計數器大於一第一既定值; (B) 上述最大改變計數器與最小改變計數器相減 取絕對值後的差係大於一第二既定值;以及 (C) 上述匹配計數器之一者與其它者間之計數值 間之差值的絕對值係大於一第三既定值;以及 0608-A43067TW/final 69 201135460 當上述狀況存在時: 當上述最大改變計數器大於上述最小改變計數器 時,決定上述存取方向係向上,並且當上述最大改變 計數器小於上述最小改變計數器時,決定上述存取方 向係向下; 決定被上述位元遮罩之上述中間指標暫存器之其 中一側的N位元所指定的上述存取樣態,其中N為上 述位元週期中與上述最大匹配計數器相關之一者的位 元數;以及 根據所決定之上述存取方向與上述存取樣態,將 上述記憶體區塊之複數快取線預取至上述快取記憶體 中。 36.如申請專利範圍第36項所述之資料預取方法, 其中上述根據所決定之上述存取方向與上述存取樣態,將 上述快取線預取至上述快取記憶體中的步驟包括: (1) 沿著上述存取方向,初始化一搜尋指標器以及距離 上述中間指標器N位元之上述存取樣態; (2) 決定一第二狀況是否存在,其中上述第二狀況包 括: (D)在上述搜尋指標器之上述存取樣態的位元已 •i-fL · δ又疋, (Ε)在上述搜尋指標器之上述位元遮罩的位元已 清除;以及 (F)在上述存取方向上,上述最大/最小指標器與 上述搜尋指標器之上述位元遮罩中之位元間之差距係 0608-A43067TW/fmal 70 201135460 ; 小於一第四既定值;以及 (3)當上述第二狀況存在,預取上述搜尋指標器之上述 位元遮罩中之位元所相關的上述快取線。 37. 如申請專利範圍第36項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 於上述第二狀況存在時,在決定上述第二狀況存在以 及存取之後,根據上述存取方向,增加/減少上述搜尋指標 器的值;以及 重複上述步驟(2)以及(3)。 38. 如申請專利範圍第37項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 當上述狀況(F)為真,停止上述重複步驟。 39. 
如申請專利範圍第37項所述之資料預取方法, 其中上述根據所決定之上述存取方向與存取樣態,將上述 快取線預取至上述快取記憶體的步驟更包括: 當上述位元遮罩之所有位元都已測試完,停止上述重 複步驟。 40. —種微處理器,包括: 複數核心; 一快取記憶體,由上述核心所共享,用以接收對一記 憶體區塊之複數位址的複數存取要求,每一存取要求對應 上述記憶體區塊之位址中之一者,上述存取要求之位址係 隨著時間函數非單調性地(non-monotonically)增加或減 0608-A43067TW/final 71 201135460 少;以及 一預取單元,用以: 監視上述存取要求,並維持上述記憶體區塊中之一最 大位址以及一最小位址,以及上述最大位址以及最小位址 之變化的計數值; 根據上述計數值,決定一存取方向;以及 沿著上述存取方向,將上述記憶體區塊中未命中之快 取線預取至上述快取記憶體中。 41. 如申請專利範圍第40項所述之微處理器,其中 上述預取單元更用以: 維持上述記憶體區塊中最近被存取之快取線的一歷史 記錄,上述最近被存取之快取線係與上述存取要求之位址 相關; 根據上述歷史記錄,決定一存取樣態;以及 根據上述存取樣態並沿著上述存取方向,將上述快取 記憶體内被上述歷史記錄指示為最近尚未被存取且在上述 記憶體區塊中是未命中的複數快取線預取至上述記憶體區 塊中。 42. 一種微處理器,包括: 一第一級快取記憶體; 一第二級快取記憶體;以及 一預取單元,用以: 偵測出現在上述第二級快取記憶體中之最近存取 要求之一方向以及樣態,以及根據上述方向以及樣態,將 複數快取線預取至上述第二級快取記憶體中; 0608-A43067TW/fma! 72 201135460 : 從上述第一級快取記憶體,接收上述第一級快取 記憶體所接收之一存取要求之一位址,其中上述位址與一 快取線相關, 決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級 快取記憶體中。 43. 如申請專利範圍第42項所述之微處理器,其中: 為了偵測出現在上述第二級快取記憶體中上述最近存 取要求之上述方向以及樣態,上述預取單元係用以偵測一 記憶體區塊之上述方向以及樣態,上述記憶體區塊係可被 上述微處理器存取之記憶體範圍之一小集合; 為了決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線,上述預取單元係用以: 放置上述樣態至上述記憶體區塊,使得上述位址 位於上述樣態中;以及 沿著上述方向,由上述位址開始搜尋,直到遇到 上述樣態所指出之一快取線。 44. 如申請專利範圍第43項所述之微處理器,其中: 上述樣態包括快取線之一順序; 其中為了放置上述樣態至上述記憶體區塊,使得上述 位址位於上述樣態中,上述預取單元係用以藉由上述順序 將上述樣態轉移(shift)至上述記憶體區塊。 45. 如申請專利範圍第43項所述之微處理器,其中 出現在上述第二級快取記憶體中之上述記憶體區塊的上述 0608-A43067TW/final 73 201135460 最近存取要求之上述位址係隨著時間函數而非單調性地 (non-monotonically)增力口 以及減少。 46. 如申請專利範圍第45項所述之微處理器,其中 出現在上述第二級快取記憶體中之上述記憶體區塊的上述 最近存取要求之上述位址可為非連續的(non-sequentail)。 47. 如申請專利範圍第42項所述之微處理器,更包 括: 複數核心;其中 上述第二級快取記憶體以及預取單元係由上述核 心所共享;以及 每一上述核心包括上述第一級快取記憶體之一不 同之範4歹1Ka distinct instantation)。 48. 如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第一級快取 記憶體中,上述預取單元係用以提供上述一個或多個快取 線之位址至上述第一級快取記憶體,其中上述第一級快取 記憶體係用以從上述第二級快取記憶體中要求上述一個或 多個快取線。 49. 如申請專利範圍第48項所述之微處理器,其中 上述第一級快取記憶體包括一佇列,用以儲存從上述預取 單元所接收之上述位址。 50. 
如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第一級快取 記憶體中,上述預取單元係從上述微處理器之一匯流排介 面單元要求一個或多個快取線,並且隨後將提供上述所要 0608-A43067TW/final 74 201135460 求到之快取線提供至上述第一級快取記憶體。 、51.如申請專利範圍第42項所述之微處理器,其中 為了導致上述一個或多個快取線被預取至上述第—級快取 記憶體中,上述預取單元係用以自上述第二級快取記憶體 中要求上述一個或多個快取線。 一 52. 如申請專利範圍第51項所述之微處理器,其中 上述預取單元係用以將上述被所要求到之快取線隨後提供 至上述第一級快取線。 ’、 53. 如申明專利範圍第51項所述之微處理器,其中 上述第二級快取記憶體係用以所要求之快取線隨後提供至 上述第一級快取線。 54. 如申明專利範圍第42項所述之微處理器,其中 上述預取單元偵測上述方向以及樣態的步驟,包括: 曰當接收到上述最近存取要求時,轉—記憶體區塊之 一最大位址以及一最小位址,以及上述最大位址以及上述 最小位址之改變的計數值; 當接收到上述最近存取要求時,維持上述記憶體區塊 之上述存取位址所相關之最近存取之快取線之一歷史呓 錄;以及 σ 根據上述計數值,決定上述方向;以及 根據上述歷史記錄,決定上述樣態。 55*如申請專利範圍第54項所述之微處理器,上述 根據上述計數值決定上述方向的步驟包括: 當上述最大位址之變化之計數值與上述最小位址之變 化之計數值間之差值係大於一既定值時,決定上述方向係 0608-A43067TW/final 201135460 向上;以及 當上述最小位址之變化 化之計數值間之差值係 /、上述最大位址之變 係向下。 核定值時,決定上述方向 56&gt;如申請專利範圍第42項所、十、 上述歷史記錄包括—位元遮罩斤迷之微處理器,其中·· 區塊之上述存取位址所相 孕 用以指出上述記憶體 當接收到上述存取要快取線,· 下列步驟·· 了上述預取單元更包括進行 .計算上述位元遮罩 一令間指標暫存器;以及 砍最近存取之快取線的 當上述令間指標暫存 N位與上述中間指 胃1的上述位^遮罩之 之N位元匹配時,為複數個盗之右則的上述位元遮罩 bit period)中之每—者,姆 同的位元週期(distinct 匹配計數器的計數值,^口土述位元週期所相關之一 元數。 一為上述位元週期中之位 57.如申請專利範圍苐% 上述根據上述位元遮罩決、述之微處理器,其中 偵測上述位元週期之 Ά取樣態的步驟包括: 上述位元週期之其它者:相關的上述匹配計數器與 值是否大於—既定值;以丨、上述匹配計數器之間的差 決疋被上述位元遮罩 側的N位元所指定的:述間f標暫存器之其卜 期之一者的位元數,上 、’〜、〇中N為上述位元週 0608-A43067TW/f,nal 疋週期之上述一者所耳有之相 76 201135460 關匹崎㈣與上述位元週期之其它者所具有之相關匹配 计數益之間的差值大於上述既定值。 58. 一種資料預取方法,用以預取資料至具有一第二 級快取記憶體之―微處理以—第-級錄錢體,上述 資料預取方法包括: 偵測出現在上述第二級快取記憶體中之最近存取要求 之-方向以及樣態,以及根據上述方向以及樣態,將複數 快取線預取至上述第二級快取記憶體中; 線相關 從上述第一級快取記憶體’接收上述第一級快取呓憶 :所接收之-存取要求之—位址,其中上述位址與一快取 決疋在上述方向中所相關之快取線之後被上述樣態所 指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級快取 記憶體中。 、 59.如申請專利範圍第58項所述之資料預取 其中: 上述偵測出現在上述第二級快取記憶體中上述最近存 取要求之上述方向以及樣態的步驟,包括偵測—記憶體區 塊之上述方向以及樣態,上述記憶體區塊係可被上 理器存取之記憶體範圍之一小集合; ^ 決定在上述方向巾所相關之快取線之後被上述樣態所 指出之一個或多個快取線的步驟,包括: 放置上述樣態至上述記憶體區塊,使得上述位址 位於上述樣態中;以及 〇60S-A43067TW/flnaI 77 201135460 沿著在上述方向,由上述位址開始搜尋,直到遇 到上述樣態所指出之一快取線。 60·如申請專利範圍第59項所述之資料預取方法, 其中上述樣態包括快取線之一順序,並且放置上述樣態至 上述記憶體區塊,使得上述位址位於上述樣態中的步驟, 包括藉由上述順序將上述樣態轉移至上述記憶體區塊。 61. 
如申請專利範㈣59項所述之資料預取方法, 其中出現在上述第二級快取記憶體中之上述記憶體區塊的 上迤最近存取要求之上述位址係隨著時間函數而非單調性 地(non_monotonically)增加以及減少。 62. 如申請專利範圍第61項所述之資料預取方法, 其中出現在上述第二級快取記憶體中之上述記憶體區塊的 上逑最近存取要求之上述位址可為非連續的 (non-sequentail) 〇 .如申5月專利範圍第58項所述之資料預取方法, 述微處理器更包括複數核心,並且上述第二級快取 體以及預取單S係由上述核心所共享,並且每一上述 核心包括上述第—級快取記憶體之—不同之範例。 • b中明專利圍第58項所述之資料預取方法, 述一個或多個快取線被預取至上述第-級快取 ^、二、y驟’包括上述微處理器之一預取單元用以提供 $了個或多個快取線之位址至上述第—級快取記憶體, 體弟—級快取記憶體係用以從上述第二級快取記憶 體中要求上述-個或多個快取線。 _-M36〇L^m利範㈣%項所述之㈣預取方法, 78 201135460 其中導致上述一個或多個快取線被預取至上述第_級快取 3己憶體的步驟,包括上述微處理器之一預取單元用以提供 上述一個或多個快取線之位址至上述第一級快取記憶體 中,其中上述第一級快取記憶體自上述微處理器之—匯流 排介面單元用以要求上述一個或多個快取線’並且隨後將 上述要求之一個或多個快取線提供至上述第一級快取記憔 66.如申請專利範圍第58項所述之資料預取方法, 其中導致上述一個或多個快取線被預取至上述第一級快取 記憶體的步驟,包括上述預取單元係用以自上述第二級快 取έ己憶體中要求上述一個或多個快取線。 67.如申請專利範圍第66項所述之資料預取方法, 其中導致上述-個或多錄取線被預取至上述第—級快取 記憶體的步驟’包括上述預取單元係用以將所要求之」個 或多個快取線隨後提供至上述第—級快取線。 如申明專利範圍第66項所述之資料預取方法, 更包括上述第二級快取記憶體剌以將所要求之—個 個快取線隨後提供至上述第—㈣取線。 / 69. -種電腦程式產品’編碼於至少 體之上’並適用於一呻笞挺罢l 电腼h貝取媒 一册 、裝置,上述電腦程式產品包括: 中,“二,式編碼’館存於上述電腦可讀取媒體 中義—微處理器,上述電腦可讀程式包括: -第二::碼’用以定義一第一級快取記憶體裝置; 以及 工碼,用以定義一第二級快取記憶體裝置,· 0608-A43067TW/fJnaj 79 201135460 第二程式碼,用以定義—預取單元,使得上述預取 早元用以: 、、#測出現在上述第二級快取記憶體中之最近存取 =求之t向以及樣態’以及根據上述方向以及樣態,將 複數快取線預取至上述第二級快取記憶體中; 從上述第一級快取記憶體,接收上述第一級快取 吕己憶體所接收之7¾. g» φ ^ 仔取要求之一位址,其中上述位址與一 快取線相關; 決定在上述方向中所相關之快取線之後被上述樣 態所指出之一個或多個快取線;以及 導致上述一個或多個快取線被預取至上述第一級 快取記憶體中。 70. 如申請專利範㈣69項所述之電腦程式產品, 其中上述至少-電腦可讀媒體係擇自於一碟片、磁帶或者 其他具磁性、光學或者電子料雜以及-網路、線路、 無線或者其他通訊媒體。 71. 一種微處理器,包括·· 一快取記憶體;以及 一預取單元,用以: 、、偵測具有一第一記憶體區塊之複數記憶體存取要 求之樣恶,並且根據上述樣態從上述第一記憶體區 塊預取複數快取線至上述快取記憶體中; 監視一第二記憶體區塊之一新的記憶體存取 求; 決定上述第一記憶體區塊是否虛擬鄰近於上 〇608-A43067TW/fmal 义牙》 201135460 二記憶體區塊,並且當自上述第一記憶體區塊延續至 上述第二記憶體區塊時,則決定上述樣態是否預測到 上述第二記憶體區塊之新的記憶體存取要求所相關之 —快取線在上述第二記憶體區塊中;以及 根據上述樣態,響應地(responsively)從上述第二 記憶體區塊將上述絲線預取至上述快取記憶體中。 72. 如申請專利範圍第71項所述之微處理器, 其中上述第一以及第二記憶體區塊之大小對應於—每 體記憶體分頁之大小。 m 73. 
如申請專利範圍第71項所述之微處理哭, 其中上述微處理器包括一第二級快取記憶體,其;上 述新的δ己憶體存取要求包括自上述微處理器之—第— 級快取記憶體至上述第二級快取記憶體的要求,用以 分派上述第二記憶體區塊之上述快取線。 74如申請專利範圍第71項所述之微處理器, ’、中為了彳貞測上述第_記憶體區塊之上述記憶體存取 要求之上述樣態,上述預取單元係用則貞測上述記恨 體存取要求之一方向;並且 〜 —為了決疋上述第一記憶體區塊是否虛擬鄰近於上 述第-記憶體區塊’上述預取單元用以決定上述第一 記憶體區塊在上述方向中是否虛擬鄰近於 憶體區揄。 1 、、如申請專利範圍第74項所述之微處理器, 二中上述第—§己憶體區塊之上述記憶體存取要求的上 述位址係__函數非單調性 0608-A43067TW/f,nal 201135460 76·如申請專利範圍第74項所述之微處理器, 其中當自上述第一記憶體區塊延續至上述第二記憶體 區塊時,為了決定上述樣態是否預測到上述第二記憶 體區塊之上述新的記憶體存取要求所相關之上述快取 線,上述第二記憶體區塊中,上述預取單元係用以在 沿著上述方向自上述第一記憶體區塊延續至上述第二 記憶體區塊時,決定上述樣態是否預測上述第二記憶 體區塊之上逑新的記憶體存取要求所相關之上述快取 線在上述第二記憶體區塊中。 7?,如申請專利範圍第74項所述之微處理器, 其中為了根據上述樣態自上述第二記憶體區塊將上述 决取線預取至上述快取記憶體中,上述預取單元係用 以根據上述樣態且沿著上述方向,自上述第二記憶體 區塊將上述絲線預取至上述絲記憶體中。 如申π專利範圍第71項所述之微處理哭, 其中上述樣態包括上述第—記憶體區塊之複數快ς線 的:順序’其中當自上述第一記憶體區塊延續至上述 第二記憶體區塊時’為了決定上述樣態是否預測到上 述第二記憶體區塊之上述新的記憶體存取要求所相關 之上述快取線在上述第二記憶體區塊中,上述預取單 7G係用以在根據上述快取線的一順序自上述第一記憶 f區塊延續至上述第二記憶體區塊時,決定上述樣態 疋否預測到上述第二記憶體區塊之上述新的記憶體存 =要求所相關之上述快取線在上述第二記憶體區塊 0608-A43067TW/final 82 201135460 79. 如申請專利範圍第71項所述之微處理器, 其中上述快取單元更用以等待根據上述樣態從上述第 二記憶體區塊上述快取線預取至上述快取記憶體中, 直到當自上述第一記憶體區塊延續至上述第二記憶體 區塊時,決定上述樣態是否預測到在上述新的記憶體 存取要求之後有上述第二記憶體區塊之至少一既定值 之記憶體存取要求的每一者所相關之一快取線。 80. 如申請專利範圍第71項所述之微處理器, 其中上述後續的記憶體存取要求之既定數量為2。 81. 如申請專利範圍第71項所述之微處理器, 其中預取單元更用以: 維持由複數項目所構成之一項目表,其中上述項 目表之每一項目包括第一、第二以及第三欄位,其中 上述第二欄位保持(hold) —最近存取之記憶體區塊之 虛擬位址的代表值,其中上述第一欄位保持在一方向 與上述最近存取之記憶體區塊虛擬相鄰之一記憶體區 塊之虛擬位址的代表值,其中上述第三攔位保持在另 一方向與上述最近存取之記憶體區塊虛擬相鄰之一記 憶體區塊之虛擬位址的代表值。 82. 如申請專利範圍第81項所述之微處理器, 其中為了決定上述第一記憶體區塊是否虛擬相鄰於上 述第二記憶體區塊,上述預取單元係用以: 決定上述第二記憶體區塊之虛擬位址之代表值是 否匹配於上述項目表之項目之一者的上述第一欄位或 者第三攔位;以及 0608-A43067TW/final 83 201135460 =麵匹配之上述項目之上㈣二般否匹配 於上述第一記憶體區塊之虛擬位址之代表值。 83.如申請專利範圍第8ι項所述之微處理器, ,、中為了維持上述表,上述預取單it係用以: 根據-先進先出的方式,將上述項目 目表中’以回應上述微處理器之—载 Ί 生之記憶體存取要求。 载入/儲存早續產 84. 如申請專利範圍第81項所述之微處理哭, 其中上述記憶體區塊之上述虛擬位址的代表值包括 述記憶韙區塊之虛擬位址之一雜湊之位元。 上 85. 
如申請專利範圍第84項所述之微處理器, 其中上述記憶體區塊之虛擬位址之上述雜湊之位_么 根據下列演算法則之一雜湊,其中hashjj]表示第.ίτ' 雜湊之位元,以及VA[k]表示第k個上述記憶體區: 之虛擬位址的位元: hash[5]=VA[29]AVA[18]AVA[17]; hash[4]=VA[28]AVA[19]AVA[16]; hash[3]=VA[27]AVA[20]AVA[15]; hash[2]=VA[26]AVA[21]AVA[14]; hash[l]=VA[25]AVA[22]AVA[13]; hash[0]=VA[24]AVA[23]AVA[12]。 86.如申請專利範圍第71項所述之微處理器,更包 括複述核心,其中上述快取記憶體以及預取單元由上述检 心所共享。 0608-A43067TW/final 84 201135460 87' 一種資料預取方法,用以預取資料至一微處理器 之一快取記憶體,上述資料預取方法包括: °° 偵測具有一第一記憶體區塊之複數記憶體存取要求之 一樣態,並且根據上述樣態從上述第一記憶體區塊預取快 取線至上至上述快取記憶體中; 、 監視一第一記憶體區塊之一新的記憶體存取要求; 決疋上述第一 ^己憶體區塊是否虛擬鄰近於上述第二呓 憶體區塊,並且當自上述第一記憶體區塊延續至上述第二 記憶體區塊時,決定上述樣態是否預測到上述第二記憶&amp; 區塊之新的記憶體存取要求所相關之—快取線在上述第二 記憶體區塊中;以及 根據上述樣態,從上述第二記憶體區塊將複數快取線 預取至上述快取記憶财,以回應上述決定步驟。 队如申請專利範圍帛87項所述之資料預取方法, 其中上述第一以及第二記憶體區塊之大小 憶體分頁之大小。 κ此冗 89.如申請專利範圍第87項所述之資料預取方法, 其中上述微處理器包括—第二級快取記憶體,並中上述 的記憶體存取要求包括自上述微處 ^ 5 μ 第—級快取記 隐胆至上述第二級快取記憶體的要求,用以分派上 記憶體區塊之上述快取線。 一 ”請專利範圍第87項所述之資料預取方法, /、中偵測上㈣-記憶體區叙魏記㈣ 的步驟’更包括偵測上述記憶體存取存取之一方 定上述第―記憶㈣塊^麵料於上述第 201135460 二1 己憶體區塊的步驟,更包括衫上述第-記憶體區塊是 否在上述方向中虛擬鄰近於上述第二記憶體區塊。 儿#申請專利範圍第9〇項所述之資料預取方法, 二中具有記憶體區塊之上述記憶體存取存取的上 处位址倾著時間函數非單雛地增加或者減少。 :2. ”請專利範圍第9〇項所述之資料預取方法, ;中:自上述第一記憶體區塊延續至上述第二記憶體區塊 上述樣態是否預_上述第二記憶體區塊之上述 取要求所相關之上述快取線在上述第二記憶 中的步驟包括在沿著上述方向自上述第一記憶體區 =至上述第二記憶體區塊時’決定上述樣態是否預測 一體區塊之上述新的記憶财取要求所相關之 快取線在上述第二記憶體區塊中。 广”請專利範圍第90項所述之資料預取方法, :根據上述樣態自上述第二記憶體區塊將複數快取線預 、、至上述快取記憶體中的步驟包括根據上述樣態且沿著上 =方向’自上述第二記憶體區塊將上述快取線 快取記憶體中。 &amp; 4· #中專利㈣第87項所述之資料預取方法, =中上,樣態包括具有上述第一記憶體區塊之複數快取線 =順序,其中當自上述第—記憶體區塊延續至上述第二 =思體區塊時’為了決定上述樣態是否預測到上述第二記 =體區塊之上述新的記憶體存取要求所相關之上述快取線 述土記憶體區塊中’上述預取單元係用以在根據上 。取線的&quot;順序自上述第—記憶體區塊延續至上述第- 〇6〇8-A43〇67TW/finaI ^ 〜戸、土工您弗一 86 201135460 ; 記憶體區塊時,決定上述樣態是否預測上述第二記憶體區 塊之上述新的記憶體存取要求所相關之上述快取線在上述 第二記憶體區塊中。 95. 如申請專利範圍第87項所述之資料預取方法, 更包括暫缓根據上述樣態從上述第二記憶體區塊上述快取 線預取至上述快取記憶體中,直到當自上述第一記憶體區 塊延續至上述第二記憶體區塊時,決定上述樣態是否預測 到在上述新的記憶體存取要求之後有上述第二記憶體區塊 之至少一既定數量之記憶體存取要求的每一者所相關之一 快取線。 96. 如申請專利範圍第87項所述之資料預取方法, 其中上述後續的記憶體存取要求之既定數量為2。 97. 
如申請專利範圍第87項所述之資料預取方法, 更包括: 維持由複數項目所構之之一項目表,其中上述項目表 之每一項目包括第一、第二以及第三攔位,其中上述第二 欄位保持一最近存取之記憶體區塊之虛擬位址的代表值, 其中上述第一欄位保持在一方向與上述最近存取之記憶體 區塊虛擬相鄰之一記憶體區塊之虛擬位址的代表值,其中 上述第三欄位保持在另一方向與上述最近存取之記憶體區 塊虛擬相鄰之一記憶體區塊之虛擬位址的代表值。 98. 如申請專利範圍第97項所述之資料預取方法, 其中決定上述第一記憶體區塊是否虛擬相鄰於上述第二記 憶體區塊的步驟,更包括: 決定上述第二記憶體區塊之虛擬位址之代表值是否匹 0608-A43067TW/final ' 87 201135460 配於上述項目表之項目之一者的上述第一攔位或者第三攔 位;以及 決定在所匹配之上述項目之上述第二攔位是否匹配於 上述第一記憶體區塊之虛擬位址之代表值。 99. 如申請專利範圍第97項所述之資料預取方法, 其中維持上述項目表的步驟,更包括: 以先進先出的方式’將上述項目推進上述項目表中, 以便回應上述微處理器之一載入/儲存單元所產生之記憶 體存取要求。 100. 如申請專利範圍第97項所述之資料預取方法, 其中上述記憶體區塊之上述虛擬位址的代表值包括上述記 憶體區塊之虛擬位址之一雜湊之位元。 101. 如申請專利範圍第100項所述之資料預取方 法,其中上述記憶體區塊之虛擬位址之上述雜湊之位元係 根據下列演算法則之一雜湊’其中hash[j]表示第j個雜湊 之位元’以及VA[k]表示第k個上述記憶體區塊之虛擬位 址的位元: hash[5]=VA[29]AVA[18]AVA[17]; hash[4]=VA[28]AVA[19]AVA[16]; hash[3]=VA[27]AVA[20]AVA[15]; hash[2]=VA[26]AVA[21]AVA[14]; hash[l]=VA[25]AVA[22]AVA[13]; hash[0]=VA[24]AVA[23;TVA[12]。 102. 一種電腦程式產品,編碼於至少一電腦可讀取 0608-A43067TW/final 88 201135460 媒體之上,並且適用於一計算裝置,上述電腦程式產品包 括: 電腦可讀程式編碼,儲存於上述電腦可讀取媒體, 用以疋義一微處理器,上述電腦可讀程式包括: 第轾式碼,用以定義一快取記憶體裝置;以及 一第二程式碼,用以定義一預取裝置,使得上述預取 裝置用以: 偵測具有一第一記憶體區塊之存取之一樣態,並 且根據上述樣態從上述第一記憶體區塊預取進入快取 線; i視一第二記憶體區塊之一新的存取要求; 決定上述第一記憶體區塊係虛擬鄰近至上述第二 記憶體區塊以及上述樣態,當持續自上述第一記憶體 區塊至上述第二記憶體區塊,預測至與上述具有上述 第二記憶體區塊之新的要求相關之一快取線之一存 取;以及 根據上述樣響應地從上述第二記憶體區塊預取 進入上述快取記憶體之快取線。 0608-A43067TW/fmal 89201135460; VII. Patent application scope: 1.  
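Claims 85 and 101 above specify the virtual-address hash as six XOR equations (the stray "A" characters in the OCR'd text are the XOR operator "^"). The following is a minimal illustrative sketch of that hash, not part of the claimed apparatus; the function name `block_hash` and the plain-integer encoding of the virtual address are assumptions for illustration only:

```python
def block_hash(va: int) -> int:
    """Hash a memory block's virtual address down to 6 bits.

    Implements hash[5]=VA[29]^VA[18]^VA[17] ... hash[0]=VA[24]^VA[23]^VA[12]
    from claims 85/101, where VA[k] is bit k of the block's virtual address.
    """
    bit = lambda k: (va >> k) & 1
    h = [
        bit(24) ^ bit(23) ^ bit(12),  # hash[0]
        bit(25) ^ bit(22) ^ bit(13),  # hash[1]
        bit(26) ^ bit(21) ^ bit(14),  # hash[2]
        bit(27) ^ bit(20) ^ bit(15),  # hash[3]
        bit(28) ^ bit(19) ^ bit(16),  # hash[4]
        bit(29) ^ bit(18) ^ bit(17),  # hash[5]
    ]
    return sum(b << i for i, b in enumerate(h))
```

Such a hash lets the prefetch unit of claims 81/97 compare memory-block virtual addresses in its table using only six stored bits per field instead of the full address, at the cost of occasional false matches between blocks that hash alike.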
A prefetching unit is provided in a microprocessor having a cache memory, comprising: wherein the prefetching unit is configured to receive a plurality of access requests for a plurality of addresses of a memory block, each of which is stored Retrieving one of the addresses corresponding to the memory block, and the address of the access request is non-monotonically increasing or decreasing with time; a storage device; and a control logic And the storage device is coupled to the storage device, wherein when the access request is received, the control logic is configured to: maintain one of the access addresses and a minimum address of the access request in the storage device, and the maximum bit a count value of the change of the address and the minimum address; maintaining a history record of the most recently accessed cache line in the memory block, the most recently accessed cache line being associated with the address of the access request Determining an access direction according to the above count value; determining an access mode according to the historical record; and according to the access mode and along the access direction, Not cache above said history recording instruction prefetching in vivo to the above-described memory block to the cache line has been accessed. 2.  The prefetching unit of claim 1, wherein the control logic is further configured to suspend the prefetching action before the number of recently accessed cache lines in the memory block is greater than a predetermined value. . 3.  The prefetching unit described in claim 2, wherein the upper 0608-A43067TW/fmal 59 201135460 has a predetermined value of at least 9. 4.  The prefetching unit of claim 2, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block. 5.  
The prefetching unit of claim 1, wherein the control logic is configured to: change the count value of the change of the maximum address and the minimum address when the access direction is determined according to the count value. When the difference between the count values is greater than a predetermined value, the above accessor is determined. Toward the system; when the difference between the count value of the change of the minimum address and the count value of the change of the maximum address is greater than the predetermined value, it is determined that the access direction is downward. 6.  The prefetching unit of claim 1, wherein the control logic is further configured to use an absolute value of a difference between a count value of the change of the maximum address and a count value of the change of the minimum address. The above prefetch action is suspended until a predetermined value. 7.  The prefetching unit of claim 1, wherein: the history record comprises a one-dimensional mask, wherein the bit mask is used to indicate the most recently accessed cache line, and the above-mentioned recently accessed The cache line is associated with the address of the memory block; when receiving the access request, the control logic is further configured to: calculate the most recently accessed cache line in the bit mask An intermediate indicator register; and the above-mentioned bit mask on the right side of the above-mentioned intermediate indicator register, the 0608-A43067TW/finaI 60 201135460 N bit of the above-mentioned bit mask on the left side of the intermediate indicator register When the N bit matches, for each of a plurality of different bit bit periods, the count value of one of the matching counters associated with the bit period is increased, where N is in the bit period The number of bits. 8.  
The prefetching unit of claim 1, wherein the control logic is configured to: detect the matching associated with one of the bit periods, in order to determine the access mode according to the bit mask; Whether the difference between the counter and the matching counter associated with the other of the bit periods is greater than a predetermined value; and determining the N-bit of one of the intermediate indicator registers masked by the bit The specified access mode, wherein N is the number of bits of one of the bit periods, and the correlation matching counter of the one of the bit periods has a correlation match with the other of the bit periods The difference between the counters is greater than the above predetermined value. 9.  The prefetching unit of claim 8, wherein the memory block is marked by the bit mask as being not recently stored according to the access mode and along the access direction. Taking the cache line prefetched into the cache memory, the control logic is configured to: along the access direction, assign a search indicator and the access pattern from the N indicator of the intermediate indicator; And when the bit in the access mode on the search indicator indicates an access, prefetch the cache line associated with the bit in the bit mask on the search indicator. 0608-A43067TW/final 61 201135460 1〇. For example, according to the above-mentioned access mode and along the access direction, the pre-acquisition according to the ninth patent scope of the Shenqing patent is marked as being not recently accessed by the above-mentioned bit mask; In the cache memory, the above-mentioned control system: and, according to the access direction, increase/decrease the value of the search indicator; when: add/reduce the above search indexer === prefetch the above The cache line associated with the above-mentioned bits in the above search = ^疋 mask has been increased/decreased. 
, The pre-fetching unit described in the first paragraph of the application of the patent, the above-mentioned control logic is further used to: repeat the above-mentioned increase of the value of the search indicator and the work until the dog condition occurs, wherein the above situation The method includes: a prefetching movement, when the access direction is upward, a bit in the bit mask on the search indicator and a bit in the bit 70 of the bit address associated with the maximum address The distance between the elements is greater than a second predetermined value; and when the access direction is downward, the bit in the bit mask on the search indicator is related to the bit address The distance between the bits in the 7G mask is greater than the second predetermined value. 12. The prefetching unit of claim 7, wherein the control logic is further configured to use the matching counter associated with one of the different bit periods to match the other of the different bit periods. Before the difference between the counters is greater than a predetermined value, the above pre-0608-A43067TW/final · ^ 201135460 action is suspended. 13.  The prefetching unit of claim 1, wherein the above bit period is 3, 4, and 5 bits. 14.  The prefetching unit of claim 1, wherein the control logic discards the cache line when the cache line has appeared in any of the cache memories of the microprocessor. 15.  The prefetching unit of claim 1, wherein the size of the memory block is 4 kilobytes. 16.  The prefetching unit of claim 1, further comprising: a plurality of the storage devices; wherein the control logic is configured to receive an access request, and the address of the access request is not in the storage device One of the new memory blocks is associated, and one of the above storage devices is assigned to the new memory block. 17.  
The prefetching unit of claim 16, wherein the control logic is further configured to clear a count value of the change of the maximum address, a count value of the minimum address change, and one of the storage devices being dispatched. The above history. 18.  A data prefetching method for prefetching data to a cache memory of a microprocessor, the data prefetching method comprising: receiving a plurality of access requests for a plurality of addresses of a memory block, each The access request corresponds to one of the addresses of the memory block, and the address of the access request is non-monotonically increased or decreased with time; when the above storage is received When required, the 0608-A43067TW/final 63 201135460 a maximum and a minimum address in the memory block are maintained, and the count value of the change of the maximum and minimum addresses is calculated; when the access request is received, Maintaining a history record of the most recently accessed cache line in the memory block, wherein the most recently accessed cache line is associated with the address of the access request; determining an access direction according to the count value; Determining an access pattern according to the historical record; and, according to the access mode and along the access direction, indicating that the cache memory has not been indicated by the history record as The access cache line is prefetched into the above memory block. 19.  The data prefetching method as described in claim 18, further comprising suspending the prefetching action before the number of recently accessed cache lines in the memory block is greater than a predetermined value. 20.  For example, the data pre-fetching method described in claim 19, wherein the predetermined value is at least 9. twenty one.  The method for prefetching data according to claim 19, wherein the predetermined value is at least ten percent of the number of cache lines in the memory block. twenty two.  
The data prefetching method of claim 18, wherein determining the access direction according to the foregoing counting value further comprises: calculating a change of the count value of the change of the maximum address and the minimum address When the difference between the values is greater than a predetermined value, the access direction is determined to be upward; and the count value of the change of the minimum address and the change of the maximum address is 0608-A43067TW/fmal 64 201135460 The direction between the i and the direction is downward. When the value is greater than the above-mentioned predetermined value, it is determined that the above-mentioned storage patents include the data pre-fetching method described in the above-mentioned maximum (four) brothers, and the count value of the change of the changed count value and the minimum bit The above-mentioned prefetch action is slowed down. Before the absolute value of Jun, greater than - the established value, temporarily one of the patent scope - the data prefetch method, the above historical record _ out of the above-mentioned recently accessed position 70 mask, the above-mentioned bit mask for the finger line And the cached block 1 is the most recently accessed cache. When the access is accessed, the method further includes: ten calculating the above-mentioned top line in the bit mask: the intermediate indicator register And the access speed of the ν2Ί&quot;4 intermediate indicator register on the left side of the above-mentioned bit mask 兀 /, the upper speed intermediate indicator register right:, the Ν bit match, the plural For each of the different only = 兀 mask bitpe - increase the above-mentioned bit U (d_Ct matches the count value of the counter, one of the relevant numbers in 1), and ?N is the upper half of the upper light element period. · The information mentioned in item 24 of the patent application, in order to determine the above access samples according to the above-mentioned bits. 
And determining, by the other one of the above bit periods related to the above-mentioned bit period, that the matching count is greater than a predetermined value; The difference of 3 is 0608-A43067TW/flnal 65 201135460. The above-mentioned access mode specified by one of the N-bits of the intermediate indicator register masked by the above-mentioned bit mask is determined, where N is the above-mentioned bit period. a number of a bit in the middle, wherein the difference between the correlation matching counter of the one of the bit periods and the correlation matching counter of the other of the bit periods is greater than the predetermined value The difference between all other clear bit periods. 26.  The method for prefetching data according to claim 25, wherein, in order to according to the access mode and along the access direction, the memory block is marked by the bit mask as not being recently The access cache line is prefetched into the cache memory, and the control logic is configured to: assign a search indicator along the access direction and the access mode from the N indicator of the intermediate indicator And when the bit in the access mode on the search indicator indicates an access, prefetch the cache line associated with the bit in the bit mask on the search indicator. 27.  The data prefetching method of claim 26, wherein the memory block is marked by the bit mask as being recently not yet used according to the access mode and along the access direction. The cache line of the access is pre-fetched into the cache memory, and further includes: increasing/decreasing a value of the search indicator according to the access direction; and the foregoing storing on the search indicator after increasing/decreasing The bit in the sampled state indicates an access line prefetching the cache line associated with the bit in the bit mask on the search indicator that has been incremented/decreased. 28.  
For example, the data pre-fetching method described in claim 27, 0608-A43067TW/fmal 66 201135460 further includes: ,, the value of the finder and the pre-fetching condition, wherein the above conditions include: When the check direction is upward, the distance between the bit 70 in the mask above the search index and the bit located in the mask is greater than the distance between the bits in the mask; And when the access direction is downward, the distance between the bit of the mask of the vertical mask and the bit of the third mask of the minimum address is greater than the above The second is depreciating. Hey. For example, in the scope of claim 24, the difference between the above-mentioned matching counters related to the other matching period 兀 period associated with the method of the above-mentioned different bit periods (four) is greater than -岐Previously, the above prefetch action was suspended. . 3. In the data pre-fetching method described in item 18 of the scope of application, the above-mentioned bit period is 3, 4 and 5 bits. As described in item 18 of the patent application scope, the cache line has appeared in the above-mentioned microprocessor - cached the hidden body %, and abandoned the pre-fetching of the above cache line. 32.  For example, the data mentioned in the application for patent scope 预18 is pre-existing. The size of the above memory block is 4 kilobytes. ,, 33.  τ kinds of computer program products, compiled and sold on at least _ computer readable media' and applicable to - computing devices, the above computer program products include.   One:: readable program code? 
The computer readable medium 201135460 is configured to define a prefetch unit in a microprocessor having a cache memory, wherein the computer readable program comprises: wherein the prefetch unit is And a plurality of access requests for receiving a plurality of addresses of a memory block, each access request corresponding to one of the addresses of the memory block, and the address of the access request is over time The function is non-monotonically increased or decreased; a first code for defining a storage device; and a second code for defining a control logic coupled to the storage device When the access request is received, the control logic is configured to: maintain the one of the access addresses and a minimum address maintained by the storage device, and calculate the change of the maximum and minimum addresses Counting a value; maintaining a history record of the most recently accessed cache line in the memory block by using the memory block; determining an access direction according to the count value; The historical record determines an access pattern; and pre-fetching the cache line in the cache memory that has not been indicated by the history record as being accessed according to the access mode and along the access direction In the memory block. 34.  The computer program product of claim 33, wherein the at least one computer readable medium is selected from a disc, tape or other magnetic, optical or electronic storage medium and a network, line, wireless or Other communication media. 35.  
A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: receiving an access request to an address within a memory block; setting a bit associated with a cache line in a bit mask, wherein the cache line is implicated by the address within the memory block; after receiving the access request, incrementing a count of a total access counter; when the address is greater than the value of a maximum pointer register, updating the maximum pointer register with the address and incrementing a count of a maximum-change counter; when the address is less than the value of a minimum pointer register, updating the minimum pointer register with the address and incrementing a count of a minimum-change counter; computing a middle pointer register as the average of the maximum and minimum pointer registers; for each of a plurality of distinct bit periods, when the N bits of the bit mask to the left of the middle pointer register match the N bits of the bit mask to the right of the middle pointer register, incrementing a count of a match counter associated with the bit period, where N is the number of bits in the bit period; determining whether a condition exists, wherein the condition includes: (A) the total access counter is greater than a first predetermined value; (B) the absolute value of the difference between the maximum-change counter and the minimum-change counter is greater than a second predetermined value; and (C) the absolute value of the difference between the count of one of the match counters and the count of each of the others is greater than a third predetermined value; and, when the condition exists: determining that the access direction is upward when the maximum-change counter is greater than the minimum-change counter, and that the access direction is downward when the maximum-change counter is less than the minimum-change counter; determining the access pattern to be the N bits of the bit mask to one side of the middle pointer register, where N is the number of bits of the bit period associated with the match counter having the greatest count; and prefetching, according to the determined access direction and access pattern, cache lines of the memory block into the cache memory.
36. The data prefetching method of claim 35, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern comprises: (1) initializing, in the access direction, a search pointer and the access pattern at an offset of N from the middle pointer; (2) determining whether a second condition exists, wherein the second condition includes: (D) the bit of the access pattern at the search pointer is set; (E) the bit of the bit mask at the search pointer is clear; and (F) in the access direction, the difference between the maximum/minimum pointer and the bit of the bit mask at the search pointer is less than a fourth predetermined value; and (3) when the second condition exists, prefetching the cache line associated with the bit of the bit mask at the search pointer.
37. The data prefetching method of claim 36, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: after determining whether the second condition exists, incrementing/decrementing the value of the search pointer according to the access direction; and repeating steps (2) and (3).
38. The data prefetching method of claim 37, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: stopping the repeating when condition (F) is false.
39. The data prefetching method of claim 37, wherein the step of prefetching the cache lines into the cache memory according to the determined access direction and access pattern further comprises: stopping the repeating when all the bits of the bit mask have been tested.
40. A microprocessor comprising: a plurality of cores; a cache memory, shared by the cores, for receiving a plurality of access requests to addresses within a memory block, each access request addressing one of the addresses of the memory block, wherein the access request addresses are non-monotonically increasing or decreasing as a function of time; and a prefetch unit configured to: monitor the access requests and maintain a largest address and a smallest address within the memory block, together with counts of changes to the largest and smallest addresses; determine an access direction according to the counts; and prefetch into the cache memory, in the access direction, cache lines of the memory block that are missing from the cache memory.
41.
The microprocessor of claim 40, wherein the prefetch unit is further configured to: maintain a history of recently accessed cache lines within the memory block, the recently accessed cache lines being implicated by the access request addresses; determine an access pattern according to the history; and prefetch into the cache memory, according to the access pattern and in the access direction, cache lines of the memory block which the history indicates have not been recently accessed and which are missing from the cache memory.
42. A microprocessor comprising: a first-level cache memory; a second-level cache memory; and a prefetch unit configured to: detect a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetch a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receive from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determine one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and cause the one or more cache lines to be prefetched into the first-level cache memory.
43. The microprocessor of claim 42, wherein: to detect the direction and the pattern of the recent access requests presented to the second-level cache memory, the prefetch unit is configured to detect the direction and the pattern within a memory block, the memory block being a small subset of the memory range accessible by the microprocessor; and, to determine the one or more cache lines indicated by the pattern beyond the associated cache line in the direction, the prefetch unit is configured to: locate the pattern within the memory block such that the address falls within the pattern; and search in the direction, starting at the address, until a cache line indicated by the pattern is encountered.
44. The microprocessor of claim 43, wherein: the pattern comprises a sequence of cache lines; and, to locate the pattern within the memory block such that the address falls within the pattern, the prefetch unit is configured to shift the pattern through the memory block by the sequence.
45. The microprocessor of claim 43, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory are non-monotonically increasing or decreasing as a function of time.
46. The microprocessor of claim 45, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory may be non-sequential.
47. The microprocessor of claim 42, further comprising: a plurality of cores; wherein the second-level cache memory and the prefetch unit are shared by the cores; and wherein each of the cores comprises a distinct instantiation of the first-level cache memory.
48. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit is configured to provide the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory is configured to request the one or more cache lines from the second-level cache memory.
49. The microprocessor of claim 48, wherein the first-level cache memory includes a queue for storing the addresses received from the prefetch unit.
50. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit requests the one or more cache lines from a bus interface unit of the microprocessor and then provides the requested cache lines to the first-level cache memory.
51. The microprocessor of claim 42, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit requests the one or more cache lines from the second-level cache memory.
52. The microprocessor of claim 51, wherein the prefetch unit then provides the requested cache lines to the first-level cache memory.
53. The microprocessor of claim 51, wherein the second-level cache memory then provides the requested cache lines to the first-level cache memory.
54. The microprocessor of claim 42, wherein, to detect the direction and the pattern, the prefetch unit is configured to: maintain, as the recent access requests are received, a largest address and a smallest address within a memory block, together with counts of changes to the largest and smallest addresses; maintain, as the recent access requests are received, a history of recently accessed cache lines implicated by the access addresses within the memory block; determine the direction according to the counts; and determine the pattern according to the history.
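As an informal illustration (not part of the patent text), the bookkeeping recited in claims 54–55 — tracking the largest and smallest access addresses within a block, counting changes to each, and deriving the predominant direction from the counts — can be sketched as follows. All names and the threshold value here are illustrative assumptions, not terminology from the claims:

```python
class BoundingBoxTracker:
    """Sketch of the min/max address tracking described in claims 54-55."""

    def __init__(self):
        self.max_addr = None   # largest address seen in the block
        self.min_addr = None   # smallest address seen in the block
        self.max_changes = 0   # count of changes to the largest address
        self.min_changes = 0   # count of changes to the smallest address

    def access(self, addr):
        """Record one access address within the memory block."""
        if self.max_addr is None:
            self.max_addr = self.min_addr = addr
            return
        if addr > self.max_addr:
            self.max_addr = addr
            self.max_changes += 1
        elif addr < self.min_addr:
            self.min_addr = addr
            self.min_changes += 1

    def direction(self, threshold):
        """Claim 55: direction is up/down when one change count exceeds
        the other by more than a predetermined value; else undecided."""
        if self.max_changes - self.min_changes > threshold:
            return "up"
        if self.min_changes - self.max_changes > threshold:
            return "down"
        return None
```

For example, the non-monotonic access sequence 10, 12, 11, 14, 16 advances the maximum pointer three times and the minimum pointer never, so the predominant direction is upward even though the raw addresses are not strictly increasing.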
55. The microprocessor of claim 54, wherein determining the direction according to the counts comprises: when the difference between the count of changes to the largest address and the count of changes to the smallest address is greater than a predetermined value, determining that the direction is upward; and when the difference between the count of changes to the smallest address and the count of changes to the largest address is greater than the predetermined value, determining that the direction is downward.
56. The microprocessor of claim 54, wherein the history comprises a bit mask whose bits indicate the recently accessed cache lines of the memory block implicated by the access request addresses, and wherein, to determine the pattern, the prefetch unit is configured to: compute a middle pointer register of the bit mask; and, for each of a plurality of distinct bit periods, when the N bits of the bit mask to the left of the middle pointer register match the N bits of the bit mask to the right of the middle pointer register, increment a count of a distinct match counter associated with the bit period, where N is the number of bits in the bit period.
57. The microprocessor of claim 56, wherein, to determine the pattern based on the bit mask, the prefetch unit is further configured to: determine whether the absolute value of the difference between the count of the match counter associated with one of the bit periods and the count of the match counter associated with each of the other bit periods is greater than a predetermined value; and, when so, determine the pattern to be the N bits of the bit mask to one side of the middle pointer register, where N is the number of bits in the one bit period whose associated match counter differs from the match counters of the other bit periods by more than the predetermined value.
58. A data prefetching method for prefetching data into a first-level cache memory of a microprocessor having a second-level cache memory, the data prefetching method comprising: detecting a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receiving from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determining one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and causing the one or more cache lines to be prefetched into the first-level cache memory.
59.
The data prefetching method of claim 58, wherein: detecting the direction and the pattern of the recent access requests presented to the second-level cache memory comprises detecting the direction and the pattern within a memory block, the memory block being a small subset of the memory range accessible by the microprocessor; and determining the one or more cache lines indicated by the pattern beyond the associated cache line in the direction comprises: locating the pattern within the memory block such that the address falls within the pattern; and searching in the direction, starting at the address, until a cache line indicated by the pattern is encountered.
60. The data prefetching method of claim 59, wherein the pattern comprises a sequence of cache lines, and wherein locating the pattern within the memory block such that the address falls within the pattern comprises shifting the pattern through the memory block by the sequence.
61. The data prefetching method of claim 59, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory are non-monotonically increasing or decreasing as a function of time.
62. The data prefetching method of claim 61, wherein the addresses of the recent access requests to the memory block presented to the second-level cache memory may be non-sequential.
63. The data prefetching method of claim 58, wherein the microprocessor further comprises a plurality of cores, wherein the second-level cache memory and the prefetch unit are shared by the cores, and wherein each of the cores comprises a distinct instantiation of the first-level cache memory.
64. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises providing, by a prefetch unit of the microprocessor, the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory requests the one or more cache lines from the second-level cache memory.
65. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises requesting, by a prefetch unit of the microprocessor, the one or more cache lines from a bus interface unit of the microprocessor and then providing the requested cache lines to the first-level cache memory.
66. The data prefetching method of claim 58, wherein causing the one or more cache lines to be prefetched into the first-level cache memory comprises requesting, by the prefetch unit, the one or more cache lines from the second-level cache memory.
67. The data prefetching method of claim 66, wherein causing the one or more cache lines to be prefetched into the first-level cache memory further comprises providing, by the prefetch unit, the requested cache lines to the first-level cache memory.
68. The data prefetching method of claim 66, further comprising providing, by the second-level cache memory, the requested cache lines to the first-level cache memory.
69. A computer program product encoded on at least one computer readable medium for use with a computing device, the computer program product comprising: computer readable program code embodied in the medium for defining a microprocessor, the computer readable program code comprising: a first code for defining a first-level cache memory; a second code for defining a second-level cache memory; and a third code for defining a prefetch unit, wherein the prefetch unit is configured to: detect a direction and a pattern of recent access requests presented to the second-level cache memory, and prefetch a plurality of cache lines into the second-level cache memory according to the direction and the pattern; receive from the first-level cache memory an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line; determine one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and cause the one or more cache lines to be prefetched into the first-level cache memory.
70. The computer program product of claim 69, wherein the at least one computer readable medium is selected from a disk, tape or other magnetic, optical or electronic storage medium and a network, wired, wireless or other communications medium.
71.
A microprocessor comprising: a cache memory; and a prefetch unit configured to: detect a pattern of memory access requests to a first memory block and prefetch cache lines from the first memory block into the cache memory according to the pattern; monitor a new memory access request to a second memory block; determine whether the first memory block is virtually adjacent to the second memory block and, when the pattern is continued from the first memory block into the second memory block, whether the pattern predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block; and responsively prefetch cache lines from the second memory block into the cache memory according to the pattern.
72. The microprocessor of claim 71, wherein the size of the first and second memory blocks corresponds to the size of a physical memory page.
73. The microprocessor of claim 71, wherein the microprocessor comprises a second-level cache memory, and wherein the new memory access request comprises a request from a first-level cache memory of the microprocessor to the second-level cache memory to allocate the cache line of the second memory block.
74. The microprocessor of claim 71, wherein: to detect the pattern of the memory access requests to the first memory block, the prefetch unit is configured to detect a direction of the memory access requests; and, to determine whether the first memory block is virtually adjacent to the second memory block, the prefetch unit is configured to determine whether the first memory block is virtually adjacent to the second memory block in the direction.
75. The microprocessor of claim 74, wherein the addresses of the memory access requests to the first memory block are non-monotonically increasing or decreasing as a function of time.
76. The microprocessor of claim 74, wherein, to determine whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new memory access request to the second memory block, the prefetch unit is configured to determine whether the pattern, when continued in the direction from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
77. The microprocessor of claim 74, wherein, to prefetch cache lines from the second memory block into the cache memory according to the pattern, the prefetch unit is configured to prefetch cache lines from the second memory block into the cache memory according to the pattern and in the direction.
78. The microprocessor of claim 71, wherein the pattern comprises an ordering of a plurality of cache lines of the first memory block, and wherein, to determine whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block, the prefetch unit is configured to determine, according to the ordering of the cache lines, whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
79. The microprocessor of claim 71, wherein the prefetch unit is further configured to wait to prefetch cache lines from the second memory block into the cache memory according to the pattern until determining that the pattern, when continued from the first memory block into the second memory block, predicts the cache lines implicated by at least a predetermined number of subsequent memory access requests to the second memory block.
80. The microprocessor of claim 71, wherein the predetermined number of subsequent memory access requests is two.
81. The microprocessor of claim 71, wherein the prefetch unit is further configured to: maintain a list of entries, wherein each entry of the list includes first, second and third fields, wherein the second field holds a representative value of the virtual address of a recently accessed memory block, wherein the first field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in one direction, and wherein the third field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in the other direction.
82. The microprocessor of claim 81, wherein, to determine whether the first memory block is virtually adjacent to the second memory block, the prefetch unit is configured to: determine whether the representative value of the virtual address of the second memory block matches the first or third field of one of the entries of the list; and determine whether the second field of the matched entry matches the representative value of the virtual address of the first memory block.
83.
The microprocessor of claim 81, wherein, to maintain the list, the prefetch unit is configured to push entries into the list in a first-in-first-out manner in response to memory access requests generated by a load/store unit of the microprocessor.
84. The microprocessor of claim 81, wherein the representative value of the virtual address of a memory block comprises hashed bits of the virtual address of the memory block.
85. The microprocessor of claim 84, wherein the hashed bits of the virtual address of the memory block are hashed according to the following algorithm, wherein hash[j] denotes the j-th hashed bit and VA[k] denotes bit k of the virtual address of the memory block:
hash[5]=VA[29]^VA[18]^VA[17];
hash[4]=VA[28]^VA[19]^VA[16];
hash[3]=VA[27]^VA[20]^VA[15];
hash[2]=VA[26]^VA[21]^VA[14];
hash[1]=VA[25]^VA[22]^VA[13];
hash[0]=VA[24]^VA[23]^VA[12].
86. The microprocessor of claim 71, further comprising a plurality of cores, wherein the cache memory and the prefetch unit are shared by the cores.
87. A data prefetching method for prefetching data into a cache memory of a microprocessor, the data prefetching method comprising: detecting a pattern of memory access requests to a first memory block, and prefetching cache lines from the first memory block into the cache memory according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block and, when the pattern is continued from the first memory block into the second memory block, whether the pattern predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block; and prefetching cache lines from the second memory block into the cache memory according to the pattern, in response to the determining step.
88. The data prefetching method of claim 87, wherein the size of the first and second memory blocks corresponds to the size of a physical memory page.
89. The data prefetching method of claim 87, wherein the microprocessor comprises a second-level cache memory, and wherein the new memory access request comprises a request from a first-level cache memory of the microprocessor to the second-level cache memory to allocate the cache line of the second memory block.
90. The data prefetching method of claim 87, wherein detecting the pattern of the memory access requests to the first memory block further comprises detecting a direction of the memory access requests, and wherein determining whether the first memory block is virtually adjacent to the second memory block further comprises determining whether the first memory block is virtually adjacent to the second memory block in the direction.
91. The data prefetching method of claim 90, wherein the addresses of the memory access requests to the first memory block are non-monotonically increasing or decreasing as a function of time.
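For illustration only, the virtual-address hash recited in claim 85 (and again in claim 101) maps each memory-block virtual address to a 6-bit tag by XORing three address bits per tag bit; the `^` operator below is the XOR that the claims denote. The function name is an assumption, but the bit positions come directly from the claimed algorithm:

```python
def hash_block_va(va: int) -> int:
    """Compute the 6-bit hashed representative value of a memory-block
    virtual address, per the bit equations in claims 85/101:
    hash[j] = VA[a] ^ VA[b] ^ VA[c] for the listed (a, b, c) triples."""
    def bit(k: int) -> int:
        return (va >> k) & 1

    # (a, b, c) per hash bit, index 0 .. 5 as given in the claims.
    triples = [
        (24, 23, 12),  # hash[0]
        (25, 22, 13),  # hash[1]
        (26, 21, 14),  # hash[2]
        (27, 20, 15),  # hash[3]
        (28, 19, 16),  # hash[4]
        (29, 18, 17),  # hash[5]
    ]
    h = 0
    for j, (a, b, c) in enumerate(triples):
        h |= (bit(a) ^ bit(b) ^ bit(c)) << j
    return h
```

Because only address bits 12 through 29 participate, the hash is invariant to the page offset (bits 0 to 11), which is consistent with the representative value identifying a memory block rather than a byte address.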
The data prefetching method of claim 90, wherein determining whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new memory access request to the second memory block comprises: determining whether the pattern, when continued in the direction from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
93. The data prefetching method of claim 90, wherein prefetching cache lines from the second memory block into the cache memory according to the pattern comprises prefetching cache lines from the second memory block into the cache memory according to the pattern and in the direction.
94. The data prefetching method of claim 87, wherein the pattern comprises an ordering of a plurality of cache lines of the first memory block, and wherein determining whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request to the second memory block lies within the second memory block comprises determining, according to the ordering of the cache lines, whether the pattern, when continued from the first memory block into the second memory block, predicts that the cache line implicated by the new memory access request lies within the second memory block.
95. The data prefetching method of claim 87, further comprising waiting to prefetch cache lines from the second memory block into the cache memory according to the pattern until determining that the pattern, when continued from the first memory block into the second memory block, predicts the cache lines implicated by at least a predetermined number of subsequent memory access requests to the second memory block.
96. The data prefetching method of claim 87, wherein the predetermined number of subsequent memory access requests is two.
97. The data prefetching method of claim 87, further comprising: maintaining a list of entries, wherein each entry of the list includes first, second and third fields, wherein the second field holds a representative value of the virtual address of a recently accessed memory block, wherein the first field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in one direction, and wherein the third field holds a representative value of the virtual address of the memory block virtually adjacent to the recently accessed memory block in the other direction.
98. The data prefetching method of claim 97, wherein determining whether the first memory block is virtually adjacent to the second memory block further comprises: determining whether the representative value of the virtual address of the second memory block matches the first or third field of one of the entries of the list; and determining whether the second field of the matched entry matches the representative value of the virtual address of the first memory block.
99. The data prefetching method of claim 97, wherein maintaining the list further comprises pushing entries into the list in a first-in-first-out manner in response to memory access requests generated by a load/store unit of the microprocessor.
100. The data prefetching method of claim 97, wherein the representative value of the virtual address of a memory block comprises hashed bits of the virtual address of the memory block.
101. The data prefetching method of claim 100, wherein the hashed bits of the virtual address of the memory block are hashed according to the following algorithm, wherein hash[j] denotes the j-th hashed bit and VA[k] denotes bit k of the virtual address of the memory block:
hash[5]=VA[29]^VA[18]^VA[17];
hash[4]=VA[28]^VA[19]^VA[16];
hash[3]=VA[27]^VA[20]^VA[15];
hash[2]=VA[26]^VA[21]^VA[14];
hash[1]=VA[25]^VA[22]^VA[13];
hash[0]=VA[24]^VA[23]^VA[12].
102.
A computer program product encoded on at least one computer readable medium for use with a computing device, the computer program product comprising: computer readable program code embodied in the medium, the computer readable program code comprising: a first code for defining a cache memory; and a second code for defining a prefetch unit, wherein the prefetch unit is configured to: detect a pattern of memory access requests to a first memory block and prefetch cache lines from the first memory block according to the pattern; monitor a new access request to a second memory block; determine whether the first memory block is virtually adjacent to the second memory block and whether the pattern, when continued from the first memory block into the second memory block, predicts the cache line implicated by the new access request to the second memory block; and responsively prefetch cache lines from the second memory block into the cache memory.
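As an informal reading of the pattern-search loop recited in claims 36–38 (this is not the patent's implementation, and the function name, argument names, and fence semantics are illustrative assumptions): starting at the middle pointer, the search walks through the block in the predominant direction, replicating the N-bit pattern, and selects cache lines whose pattern bit is set (condition D) but whose block-mask bit is clear (condition E), stopping once the search pointer has run a predetermined distance past the max/min pointer (condition F):

```python
def candidate_lines(block_mask, pattern, mid, period, direction,
                    fence=None, limit=None):
    """Return cache-line indices within the block to prefetch.

    block_mask : per-cache-line bits, set if the line was recently accessed
    pattern    : the N-bit access pattern, replicated with the given period
    mid        : middle pointer (index where pattern replication is anchored)
    direction  : "up" or "down", the predominant access direction
    fence      : max/min pointer; search stops `limit` lines past it
    """
    step = 1 if direction == "up" else -1
    out = []
    ptr = mid
    while 0 <= ptr < len(block_mask):
        # Condition (F): stop once past the fence by the predetermined amount.
        if fence is not None and (ptr - fence) * step >= limit:
            break
        # Conditions (D) and (E): pattern predicts the line, mask says
        # it has not been accessed yet.
        if pattern[(ptr - mid) % period] and not block_mask[ptr]:
            out.append(ptr)
        ptr += step
    return out
```

For example, with an every-other-line pattern anchored at line 4 of an 8-line block, the upward search skips the already-accessed line 4 and selects line 6.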
TW100110731A 2010-03-29 2011-03-29 Prefetcher,method of prefetch data,computer program product and microprocessor TWI506434B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US31859410P 2010-03-29 2010-03-29
US13/033,765 US8762649B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher
US13/033,848 US8719510B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher with reduced warm-up penalty on memory block crossings
US13/033,809 US8645631B2 (en) 2010-03-29 2011-02-24 Combined L2 cache and L1D cache prefetcher

Publications (2)

Publication Number Publication Date
TW201135460A true TW201135460A (en) 2011-10-16
TWI506434B TWI506434B (en) 2015-11-01

Family

ID=44490596

Family Applications (5)

Application Number Title Priority Date Filing Date
TW100110731A TWI506434B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW103128257A TWI519955B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data and computer program product
TW104118873A TWI534621B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW104118874A TWI547803B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW105108032A TWI574155B (en) 2010-03-29 2011-03-29 Method of prefetch data, computer program product and microprocessor

Family Applications After (4)

Application Number Title Priority Date Filing Date
TW103128257A TWI519955B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data and computer program product
TW104118873A TWI534621B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW104118874A TWI547803B (en) 2010-03-29 2011-03-29 Prefetcher, method of prefetch data, computer program product and microprocessor
TW105108032A TWI574155B (en) 2010-03-29 2011-03-29 Method of prefetch data, computer program product and microprocessor

Country Status (2)

Country Link
CN (4) CN105183663B (en)
TW (5) TWI506434B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI488109B (en) * 2011-12-13 2015-06-11 Intel Corp Method and apparatus to process keccak secure hashing algorithm
TWI499975B (en) * 2011-12-07 2015-09-11 Apple Inc Next fetch predictor training with hysteresis
TWI501156B (en) * 2011-12-09 2015-09-21 Nvidia Corp Multi-channel time slice groups
TWI560547B (en) * 2014-10-20 2016-12-01 Via Tech Inc Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent
TWI596479B (en) * 2014-12-14 2017-08-21 上海兆芯集成電路有限公司 Processor with data prefetcher and method thereof
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US10324725B2 (en) 2012-12-27 2019-06-18 Nvidia Corporation Fault detection in instruction translations

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133780B (en) * 2013-05-02 2017-04-05 华为技术有限公司 A kind of cross-page forecasting method, apparatus and system
CN105653199B (en) * 2014-11-14 2018-12-14 群联电子股份有限公司 Method for reading data, memory storage apparatus and memorizer control circuit unit
US10152421B2 (en) * 2015-11-23 2018-12-11 Intel Corporation Instruction and logic for cache control operations
CN106919367B (en) * 2016-04-20 2019-05-07 上海兆芯集成电路有限公司 Detect the processor and method of modification program code
US10579522B2 (en) * 2016-09-13 2020-03-03 Andes Technology Corporation Method and device for accessing a cache memory
US10353601B2 (en) * 2016-11-28 2019-07-16 Arm Limited Data movement engine
US10725685B2 (en) 2017-01-19 2020-07-28 International Business Machines Corporation Load logical and shift guarded instruction
US10496311B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Run-time instrumentation of guarded storage event processing
US10452288B2 (en) 2017-01-19 2019-10-22 International Business Machines Corporation Identifying processor attributes based on detecting a guarded storage event
US10579377B2 (en) 2017-01-19 2020-03-03 International Business Machines Corporation Guarded storage event handling during transactional execution
US10496292B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Saving/restoring guarded storage controls in a virtualized environment
US10732858B2 (en) 2017-01-19 2020-08-04 International Business Machines Corporation Loading and storing controls regulating the operation of a guarded storage facility
CN109857786B (en) * 2018-12-19 2020-10-30 成都四方伟业软件股份有限公司 Page data filling method and device
CN111797052B (en) * 2020-07-01 2023-11-21 上海兆芯集成电路股份有限公司 System single chip and system memory acceleration access method
KR102253362B1 (en) * 2020-09-22 2021-05-20 쿠팡 주식회사 Electronic apparatus and information providing method using the same
CN112416437B (en) * 2020-12-02 2023-04-21 海光信息技术股份有限公司 Information processing method, information processing device and electronic equipment
WO2022233391A1 (en) * 2021-05-04 2022-11-10 Huawei Technologies Co., Ltd. Smart data placement on hierarchical storage
CN114116529A (en) * 2021-12-01 2022-03-01 上海兆芯集成电路有限公司 Fast loading device and data caching method

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5003471A (en) * 1988-09-01 1991-03-26 Gibson Glenn A Windowed programmable data transferring apparatus which uses a selective number of address offset registers and synchronizes memory access to buffer
SE515718C2 (en) * 1994-10-17 2001-10-01 Ericsson Telefon Ab L M Systems and methods for processing memory data and communication systems
US6484239B1 (en) * 1997-12-29 2002-11-19 Intel Corporation Prefetch queue
US6810466B2 (en) * 2001-10-23 2004-10-26 Ip-First, Llc Microprocessor and method for performing selective prefetch based on bus activity level
JP4067887B2 (en) * 2002-06-28 2008-03-26 富士通株式会社 Arithmetic processing device for performing prefetch, information processing device and control method thereof
US7310722B2 (en) * 2003-12-18 2007-12-18 Nvidia Corporation Across-thread out of order instruction dispatch in a multithreaded graphics processor
US7237065B2 (en) * 2005-05-24 2007-06-26 Texas Instruments Incorporated Configurable cache system depending on instruction type
US20070186050A1 (en) * 2006-02-03 2007-08-09 International Business Machines Corporation Self prefetching L2 cache mechanism for data lines
JP4692678B2 (en) * 2007-06-19 2011-06-01 富士通株式会社 Information processing device
US8103832B2 (en) * 2007-06-26 2012-01-24 International Business Machines Corporation Method and apparatus of prefetching streams of varying prefetch depth
CN100449481C (en) * 2007-06-29 2009-01-07 东南大学 Storage control circuit with multiple-passage instruction pre-fetching function
US8161243B1 (en) * 2007-09-28 2012-04-17 Intel Corporation Address translation caching and I/O cache performance improvement in virtualized environments
US7890702B2 (en) * 2007-11-26 2011-02-15 Advanced Micro Devices, Inc. Prefetch instruction extensions
US8140768B2 (en) * 2008-02-01 2012-03-20 International Business Machines Corporation Jump starting prefetch streams across page boundaries
JP2009230374A (en) * 2008-03-21 2009-10-08 Fujitsu Ltd Information processor, program, and instruction sequence generation method
US7958317B2 (en) * 2008-08-04 2011-06-07 International Business Machines Corporation Cache directed sequential prefetch
US8402279B2 (en) * 2008-09-09 2013-03-19 Via Technologies, Inc. Apparatus and method for updating set of limited access model specific registers in a microprocessor
US9032151B2 (en) * 2008-09-15 2015-05-12 Microsoft Technology Licensing, Llc Method and system for ensuring reliability of cache data and metadata subsequent to a reboot
CN101887360A (en) * 2009-07-10 2010-11-17 威盛电子股份有限公司 The data pre-acquisition machine of microprocessor and method
CN101667159B (en) * 2009-09-15 2012-06-27 威盛电子股份有限公司 High speed cache system and method of trb

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI499975B (en) * 2011-12-07 2015-09-11 Apple Inc Next fetch predictor training with hysteresis
TWI501156B (en) * 2011-12-09 2015-09-21 Nvidia Corp Multi-channel time slice groups
TWI488109B (en) * 2011-12-13 2015-06-11 Intel Corp Method and apparatus to process keccak secure hashing algorithm
TWI552071B (en) * 2011-12-13 2016-10-01 英特爾公司 Method and apparatus to process keccak secure hashing algorithm
US9772845B2 (en) 2011-12-13 2017-09-26 Intel Corporation Method and apparatus to process KECCAK secure hashing algorithm
US10691458B2 (en) 2011-12-13 2020-06-23 Intel Corporation Method and apparatus to process KECCAK secure hashing algorithm
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US10324725B2 (en) 2012-12-27 2019-06-18 Nvidia Corporation Fault detection in instruction translations
TWI560547B (en) * 2014-10-20 2016-12-01 Via Tech Inc Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent
TWI596479B (en) * 2014-12-14 2017-08-21 上海兆芯集成電路有限公司 Processor with data prefetcher and method thereof

Also Published As

Publication number Publication date
CN105183663B (en) 2018-11-27
TWI547803B (en) 2016-09-01
TW201624289A (en) 2016-07-01
TWI574155B (en) 2017-03-11
CN104636274B (en) 2018-01-26
TWI519955B (en) 2016-02-01
CN104615548A (en) 2015-05-13
TWI534621B (en) 2016-05-21
CN104636274A (en) 2015-05-20
CN102169429B (en) 2016-06-29
CN105183663A (en) 2015-12-23
TW201447581A (en) 2014-12-16
TW201535119A (en) 2015-09-16
CN102169429A (en) 2011-08-31
TW201535118A (en) 2015-09-16
TWI506434B (en) 2015-11-01
CN104615548B (en) 2018-08-31

Similar Documents

Publication Publication Date Title
TW201135460A (en) Prefetcher, method of prefetch data, computer program product and microprocessor
US8880807B2 (en) Bounding box prefetcher
US8645631B2 (en) Combined L2 cache and L1D cache prefetcher
US9223705B2 (en) Cache access arbitration for prefetch requests
US9015422B2 (en) Access map-pattern match based prefetch unit for a processor
US8583894B2 (en) Hybrid prefetch method and apparatus
US8719510B2 (en) Bounding box prefetcher with reduced warm-up penalty on memory block crossings
TWI307465B (en) System, method and storage medium for memory management
US20140108740A1 (en) Prefetch throttling
US9304919B2 (en) Detecting multiple stride sequences for prefetching
US20140129772A1 (en) Prefetching to a cache based on buffer fullness
US9256544B2 (en) Way preparation for accessing a cache
JP6701380B2 (en) Up/down prefetcher
US9223714B2 (en) Instruction boundary prediction for variable length instruction set
CN115964309A (en) Prefetching
US20140115257A1 (en) Prefetching using branch information from an instruction cache
US11907722B2 (en) Methods and apparatus for storing prefetch metadata
CN117897690A (en) Notifying criticality of cache policies