TW201245976A - Hardware acceleration components for translating guest instructions to native instructions - Google Patents

Hardware acceleration components for translating guest instructions to native instructions

Info

Publication number
TW201245976A
Authority
TW
Taiwan
Prior art keywords
conversion
buffer
instruction
block
client
Prior art date
Application number
TW101102835A
Other languages
Chinese (zh)
Other versions
TWI512498B (en)
Inventor
Mohammad Abdallah
Original Assignee
Soft Machines Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Soft Machines Inc filed Critical Soft Machines Inc
Publication of TW201245976A publication Critical patent/TW201245976A/en
Application granted granted Critical
Publication of TWI512498B publication Critical patent/TWI512498B/en


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 9/00 Arrangements for program control, e.g. control units
                    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30003 Arrangements for executing specific machine instructions
                                • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
                                    • G06F 9/30018 Bit or string instructions
                                    • G06F 9/30021 Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
                                    • G06F 9/30029 Logical and Boolean instructions, e.g. XOR, NOT
                            • G06F 9/30098 Register arrangements
                            • G06F 9/3017 Runtime instruction translation, e.g. macros
                                • G06F 9/30174 Runtime instruction translation for non-native instruction set, e.g. Java byte code, legacy code
                            • G06F 9/30181 Instruction operation extension or modification
                                • G06F 9/30189 Instruction operation extension or modification according to execution mode, e.g. mode flag
                            • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
                                • G06F 9/3802 Instruction prefetching
                                    • G06F 9/3814 Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
                                • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
                                    • G06F 9/3842 Speculative instruction execution
                                        • G06F 9/3844 Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
                                • G06F 9/3854 Instruction completion, e.g. retiring, committing or graduating
                                    • G06F 9/3858 Result writeback, i.e. updating the architectural state or memory
                                • G06F 9/3861 Recovery, e.g. branch miss-prediction, exception handling
                                    • G06F 9/3863 Recovery using multiple copies of the architectural state, e.g. shadow registers
                                • G06F 9/3867 Concurrent instruction execution using instruction pipelines
                                • G06F 9/3877 Concurrent instruction execution using a slave processor, e.g. coprocessor
                                    • G06F 9/3879 Concurrent instruction execution using a slave processor for non-native instruction execution, e.g. executing a command; for Java instruction set
                    • G06F 9/44 Arrangements for executing specific programs
                        • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
                            • G06F 9/45504 Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
                                • G06F 9/45516 Runtime code conversion or optimisation
                • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
                    • G06F 12/02 Addressing or allocation; Relocation
                        • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
                            • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
                                • G06F 12/0875 Addressing of a memory level with dedicated cache, e.g. instruction or stack
                • G06F 15/00 Digital computers in general; Data processing equipment in general
                    • G06F 15/76 Architectures of general purpose stored program computers
                        • G06F 15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
                            • G06F 15/8007 Architectures comprising single instruction multiple data [SIMD] multiprocessors
                • G06F 2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
                    • G06F 2212/45 Caching of specific data in cache memory
                        • G06F 2212/452 Instruction code

Abstract

A hardware based translation accelerator. The hardware includes a guest fetch logic component for accessing guest instructions; a guest fetch buffer coupled to the guest fetch logic component and a branch prediction component for assembling guest instructions into a guest instruction block; and conversion tables coupled to the guest fetch buffer for translating the guest instruction block into a corresponding native conversion block. The hardware further includes a native cache coupled to the conversion tables for storing the corresponding native conversion block, and a conversion look aside buffer coupled to the native cache for storing a mapping of the guest instruction block to corresponding native conversion block, wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates the guest instruction has a corresponding converted native instruction in the native cache.
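As a rough sketch of the data path the abstract describes, the lookup-then-translate flow can be modeled as follows. The component names mirror the abstract; the translation step itself (`translate_block`) is a hypothetical stand-in for the conversion-table hardware, not the patented implementation.

```python
# Minimal model of the accelerator's lookup path: the CLB maps a guest
# block address to its native conversion block, so a hit skips the
# translation step entirely. `translate_block` is a stand-in for the
# conversion tables (here it just prefixes each instruction).

def translate_block(guest_block):
    # Hypothetical per-instruction conversion for illustration only.
    return ["native_" + insn for insn in guest_block]

class Accelerator:
    def __init__(self):
        self.clb = {}           # guest block address -> native cache key
        self.native_cache = {}  # native cache key -> native instructions
        self.hits = 0
        self.misses = 0

    def fetch(self, guest_addr, guest_block):
        key = self.clb.get(guest_addr)
        if key is not None:             # CLB hit: forward cached native code
            self.hits += 1
            return self.native_cache[key]
        self.misses += 1                # CLB miss: translate and install
        native = translate_block(guest_block)
        self.native_cache[guest_addr] = native
        self.clb[guest_addr] = guest_addr
        return native

acc = Accelerator()
block = ["load", "add", "branch"]
first = acc.fetch(0x1000, block)   # miss: block is translated and cached
second = acc.fetch(0x1000, block)  # hit: served from the native cache
```

The key property is that repeated requests for the same guest block never re-run the translation step.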

Description

VI. Description of the Invention:

[Cross Reference to Related Application]

This application claims the benefit of co-pending, commonly assigned U.S. Provisional Patent Application No. 61/436,966, filed January 27, 2011 by Mohammad A. Abdallah, entitled "Hardware acceleration components for translating guest instructions to native instructions," which is incorporated herein by reference in its entirety.

[Technical Field]

The present invention relates generally to digital computer systems, and more particularly to a system and method for translating instructions comprising an instruction sequence.

[Prior Art]

Many types of digital computer systems implement software functionality using code transformation/translation or emulation. In general, both translation and emulation involve examining a program of software instructions and performing the functions and actions specified by those instructions, even though the instructions are not "native" to the computer system. In the case of translation, the non-native instructions are translated into the form of native instructions, which are designed to execute on the hardware of the computer system. Examples include prior-art translation software and/or hardware that operates with industry-standard x86 applications to enable those applications to execute on non-x86 or alternative computer architectures. In general, the translation process utilizes a large number of processor cycles and thus imposes a substantial amount of overhead. The performance penalty caused by this overhead can substantially erase any benefit provided by the translation process.

One attempt at solving this problem involves the use of just-in-time compilation. Just-in-time (JIT) compilation, also known as dynamic translation, is a method of improving the runtime performance of computer programs. Traditionally, computer programs have had two runtime transformation modes: an interpretation mode and a JIT compilation/translation mode. Interpretation is a decoding process that decodes instruction by instruction, transforming guest code into native code with lower overhead than JIT compilation, but producing transformed code that performs less well. Additionally, the interpretation is invoked for every instruction. A JIT compiler or translator represents a contrasting approach to interpretation. JIT conversion usually has a higher overhead than an interpreter, but it produces translated code that is more optimized and performs better. In most implementations, the first time a translation is needed it is performed as an interpretation to reduce overhead; after the code has been seen and executed many times, JIT translation is invoked to create a more optimized translation.

However, the translation process still presents a number of problems. The JIT compilation process itself imposes a significant amount of overhead on the processor, which can cause a large delay in the startup of an application. Additionally, managing the storage of transformed code in system memory causes multiple round trips to system memory and includes memory mapping and allocation management overhead, which imposes a significant latency penalty. Furthermore, changes to the region of execution in the application involve relocating the transformed code in system memory and the code cache, and starting the process over from the beginning. The interpretation process involves less overhead than JIT translation, but its overhead is repeated per instruction and is thus still relatively significant, and the code it produces is poorly optimized.

[Summary of the Invention]

Embodiments of the present invention implement an algorithm and an apparatus that realize hardware-based acceleration of a guest-instruction-to-native-instruction translation process.

In one embodiment, the present invention is implemented as a hardware-based translation accelerator. The hardware-based translation accelerator includes: a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer, coupled to the guest fetch logic component and a branch prediction component, for assembling the plurality of guest instructions into a guest instruction block; and a plurality of conversion tables, coupled to the guest fetch buffer, for translating the guest instruction block into a corresponding native conversion block.

The hardware-based translation accelerator further includes: a native cache, coupled to the conversion tables, for storing the corresponding native conversion block; and a conversion look aside buffer (CLB), coupled to the native cache, for storing a mapping of the guest instruction block to the corresponding native conversion block, wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, the mapping indicating that the guest instruction has a corresponding converted native instruction in the native cache. In response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

[Embodiments]
In the following detailed description, numerous specific details are set forth, such as specific method orders, structures, elements, and connections. It should be understood, however, that these and other specific details need not be utilized to practice embodiments of the present invention. In other circumstances, well-known structures, elements, or connections have been omitted, or have not been described in particular detail, in order to avoid unnecessarily obscuring this description.

References within the specification to "one embodiment" or "an embodiment" indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearances of the phrase "in one embodiment" in various places within the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described that may be exhibited by some embodiments and not by others. Similarly, various requirements are described that may be requirements for some embodiments but not for others.

Some portions of the detailed description that follows are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, step, logic block, process, or the like is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, and as is apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "accessing" refer to the action and processes of a computer system, or a similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers, memories, and other computer-readable media into other data similarly represented as physical quantities within the computer system's memories, registers, or other such information storage, transmission, or display devices.

Embodiments of the present invention function by greatly accelerating the process of translating guest instructions of a guest instruction architecture into native instructions of a native instruction architecture for execution on a native processor. Embodiments of the present invention utilize hardware units to implement the acceleration of the conversion process. The guest instructions can come from a number of different instruction architectures. Example architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. These guest instructions are rapidly converted into native instructions and pipelined to the native processor hardware for rapid execution. This provides a much higher level of performance in comparison to traditional software-controlled conversion processes.

In a number of embodiments, the present invention implements a flexible conversion process that can use a number of different instruction architectures as inputs. In such embodiments, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware-accelerated conversion processing to deliver a much higher level of performance. Such an implementation delivers benefits on a number of fronts: different guest architectures can be processed and converted, with each guest architecture receiving the benefit of the hardware acceleration to enjoy a much higher level of performance; and the software-controlled front end can provide a great degree of flexibility for applications executing on the processor. The hardware acceleration can achieve near-native-hardware-speed execution of the guest instructions of a guest application. In the description that follows, Figures 1 through 4 show the manner in which embodiments of the present invention handle guest instruction sequences and handle near branches and far branches within those guest instruction sequences. Figure 5 shows an overview of an exemplary hardware-accelerated conversion processing system in accordance with one embodiment of the present invention.

Figure 1 shows an exemplary sequence of instructions operated on by one embodiment of the present invention. As depicted in Figure 1, the instruction sequence 100 comprises 16 instructions, proceeding from the top of Figure 1 to the bottom. As can be seen in Figure 1, the sequence 100 includes four branch instructions 101-104.

One objective of embodiments of the present invention is to process entire groups of instructions as a single atomic unit. This atomic unit is referred to as a "block." An instruction block can extend well past the 16 instructions shown in Figure 1. In one embodiment, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like), or until an exit condition is encountered. In one embodiment, the exit condition that concludes an instruction block is the encounter of a far branch instruction. As used in the description of the embodiments herein, a far branch refers to a branch instruction whose target address resides outside the current instruction block. In other words, within a given guest instruction block, a far branch has a target that resides in some other block or in some other instruction sequence outside the given instruction block. Similarly, a near branch refers to a branch instruction whose target address resides inside the current instruction block. Additionally, it should be noted that a native instruction block can contain multiple guest far branches. These terms are further described in the discussion below.

Figure 2 shows a diagram depicting a block-based conversion process, in which guest instruction blocks are converted into native conversion blocks, in accordance with one embodiment of the present invention. As illustrated in Figure 2, a plurality of guest instruction blocks 201 are shown being converted into a corresponding plurality of native conversion blocks 202.

Embodiments of the present invention function by converting the instructions of a guest instruction block into the corresponding instructions of a native conversion block. Each of the blocks 201 is made up of guest instructions. As described above, these guest instructions can come from a number of different guest instruction architectures (e.g., Java or JavaScript, x86, MIPS, SPARC, and the like). Multiple guest instruction blocks can be converted into one or more corresponding native conversion blocks. This conversion occurs on a per-instruction basis.

Figure 2 also illustrates the manner in which guest instruction blocks are assembled into sequences based on branch prediction. This attribute enables embodiments of the present invention to assemble sequences of guest instructions based on the predicted outcomes of far branches.
Based on far branch prediction, a sequence of guest instructions is assembled from multiple guest instruction blocks and converted into corresponding native conversion blocks. This approach is further described below in Figure 3 and Figure 4.

Figure 3 shows a diagram illustrating the manner in which each instruction of a guest instruction block is converted into a corresponding instruction of a native conversion block, in accordance with one embodiment of the present invention. As shown in Figure 3, the guest instruction block resides within a guest instruction buffer 301. Similarly, the native conversion block resides within a native instruction buffer 302.

Figure 3 shows an attribute of embodiments of the present invention, in which the target addresses of guest branch instructions are converted into target addresses of native branch instructions. For example, the guest instruction branches each include an offset that identifies the target address of the particular branch. Because the native instructions produce functionality corresponding to the guest instructions, their sequence differs; for example, the length of a guest instruction can differ from that of its corresponding native instruction. Therefore, the conversion process compensates for this difference by computing the corresponding native offset. This is shown in Figure 3 as the native offset, or N_offset.

It should be noted that because branches that have targets within a guest instruction block (referred to as "near branches") are not predicted, they do not alter the instruction sequence flow.

Figure 4 shows a diagram illustrating the manner in which far branches are processed with the handling of native conversion blocks, in accordance with one embodiment of the present invention. As shown in Figure 4, the guest instructions are depicted as a guest instruction sequence 401 in memory. Similarly, the native instructions are depicted as a native instruction sequence 402 in memory.

In one embodiment, every instruction block, both guest instruction blocks and native instruction blocks, concludes with a far branch (e.g., even though a native block may contain multiple guest far branches). As described above, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like), or until an exit condition, such as the last guest far branch instruction, is encountered. If a number of guest instructions have been processed to assemble a guest instruction block and a far branch has not been encountered, then a guest far branch is inserted to conclude the block. This far branch is merely a jump to the next subsequent block. This ensures that instruction blocks conclude with a branch that leads to either another native instruction block or another sequence of guest instructions in memory. Additionally, as shown in Figure 4, a block can include a guest far branch within its sequence of instructions that does not reside at the end of the block. This is shown by the guest instruction far branch 411 and the corresponding native instruction guest far branch 412.

In the Figure 4 embodiment, the far branch 411 is predicted taken. Thus the instruction sequence jumps to the target of the far branch 411, which is the guest instruction F. Similarly, in the corresponding native instructions, the far branch 412 is followed by the native instruction F. The near branches are not predicted. Thus, these near branches do not alter the instruction sequence in the same manner as far branches.

In this manner, embodiments of the present invention generate a trace of conversion blocks, in which each block comprises a number (e.g., 3-4) of far branches. This trace is based on guest far branch predictions.

In one embodiment, the far branches within the native conversion block include a guest address for the opposite address of the branch path. As described above, a sequence of instructions is generated based on the prediction of far branches. The true outcome of the prediction will not be known until the corresponding native conversion block is executed. Thus, once a false prediction is detected, the false far branch is examined to obtain the opposite guest address for the opposing branch path. The conversion process then continues from that opposite guest address, which is now the true branch path. In this manner, embodiments of the present invention use the included opposite guest address for the opposing branch path to recover from occasions where the predicted outcome of a far branch is false. Hence, if a far branch predicted outcome is false, the process knows where to find the correct guest instruction. Similarly, far branches that are correctly predicted within a native instruction block do not require an entry point for their target block in the CLB.
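The G_offset-to-N_offset recomputation of Figure 3 can be sketched as follows. Because guest and native instructions differ in length, a byte offset into the guest block must be recomputed for the native block. The per-instruction lengths below are invented for illustration.

```python
# Sketch of the G_offset -> N_offset calculation (Figure 3).
# Walk the two blocks in lockstep, accumulating lengths, until the
# guest byte offset is reached; the accumulated native length is the
# corresponding native offset.

guest_lengths  = [2, 3, 1, 4, 2]   # bytes per guest instruction (invented)
native_lengths = [4, 4, 4, 8, 4]   # bytes per corresponding native instruction

def guest_offset_to_native(g_offset):
    """Map a byte offset landing on a guest instruction boundary to the
    byte offset of the corresponding native instruction."""
    g_pos = n_pos = 0
    for g_len, n_len in zip(guest_lengths, native_lengths):
        if g_pos == g_offset:
            return n_pos
        g_pos += g_len
        n_pos += n_len
    if g_pos == g_offset:          # offset just past the last instruction
        return n_pos
    raise ValueError("offset does not land on an instruction boundary")

n_off = guest_offset_to_native(2 + 3 + 1)   # branch targets the 4th guest insn
```

A guest branch whose G_offset is 6 (past three guest instructions of 2, 3, and 1 bytes) is rewritten with an N_offset of 12 (past three 4-byte native instructions).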
When a far branch prediction misses, however, a new entry for the target block needs to be inserted in the CLB. This functionality is performed with the goal of preserving CLB capacity.

Figure 5 shows a diagram of an exemplary hardware-accelerated conversion system 500, illustrating the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache, in accordance with one embodiment of the present invention. As illustrated in Figure 5, a conversion look aside buffer 506 is used to cache the address mappings between guest and native blocks, such that the most frequently encountered native conversion blocks are accessed by the processor 508 through low-latency availability.

The Figure 5 diagram illustrates the manner in which frequently encountered native conversion blocks are maintained within a high-speed low-latency cache memory, the conversion look aside buffer 506. The components depicted in Figure 5 implement hardware-accelerated conversion processing to deliver a much higher level of performance.

The guest fetch logic unit 502 functions as a hardware-based guest instruction fetch unit that fetches guest instructions from the system memory 501. The guest instructions of a given application reside within system memory 501. Upon initiation of a program, the hardware-based guest fetch logic unit 502 starts prefetching guest instructions into a guest fetch buffer 503. The guest fetch buffer 503 accumulates the guest instructions and assembles them into guest instruction blocks. These guest instruction blocks are converted into corresponding native conversion blocks by using the conversion tables 504. The converted native instructions are accumulated within the native conversion buffer 505 until the native conversion block is complete. The native conversion block is then transferred to the native cache 507, and the mappings are stored in the conversion look aside buffer 506. The native cache 507 is then used to feed native instructions to the processor 508 for execution. In one embodiment, the functionality implemented by the guest fetch logic unit 502 is produced by a guest fetch logic state machine.

As this process continues, the conversion look aside buffer 506 is filled with address mappings of guest blocks to native blocks. The conversion look aside buffer 506 uses one or more algorithms (e.g., least recently used, etc.) to ensure that block mappings that are encountered more frequently are kept within the buffer, while block mappings that are rarely encountered are evicted from the buffer. In this manner, hot native conversion block mappings are stored within the conversion look aside buffer 506. Additionally, it should be noted that well-predicted far guest branches within a native block do not need to insert new mappings into the CLB, because their target blocks are stitched within a single mapped native block, thereby preserving the small capacity efficiency of the CLB structure. Furthermore, in one embodiment, the CLB is structured to store only the ending guest-to-native address mappings. This aspect also preserves the small capacity efficiency of the CLB.

The guest fetch logic 502 looks to the conversion look aside buffer 506 to determine whether the addresses of a guest instruction block have already been converted to a native conversion block. As described above, embodiments of the present invention provide hardware acceleration for the conversion processing. Hence, the guest fetch logic 502 will look to the conversion look aside buffer 506 for a pre-existing native conversion block mapping prior to fetching a guest address from the system memory 501 for a new conversion.

In one embodiment, the conversion look aside buffer is indexed by guest address ranges, or by individual guest addresses. The guest address ranges are the ranges of addresses of the guest instruction blocks that have been converted into native conversion blocks. The native conversion block mappings stored by the conversion look aside buffer are indexed via the corresponding guest address ranges of their corresponding guest instruction blocks. Hence, the guest fetch logic can compare a guest address with the guest address ranges or the individual guest addresses of converted blocks, the mappings of which are kept in the conversion look aside buffer, to determine whether a pre-existing native conversion block resides within what is stored in the native cache 507 or in the code cache of Figure 6. If the pre-existing native conversion block is in the native cache or in the code cache, the corresponding converted native instructions are forwarded from those caches directly to the processor.

In this manner, hot guest instruction blocks (e.g., guest instruction blocks that are frequently executed) have their corresponding hot native conversion block mappings maintained within the high-speed low-latency conversion look aside buffer 506. As blocks are touched, an appropriate replacement policy ensures that the hot block mappings remain within the conversion look aside buffer. Hence, the guest fetch logic 502 can quickly identify whether a requested guest address has been previously converted, and can forward the previously converted native instructions directly to the native cache 507 for execution by the processor 508. These aspects save a large number of cycles, since a trip to system memory can take 40 to 50 cycles or more. These attributes (e.g., the CLB, guest branch sequence prediction, guest and native branch buffers, and prior native caching) allow the hardware acceleration functionality of embodiments of the present invention to achieve application performance of a guest application to within 80% to 100% of the application performance of a comparable native application.

In one embodiment, the guest fetch logic 502 continually prefetches guest instructions for conversion, independent of guest instruction requests from the processor 508. Native conversion blocks can be accumulated within a conversion buffer "code cache" in the system memory 501 for those less frequently used blocks. The conversion look aside buffer 506 also keeps the most frequently used mappings. Thus, if a requested guest address does not map to a guest address in the conversion look aside buffer, the guest fetch logic can check the system memory 501 to determine whether the guest address corresponds to a native conversion block stored therein.

In one embodiment, the conversion look aside buffer 506 is implemented as a cache and utilizes a cache coherency protocol.
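The replacement behavior described for the CLB — keep the most recently touched guest-to-native mappings, evict the least recently used — can be sketched as a small fixed-capacity map. The capacity and addresses below are invented for illustration.

```python
# Sketch of LRU replacement in a small CLB-like mapping store.
from collections import OrderedDict

class CLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.map = OrderedDict()    # guest address -> native block pointer

    def lookup(self, guest_addr):
        if guest_addr in self.map:
            self.map.move_to_end(guest_addr)    # mark as recently used
            return self.map[guest_addr]
        return None                             # miss: caller must convert

    def insert(self, guest_addr, native_ptr):
        self.map[guest_addr] = native_ptr
        self.map.move_to_end(guest_addr)
        if len(self.map) > self.capacity:
            self.map.popitem(last=False)        # evict least recently used

clb = CLB(capacity=2)
clb.insert(0x100, "N0")
clb.insert(0x200, "N1")
clb.lookup(0x100)           # touch 0x100, so 0x200 becomes the LRU entry
clb.insert(0x300, "N2")     # exceeds capacity: evicts 0x200
```

The hot mapping (0x100) survives the eviction even though it was inserted first, because the lookup refreshed its recency.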
The conversion look aside buffer 506 uses this cache coherency protocol to maintain coherency with the much larger conversion buffers stored in the higher levels of cache and in the system memory 501. The native instruction mappings that are stored within the conversion look aside buffer 506 are also written back to the higher levels of cache and to the system memory 501. These write-backs to system memory maintain coherency. Hence, cache management protocols can be used to ensure that the hot native conversion block mappings are stored within the conversion look aside buffer 506 while the cold native conversion block mappings are stored in the system memory 501. Hence, a much larger form of the conversion buffer 506 resides in the system memory 501.

It should be noted that, in one embodiment, the exemplary hardware-accelerated conversion system 500 can be used to implement a number of different virtual storage schemes. For example, the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache can be used to support a virtual storage scheme. Similarly, the conversion look aside buffer 506, which is used to cache the address mappings between guest and native blocks, can be used to support the virtual storage scheme (e.g., the management of virtual-to-physical memory mappings).

In one embodiment, the Figure 5 architecture implements a virtual instruction set processor/computer that uses a flexible conversion process capable of receiving a number of different instruction architectures as inputs. In such a virtual instruction set processor, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware-accelerated conversion processing to deliver a much higher level of performance. With such an implementation, different guest architectures can be processed and converted, with each guest architecture receiving the benefits of the hardware acceleration to enjoy a much higher level of performance. Example guest architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. In one embodiment, the "guest architecture" can be native instructions (e.g., from a native application/macro-operation), and the conversion process produces optimized native instructions (e.g., optimized native instructions/micro-operations). The software-controlled front end can provide a large degree of flexibility for applications executing on the processor. As described above, the hardware acceleration can achieve near-native-hardware-speed execution of the guest instructions of a guest application.

Figure 6 shows a detailed example of a hardware-accelerated conversion system 600 in accordance with one embodiment of the present invention. The system 600 performs in substantially the same manner as the system 500 described above. However, the system 600 shows additional details describing the functionality of an exemplary hardware acceleration process.

The system memory 601 includes the data structures comprising the guest code 602, the conversion look aside buffer 603, the optimizer code 604, the converter code 605, and the native code cache 606. The system 600 also shows a shared hardware cache 607 in which guest instructions and native instructions can both be interleaved and shared. The guest hardware cache 610 caches those guest instructions that are most frequently touched from the shared hardware cache 607.

The guest fetch logic 620 prefetches guest instructions from the guest code 602. The guest fetch logic 620 interfaces with a TLB 609, which functions as a translation look aside buffer that translates virtual guest addresses into corresponding physical guest addresses. The TLB 609 can forward hits directly to the guest hardware cache 610. The guest instructions that are fetched by the guest fetch logic 620 are stored in the guest fetch buffer 611.

The conversion tables 612 and 613 include substitute fields and control fields, and function as multilevel conversion tables for translating guest instructions received from the guest fetch buffer 611 into native instructions.

The multiplexers 614 and 615 transfer the converted native instructions to a native conversion buffer 616. The native conversion buffer 616 accumulates the converted native instructions to assemble native conversion blocks. These native conversion blocks are then transferred to the native hardware cache 608, and the mappings are kept in the conversion look aside buffer 630.

The conversion look aside buffer 630 includes the data structures for the guest branch address 631, the native address 632, the converted block range 633, the code cache and conversion look aside buffer management bits 634, and the dynamic branch bias bits 635. The guest branch address 631 and the native address 632 comprise a guest address range that indicates which corresponding native conversion blocks reside within the converted block range 633. Cache management protocols and replacement policies ensure that the hot native conversion block mappings reside within the conversion look aside buffer 630, while the cold native conversion block mappings reside within the conversion look aside buffer data structure 603 of the system memory 601.

As with the system 500, the system 600 seeks to ensure that the hot block mappings reside within the high-speed low-latency conversion look aside buffer 630. Thus, in one embodiment, when the fetch logic 640 or the guest fetch logic 620 looks to fetch a guest address, the fetch logic 640 can first check the guest address to determine whether the corresponding native conversion block resides within the native code cache 606. This allows a determination as to whether the requested guest address has a corresponding native conversion block in the native code cache 606. If the requested guest address does not reside within the buffer 603 or 608, or within the buffer 640, the guest address and a number of subsequent guest instructions are fetched from the guest code 602, and the conversion process is implemented via the conversion tables 612 and 613.
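The two-level arrangement described above — a small buffer holding hot mappings, with cold mappings written back to a larger system-memory store rather than discarded — can be sketched as follows. The capacities and the simple FIFO-style victim choice are invented for illustration; the patent's own policy (e.g., LRU with management bits) would replace them.

```python
# Sketch of a hot/cold two-level mapping store: evictions from the
# small "CLB" are written back to system memory, so a later CLB miss
# can still be satisfied without re-running the converter.

class TwoLevelStore:
    def __init__(self, clb_capacity=2):
        self.clb_capacity = clb_capacity
        self.clb = {}            # hot mappings, bounded
        self.clb_order = []      # insertion order, used to pick a victim
        self.system_memory = {}  # cold mappings, effectively unbounded

    def insert(self, guest_addr, native_ptr):
        if len(self.clb) == self.clb_capacity:
            victim = self.clb_order.pop(0)
            self.system_memory[victim] = self.clb.pop(victim)  # write back
        self.clb[guest_addr] = native_ptr
        self.clb_order.append(guest_addr)

    def lookup(self, guest_addr):
        if guest_addr in self.clb:
            return self.clb[guest_addr], "clb"
        if guest_addr in self.system_memory:    # promote back into the CLB
            ptr = self.system_memory.pop(guest_addr)
            self.insert(guest_addr, ptr)
            return ptr, "system_memory"
        return None, "convert"                  # must run the converter

store = TwoLevelStore()
store.insert(0xA, "N_A")
store.insert(0xB, "N_B")
store.insert(0xC, "N_C")            # 0xA is written back to system memory
hit_level = store.lookup(0xA)[1]    # found in system memory, then promoted
```

Only an address absent from both levels forces a fresh conversion.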
Figure 7 shows an example of a hardware-accelerated conversion system 700 having a secondary software-based accelerated conversion pipeline, in accordance with one embodiment of the present invention.

The components 711-716 comprise a software-implemented load store path that is instantiated within a special high-speed memory 760. As depicted in Figure 7, the guest fetch buffer 711, the conversion tables 712-713, and the native conversion buffer 716 comprise allocated portions of the special high-speed memory 760. In many respects, the special high-speed memory 760 functions as a very low-level fast cache (e.g., an L0 cache).

The arrow 761 illustrates the attribute whereby the conversions are accelerated via a load store path, as opposed to an instruction fetch path (e.g., from the fetched decode logic).

In the Figure 7 embodiment, the high-speed memory 760 includes special logic for doing comparisons. Because of this, the conversion acceleration can be implemented in software. For example, in another embodiment, the standard memory 760 that stores the components 711-716 is manipulated by software that uses a processor execution pipeline, where the software loads values from the components 711-716 into one or more SIMD registers and implements a compare instruction that performs a compare between the fields in the SIMD registers and, as needed, performs a mask operation and a result scan operation. The load store path can be implemented using general-purpose microprocessor hardware, such as, for example, by using compare instructions that compare one to many.

It should be noted that the memory 760 is accessed by instructions that have special attributes or address ranges. For example, in one embodiment, the guest fetch buffer has an ID for each guest instruction entry. The ID is created per guest instruction. This ID allows easy mapping from the guest buffer to the native conversion buffer. The ID allows an easy calculation of the guest offset to the corresponding native offset, irrespective of the differing lengths of the guest instructions in comparison to their corresponding native instructions. This aspect is diagrammed in Figure 3 above.

In one embodiment, the ID is calculated by hardware that uses a length decoder to calculate the length of the fetched guest instruction. However, it should be noted that this functionality can be performed in hardware or in software.

Once IDs have been assigned, the native instruction buffer can be accessed via the ID. The ID allows the conversion of the offset from the guest offset to the native offset.

Figure 8 shows an exemplary flow diagram illustrating a CLB functioning in conjunction with a code cache and the guest-instruction-to-native-instruction mappings stored within memory, in accordance with one embodiment of the present invention.

As described above, the CLB is used to store mappings of guest addresses that have corresponding converted native addresses stored within the code cache memory (e.g., the guest-to-native address mappings). In one embodiment, the CLB is indexed with a portion of the guest address. The guest address is partitioned into an index, a tag, and an offset (e.g., chunk size). This guest address comprises a tag that identifies a match in the CLB entry corresponding to the index. If there is a hit on the tag, the corresponding entry will store a pointer that indicates where in the code cache memory 806 the corresponding converted native instruction chunk (e.g., the corresponding block of converted native instructions) can be found.

It should be noted that the term "chunk" is used herein to refer to the corresponding memory size of a converted native instruction block. For example, chunks can differ in size in accordance with the differing sizes of converted native instruction blocks.

With respect to the code cache memory 806, in one embodiment, the code cache is allocated in a set of fixed-size chunks (e.g., with different sizes for each chunk type). The code cache can be partitioned logically into sets and ways in system memory and in all lower-level hardware caches (e.g., the native hardware cache 608 and the shared hardware cache 607). The CLB can use the guest address to index and tag-compare the way tags for the code cache chunks.

Figure 8 depicts the CLB hardware cache 804 storing guest address tags in two ways, depicted as way x and way y. It should be noted that, in one embodiment, the mapping of guest addresses to native addresses using the CLB structures can be accomplished by storing pointers to the native code chunks in the structured ways.
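The index/tag/offset partitioning of the guest address, together with the chunk addressing of Figure 8, can be sketched as follows. All field widths, the segment base, and the chunk geometry below are invented for illustration; the second address formula follows the form "seg# + index * (64 chunk size) + way# * (chunk size)" given in the description.

```python
# Sketch of the Figure 8 addressing: split the guest address into
# tag/index/offset, then compute the native chunk address from the
# segment base, the index, and the matching way number.

CHUNK_SIZE = 64        # bytes per native code chunk (invented)
WAYS_PER_SET = 64      # chunks addressed per index entry (invented)
INDEX_BITS = 10
OFFSET_BITS = 6        # log2(CHUNK_SIZE)

def split_guest_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def native_chunk_address(seg_base, index, way):
    # native address of code chunk = seg# + index*(ways*chunk) + way#*chunk
    return seg_base + index * (WAYS_PER_SET * CHUNK_SIZE) + way * CHUNK_SIZE

tag, index, offset = split_guest_address(0x12345)
addr = native_chunk_address(seg_base=0x100000, index=index, way=3)
```

The tag is what gets compared against the stored way tags; only on a tag hit is the computed chunk address used to fetch native code.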
方式結構化(way structuring),使得在 CLB的方式及程式碼快取記憶塊的方式之間存在ι:1映射。當 在特定CLB方式中發生命中時,即程式碼快取的對應方式中的 對應程式碼記憶塊具有本機碼。 仍然參考圖8,如果CLB的索引未中(miss),則可檢查記憶 體的較高階層是否有命中(如’ L1快取、L2快取等)。如果在這 些較高快取等級中沒有發生命中,則檢查系統記憶體8〇1中的 位址。在一具體實施例中’客戶索引指向包含例如64個記憶塊 的項目》讀出64個記憶塊中各者的標籤,並將其與客戶標籤進 行比較,以決定是否發生命中。此程序在圖8中以虛線方框8〇5 顯示。如果在與系統記憶體中的標籤比較之後沒有任何命中, 22 201245976 則在記憶體的任何階層等級不存在轉換,及必須轉換客户指令。 、應注%、,本發明之具體實施例管理以類似快取方式儲存客 戶至本機指令映射之記憶體的各個階層等級。這原本來自快取 為基心隐體(cache_based _〇ry)(如,clb硬體快取、本機快 ^ U及U快取等)。然而,CLB亦包括「程式碼快取+CLB 笞理位元」,其係用來實施系統記憶體内之客戶至本機指令 映,最,最少使用(least recently used,LRU)替換管理策略。 在-具體實施例中’ CLB管理位元(如,⑽位元)為軟體所管 以此方式,使用記憶體的所有階層等級儲存最近使用之最 吊遇到的客戶至本機指令映射。對應地,這導致記憶體的所有 階層等級同樣地儲存最常遇到的轉換本機指令。 “圖8亦顯示CLB中儲存的動態分支偏差值位元及/或分支歷 史記錄位元(branch history bit)。使用這些動態分支位元追蹤用於 、、且5各戶4曰令序列之分支預測的行為。使用這些位元追蹤哪些 分支預測是最常正確預測的,及哪些分支預測是最常不正確預 測的。CLB亦儲存轉換區塊範圍的資料。此資料使程序能夠讓 程式碼快取記憶體中對應之客戶指令已修改(如,在自我修改程 式碼中)的轉換區塊範圍變成無效。 圖9顯示根據本發明之一具體實施例圖解實體儲存堆疊快 取實施方案(physical storage stack cache implementation)及客戶 位址至本機位址映射的例示性流程圖。如圖9所描繪,可將快 取實施為實體儲存堆疊901。 23 201245976 圖9具體實施例圖解可將程式碼快取實施為可變社構快取 (wiabie structure cache)的方式。取決於不同具體實施口例的需 求,可變結構快取可完全為硬體實施及控制、完全為軟體實施 及控制、或軟體強制及控制與基本硬體啟用的部分混合。 圖9具體實施例有關在管理分配及替換客戶至本機位址映 射及其在實際實體儲存巾的對應轉譯的讀之間求取最佳平 衡。在本具體實施例中,此可透過使用結合指標與可變大小記 憶塊的結構來完成。 ° 使用多方式標籤陣列(则lti_way tag array)來儲存用於不同 大小群組之實體儲存的指標。每次需要分配特定的儲存大小 (如,其中儲存大小對應於位址)時,則據此分配各對應於此大小 之儲存區塊的群組。此舉允許本發明之具體實施例精確地分配 儲存,以儲存可變大小的指令軌跡。圖9顯示群組如何可以屬 於不同大小。顯不兩個例示性群組大小:「群組大小4的替換候 選者」及「群組大小2的替換候選者」。除了對應於位址的標藏 之外’指標係也儲存在將位址映射成實體儲存位址的標鐵陣列 中。標籤可包含兩個或兩個以上的子標籤。例如,標籤結構9〇2 中最前面的3個標藏包含分別如所顯示的子標藏ai⑴、A2 C2 D2、及A3 B3。因此’標籤A2 B2 C2叱包含群組大小4, 而標籤A1 B1包含群組大小2。群組大小遮罩㈣叩如聰p 亦指示群組的大小。 只體可像堆疊般管理,致使每次分配新的群組 時,謂新的群組放置在實體儲存堆疊的頂部。藉由覆寫項目 24 201245976 的標籤而使項目失效’藉此恢復所分配的空間。 圖9亦顯示延伸方式標籤結構(extended way吨 stmcture)903。在一些情形中,標籤結構902中的項目在延伸方 式標籤結構903中將具有對應的項目。這取決於項目及標籤結 構是否具有設定之延伸方式位元(如,設定為一)。例如,設定^ 的延伸方式位元指示在延伸方式標籤結構中有對應的項目: 延伸方式標籤結構允許處理器以與標準標籤結構的不同方式延 伸參考區域性(locality 〇f reference)。因此,儘管以一個方式索 標籤結構902(如,索引⑴),但以不同方式(如,索引 伸方式標籤結構。 n 型的實施方案中’索引⑴可歧在索引(k)中的更多項 目。這是因為在大多數的限制中,主要的標籤結構· 方式標籤結構903大上許多,其中例如①可涵蓋聰個項目 (如’ 
For example, the index (j) can cover 1024 entries (e.g., 10 bits), while the index (k) can cover 256 entries (e.g., 8 bits).

This enables embodiments of the present invention to incorporate additional ways for matching traces that have become very hot (e.g., very frequently encountered). For example, if a match to a hot set cannot be found within the tag structure 902, then, by setting an extended way bit, the extended way tag structure can be used to store additional ways for the hot traces. It should be noted that this variable cache structure uses storage on the stack only as needed for what is cached there. In comparison with a cache that comprises fixed physical data storage for each and every cache set, this provides a highly efficient increase in effective storage capacity.

There can also be a bit that indicates that a set, or a group of sets, is cold (e.g., meaning that the sets have not been accessed in a long time). In this case, the stack storage for these sets looks like bubbles within the allocated stack storage. At that time, their allocation pointers can be claimed for other hot sets. This process is a storage reclamation process, in which, after a chunk has been allocated within the stack, the entire set to which the chunk belongs later becomes cold. The needed mechanisms and structures that facilitate this reclamation (not shown in Figure 9 in order to avoid cluttering or obscuring the aspects shown) are: a cold-set indicator for every set, and a reclamation process in which the pointers of the ways of the cold sets are reused for the ways of other hot sets. This allows the stack storage to be reclaimed by allocating new chunks onto it and by reusing the reclaimed pointers within the stack.

It should be noted that the Figure 9 embodiment is well suited to the use of standard memory, as opposed to specialized cache memory. This attribute is due to the fact that the physical storage stack is managed by reading the pointers, reading the indexes, and allocating address ranges. Specialized cache-based circuit structures are not needed in such an implementation.

It should be noted that, in one embodiment, the Figure 9 architecture can be used to implement data caches and caching schemes that do not involve conversion or code transformation. Accordingly, the Figure 9 architecture can be used to implement more standardized caches (e.g., an L2 data cache, or the like), providing a larger effective capacity in comparison to a conventional fixed-structure cache.

Figure 10 shows a diagram depicting additional exemplary details of a hardware-accelerated conversion system, in accordance with one embodiment of the present invention. A first-level table maps guest instructions to corresponding groups. A group mask and a tag function by matching a subfield of the guest instruction to the particular group to which it belongs. The mask obscures the irrelevant bits of the guest instruction pattern so that the relevant bits in particular are examined. The tables store the mask-tag pairs in a prioritized order, and the patterns are read in the priority direction (e.g., depicted as the top-to-bottom direction) to match a pattern. In this manner, the different masks examined in accordance with the mask-tag storage priority each perform the matching functionality in priority order, as illustrated for the first-level table and the second-level table 1004.

In this manner, each byte stream in the buffer is sent to the conversion tables, where each level of conversion table serially detects bit fields. When the relevant bit field is detected, the table substitutes the native equivalent of the field.

The table also produces a control field that helps the substitution process for this level as well as for the next-level table (e.g., the second-level table 1004). The next table uses the previous table's control field to identify the next relevant bit field, which is substituted with the native equivalent field. The second-level table can in turn produce control fields to help the first-level table, and so on. Once all of the guest bit fields have been substituted with native bit fields, the instruction is fully translated and is transferred to the native conversion buffer. The native conversion buffer is then written into the code cache, and its guest-to-native address mappings are logged in the CLB, as described above.

Figure 11A shows a diagram of an exemplary pattern matching process implemented by embodiments of the present invention. As depicted in Figure 11A, a destination is determined by the tag, the pattern, and the mask. The functionality of the pattern decoding comprises performing a bit compare (e.g., a bitwise XOR), performing a bit AND (e.g., a bitwise AND), and subsequently checking for all zero bits (e.g., a NOR of all of the bits).

Figure 11B shows a diagram 1100 of a SIMD-register-based pattern matching process, in accordance with one embodiment of the present invention. As depicted in the diagram 1100, four SIMD registers 1102-1105 are shown. These registers implement the functionality of the pattern decoding process as shown. An incoming pattern 1101 is used to perform a parallel bit compare (e.g., a bitwise XOR) on each of the tags, and the result performs a bit AND with the mask (e.g., a bitwise AND). The match indicator results are each stored in their respective SIMD locations as shown. A scan is then performed as shown, and the first true encountered by the scan among the SIMD elements is the element for which the equation (Pi XOR Ti) AND Mi = 0 for all i bits is true, where Pi is the respective pattern, Ti is the respective tag, and Mi is the respective mask.

Figure 12 shows a diagram of a unified register file 1201, in accordance with one embodiment of the present invention. As depicted in Figure 12, the unified register file 1201 includes two portions 1202-1203 and an entry selector 1205. The unified register file 1201 implements support for architecture speculation for hardware state updates.

The unified register file 1201 enables the implementation of an optimized shadow register and committed register state management process. This process supports architecture speculation for register updates and register state transitions, without requiring any cross copying between register memory. For example, in one embodiment, the functionality of the unified register file 1201 is largely provided by the entry selector 1205. In the Figure 12 embodiment, each register file entry is composed of two pairs of registers, R and R', which are from portion 1 and portion 2, respectively. At any given time, the register that is read from each entry is either R, from portion 1, or R', from portion 2. Based on the values of the x and y bits that the entry selector 1205 stores for each entry, there are four different combinations of register states.
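The match-and-scan equation given for Figures 11A and 11B — an element matches when (pattern XOR tag) AND mask equals zero, and the scan returns the first such element — can be sketched directly. The bit patterns below are invented for illustration.

```python
# Sketch of the Figure 11A/11B pattern decode: XOR against the tag,
# AND with the mask, check for all-zero, then scan for the first match.

def lane_matches(pattern, tag, mask):
    return ((pattern ^ tag) & mask) == 0    # XOR, AND, then all-zero check

def match_scan(pattern, tags, masks):
    """Return the index of the first lane i with (P XOR Ti) AND Mi == 0,
    or -1 if no lane matches."""
    for i, (tag, mask) in enumerate(zip(tags, masks)):
        if lane_matches(pattern, tag, mask):
            return i
    return -1

tags  = [0b1010_0000, 0b1011_0001, 0b0110_0001]
masks = [0b1111_1111, 0b1111_0000, 0b1111_0000]
hit = match_scan(0b1011_0111, tags, masks)
```

Here the second lane matches: its mask ignores the low nibble, so only the upper bits are compared against the incoming pattern.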
The four combinations of the x and y bits encode the following states, where R is read from portion 1 and R' is read from portion 2:

00: R invalid; R' committed (upon a read request, R' is read)
01: R speculative; R' committed (upon a read request, R is read)
10: R committed; R' speculative (upon a read request, R' is read)
11: R committed; R' invalid (upon a read request, R is read)

The impact of each instruction/event is as follows. Upon an instruction write-back, 00 becomes 01 and 11 becomes 10. Upon an instruction commit, 01 becomes 11 and 10 becomes 00. Upon the occurrence of a rollback event, 01 becomes 00 and 10 becomes 11.

These changes are stored as state transitions in the register file entry selector 1205, with the bit transitions occurring upon the corresponding events.

In this manner, execution can proceed within the shadow register state without destroying the committed register state. When the shadow register state is ready for committing, the register file entry selector is updated such that the valid results are read from whichever portion holds them, in the manner described above. In this way, by simply updating the register file entry selector as needed, speculative execution results can be rolled back to the most recent commit point in the event of an exception. Similarly, the commit point can be advanced forward, thereby committing the speculative execution results, simply by updating the register file entry selector. This functionality is provided without requiring any cross copying between register memories.

In this manner, the unified register file can implement, via the register file entry selector 1205, a plurality of speculative scratch shadow registers (SSSR) and a plurality of committed registers (CR). For example, on a commit, the SSSR registers become CR registers. Upon a rollback, the SSSR state is rolled back to the CR registers.

Figure 13 shows a diagram of a unified shadow register file and pipeline architecture 1300 that supports speculative architectural states and transient architectural states, in accordance with one embodiment of the present invention.

The Figure 13 embodiment depicts the components comprising the architecture 1300, which supports instructions and results comprising architecture speculation states and supports instructions and results comprising transient states. As used herein, a committed architecture state comprises visible registers and visible memory that can be accessed (e.g., read and written) by programs executing on the processor. In contrast, a speculative architecture state comprises registers and/or memory that is not committed and is therefore not globally visible.

In one embodiment, there are four usage models that are enabled by the architecture 1300. The first usage model includes architecture speculation for hardware state updates, as described above in the discussion of Figure 12.

The second usage model includes dual scope usage. This usage model applies to the fetching of two threads into the processor, where one thread executes in a speculative state and the other thread executes in the non-speculative state. In this usage model, both scopes are fetched into the machine and are present in the machine at the same time.

The third usage model includes the JIT (just-in-time) translation or compilation of instructions from one form to another. In this usage model, the reordering of architectural states is accomplished via software, for example, a JIT. The third usage model can apply to, for example, guest-to-native instruction translation, virtual-machine-to-native instruction translation, or the remapping/translating of native microinstructions into more optimized native microinstructions.

The fourth usage model includes transient context switching without the need to save and restore a prior context upon returning from the transient context. This usage model applies to context switches that may occur for a number of reasons. One such reason could be, for example, the precise handling of exceptions via an exception handling context. The second, third, and fourth usage models are further described in the discussions of Figures 14-17 below.

Referring again to Figure 13, the architecture 1300 includes a number of components for implementing the four usage models described above. The unified shadow register file 1301 includes: a first portion, the committed register file 1302; a second portion, the shadow register file 1303; and a third portion, the latest indicator array 1304. Also included are a speculative retirement memory buffer (SMB) 1342 and a latest indicator array 1341. The architecture 1300 comprises an out-of-order architecture; hence, the architecture 1300 further includes a reorder buffer and retirement window 1332. The reorder and retirement window 1332 further includes a machine retirement pointer 1331, a ready bit array 1334, and a per-instruction latest indicator, such as the indicator 1333.

The first usage model, architecture speculation for hardware state updates, is now further described in accordance with one embodiment of the present invention. As described above, the architecture 1300 comprises an out-of-order architecture. The hardware of the architecture 1300 is able to commit out-of-order instruction results (e.g., out-of-order loads, out-of-order stores, and out-of-order register updates). The architecture 1300 utilizes the unified shadow register file, in the manner described above in the discussion of Figure 12, to support speculative execution between the committed registers and the shadow registers.
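The two-bit entry-selector encoding described for Figure 12 can be sketched as a small state machine. The state transitions follow the write-back/commit/rollback rules given above, which are partly reconstructed from garbled source text, so treat them as an illustrative reading rather than the definitive encoding.

```python
# Sketch of a Figure 12 register file entry: two storage locations
# (R and R') plus two selector bits that decide which one is committed,
# which is speculative, and which is read. No value is ever copied
# between the two locations; only the selector bits change.

class RegisterEntry:
    def __init__(self, committed_value=0):
        self.r, self.r_prime = None, committed_value
        self.state = "00"                  # R invalid; R' committed

    def read(self):
        # States 00 and 10 read R'; states 01 and 11 read R.
        return self.r_prime if self.state in ("00", "10") else self.r

    def write_back(self, value):           # a speculative result arrives
        if self.state == "00":
            self.r, self.state = value, "01"
        elif self.state == "11":
            self.r_prime, self.state = value, "10"

    def commit(self):                      # speculative value becomes committed
        if self.state == "01":
            self.state = "11"
        elif self.state == "10":
            self.state = "00"

    def rollback(self):                    # discard the speculative value
        if self.state == "01":
            self.state = "00"
        elif self.state == "10":
            self.state = "11"

e = RegisterEntry(committed_value=5)
e.write_back(9)        # 00 -> 01: speculative 9 is now the value read
spec_read = e.read()
e.rollback()           # 01 -> 00: committed 5 is visible again
rolled_read = e.read()
e.write_back(9)
e.commit()             # 01 -> 11: 9 becomes the committed value
final_read = e.read()
```

The point of the encoding is visible in the sketch: rollback and commit are pure selector-bit updates, with no cross copying of register values.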
In addition, the architecture 1300 utilizes a speculative load store buffer and the speculative retirement memory buffer 1342 to support speculative execution.

The architecture 1300 will use these components in conjunction with the reorder buffer and retirement window 1332 to allow its state to retire correctly to the committed register file 1302 and to the visible memory 1350, even though the machine internally retires results to the unified shadow register file and the retirement memory buffer in an out-of-order manner. For example, the architecture will use the unified shadow register file 1301 and the speculative memory 1342 to implement rollback and commit events based upon whether exceptions occur or do not occur. This functionality enables the register state to retire out of order to the unified shadow register file 1301, and enables the speculative retirement memory buffer 1342 to retire out of order to the visible memory 1350. As speculative execution proceeds and out-of-order instruction execution proceeds, if no branch has been missed and no exception occurs, the machine retirement pointer 1331 advances until a commit event is triggered. The commit event causes the unified shadow register file to commit its contents by advancing its commit point, and causes the speculative retirement memory buffer to commit its contents to the memory 1350 in accordance with the machine retirement pointer 1331.

For example, considering the instructions 1-7 that are shown within the reorder buffer and retirement window 1332, the ready bit array 1334 shows a mark beside instructions that are ready to execute and a different mark beside instructions that are not ready to execute. Accordingly, ready instructions are allowed to proceed out of order. Subsequently, if an exception occurs, such as the instruction 6 branch being mispredicted, the instructions that occur subsequent to instruction 6 can be rolled back. Alternatively, if no exception occurs, all of the instructions 1-7 can be committed by moving the machine retirement pointer 1331 accordingly.
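The retirement behavior just described — the machine retirement pointer advances over completed instructions, a commit makes their results architectural, and a mispredicted branch rolls back everything after it — can be sketched as follows. The program-order result list and the mispredicted-branch encoding are invented for illustration.

```python
# Sketch of in-order retirement behind a machine retirement pointer.
# Results arrive in program order here for simplicity; a misprediction
# at instruction N commits results up to and including N and discards
# everything after it.

def retire(results, mispredicted_branch=None):
    """results: list of (instruction_id, value) in program order.
    Returns the committed values and the final retirement pointer."""
    committed = {}
    retirement_pointer = 0
    for insn_id, value in results:
        committed[insn_id] = value          # commit point advances
        retirement_pointer = insn_id
        if mispredicted_branch is not None and insn_id == mispredicted_branch:
            break    # roll back every instruction after the bad branch
    return committed, retirement_pointer

program = [(1, "a"), (2, "b"), (3, "c"), (4, "d"),
           (5, "e"), (6, "f"), (7, "g")]
ok, ptr_ok = retire(program)                       # no exception: all commit
rolled, ptr_rb = retire(program, mispredicted_branch=6)  # 7 is rolled back
```

With no exception the retirement pointer reaches instruction 7 and all results are committed; with the instruction-6 branch mispredicted, instruction 7's result never becomes architectural.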
The performance penalty imposed by this overhead can substantially erode any benefits provided by the translation process.

One attempt at solving this problem involves the use of just-in-time compilation. Just-in-time (JIT) compilation, also known as dynamic translation, is a method to improve the runtime performance of computer programs. Traditionally, computer programs have had two modes of runtime conversion: interpretation mode or JIT (just-in-time) compilation/translation mode. Interpretation is a decoding process that involves decoding instruction by instruction to transform the code from guest to native, with lower overhead than JIT compilation, but the transformed code it produces performs poorly. Additionally, the interpretation is invoked with every instruction. JIT compilers or translators represent a contrasting approach to interpretation. With JIT conversion, the overhead is usually higher than that of an interpreter, but it produces translated code that is more optimized and has higher execution performance. In most emulation implementations, the first time a translation is needed, it is performed as an interpretation to reduce the overhead; after the code has been seen (executed) many times, a JIT translation is invoked to create a more optimized translation.

However, the code transformation process still presents a number of problems. The JIT compilation process itself imposes a significant amount of overhead on the processor. This can cause a large delay in the startup of the application. Additionally, managing the storage of transformed code in system memory causes multiple trips back and forth to system memory, and includes memory mapping and allocation management overhead, which imposes a significant latency penalty. Furthermore, changes to the region of execution in the application involve relocating the transformed code in the system memory and the code cache, and starting the process over from the beginning. The interpretation process involves less overhead than JIT translation, but its overhead is repeated per instruction, and is thus still relatively significant. The code it produces is poorly optimized, if at all.

[Description of the Invention]

Embodiments of the present invention implement an algorithm and an apparatus for the hardware-based acceleration of the translation of guest instructions into native instructions.
In one embodiment, the present invention is implemented as a hardware-based translation accelerator. The translation accelerator includes: a guest fetch logic component for accessing a plurality of guest instructions; a guest fetch buffer, coupled to the guest fetch logic component and to a branch prediction component, for assembling the plurality of guest instructions into a guest instruction block; and a plurality of conversion tables, coupled to the guest fetch buffer, for translating the guest instruction block into a corresponding native conversion block. The hardware-based translation accelerator additionally includes: a native cache, coupled to the conversion tables, for storing the corresponding native conversion block; and a conversion look aside buffer (CLB), coupled to the native cache, for storing a mapping of the guest instruction block to the corresponding native conversion block, wherein, upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurs, wherein the mapping indicates that the guest instruction has a corresponding converted native instruction in the native cache. In response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention will become apparent in the non-limiting detailed description set forth below.

[Embodiment]

In the following detailed description, numerous specific details, such as specific method orders, structures, elements, and connections, have been set forth. It is to be understood, however, that these and other specific details need not be utilized to practice embodiments of the present invention.
In other circumstances, well-known structures, elements, or connections have been omitted, or have not been described in particular detail, in order to avoid unnecessarily obscuring this description.

References within the specification to "one embodiment" or "an embodiment" are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearance of the phrase "in one embodiment" in various places within the specification does not necessarily refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

Some portions of the detailed descriptions which follow are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals of a computer-readable storage medium, and are capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities, and are merely convenient labels applied to those quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing" or "accessing" or "writing" or "storing" or the like refer to the actions and processes of a computer system, or a similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers, memories, and other computer-readable media.
Such data may be transformed into other data similarly represented as physical quantities within the computer system memories or registers, or other such information storage, transmission, or display devices.

Embodiments of the present invention function by greatly accelerating the process of translating guest instructions from a guest instruction architecture into native instructions of a native instruction architecture, for execution on a native processor. Embodiments of the present invention utilize hardware units to implement the hardware acceleration of the conversion process. The guest instructions can be from a number of different instruction architectures. Example architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. These guest instructions are rapidly converted into native instructions and pipelined to the native processor hardware for rapid execution. This provides a much higher level of performance in comparison to traditional software-controlled conversion processes.

In one embodiment, the present invention implements a flexible conversion process that can receive a number of different instruction architectures as inputs. In such an embodiment, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware-accelerated conversion processing to deliver a much higher level of performance. Such an implementation delivers benefits on multiple fronts. Different guest architectures can be processed and converted, while each receives the benefits of the hardware acceleration to enjoy a much higher level of performance. The hardware acceleration can achieve execution of the guest instructions of a guest application at speeds approaching the native hardware speed. In the descriptions which follow, Figure 1 through Figure 4 show the manner in which embodiments of the present invention handle sequences of guest instructions, and handle near branches and far branches within those sequences. Figure 5 shows an overview of an exemplary hardware accelerated conversion processing system in accordance with one embodiment of the present invention.

Figure 1 shows an exemplary sequence of instructions operated on by one embodiment of the present invention. As depicted in Figure 1, the instruction sequence 100 comprises 16 instructions, proceeding from the top of Figure 1 to the bottom. As can be seen in Figure 1, the sequence 100 includes four branch instructions 101-104.
One objective of embodiments of the present invention is to process entire groups of instructions as single atomic units. This atomic unit is referred to as a "block". An instruction block can extend beyond the 16 instructions shown in Figure 1. In one embodiment, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like), or until an exit condition is encountered. In one embodiment, the exit condition for concluding an instruction block is the encounter of a far branch instruction. As used herein in the descriptions of the embodiments, a far branch refers to a branch instruction whose target address resides outside the current instruction block. In other words, within a given guest instruction block, a far branch has a target that resides in some other block, or in some other instruction sequence outside the given instruction block. Similarly, a near branch refers to a branch instruction whose target address resides within the current instruction block. Additionally, it should be noted that a native instruction block can contain multiple guest far branches. These terms are further explained in the discussion below.

Figure 2 shows a diagram depicting a block-based conversion process, wherein guest instruction blocks are converted into native conversion blocks, in accordance with one embodiment of the present invention. As illustrated in Figure 2, a plurality of guest instruction blocks 201 is shown being converted into a corresponding plurality of native conversion blocks 202. Embodiments of the present invention function by converting the instructions of a guest instruction block into the corresponding instructions of a native conversion block. Each of the blocks 201 is made up of guest instructions.
As mentioned above, these guest instructions can be from a number of different guest instruction architectures (e.g., Java or JavaScript, x86, MIPS, SPARC, etc.). Multiple guest instruction blocks can be converted into one or more corresponding native conversion blocks. This conversion occurs on a per-instruction basis.

Figure 2 also illustrates the manner in which guest instruction blocks are assembled into sequences based upon branch prediction. This attribute enables embodiments of the present invention to assemble sequences of guest instructions based upon the predicted outcomes of far branches. Based upon far branch prediction, a sequence of guest instructions is assembled from multiple guest instruction blocks and converted into a corresponding native conversion block. This aspect is further described in Figure 3 and Figure 4 below.

Figure 3 shows a diagram illustrating the manner in which each instruction of a guest instruction block is converted into a corresponding instruction of a native conversion block, in accordance with one embodiment of the present invention. As illustrated in Figure 3, the guest instruction block resides within a guest instruction buffer 301. Similarly, the native conversion block resides within a native instruction buffer 302.

Figure 3 shows an attribute of embodiments of the present invention, wherein the target addresses of the guest branch instructions are converted into target addresses of the native branch instructions. For example, each of the guest instruction branches includes an offset that identifies the target address of the particular branch. As the guest instructions are converted, this offset is often different, because the native instructions require differing lengths or sequences to produce the functionality of the corresponding guest instructions. For example, the guest instructions can be of different lengths in comparison to their corresponding native instructions. Hence, the conversion process compensates for this difference by computing the corresponding native offset. This is shown in Figure 3 as the native offset, or N-offset.
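The offset compensation described above can be sketched as follows. This is an illustrative aside, not the patented mechanism; the function name and the instruction lengths are hypothetical.

```python
# Illustrative sketch (not from the patent): computing the native
# offset that corresponds to each guest offset when guest and native
# instructions differ in length. All names are hypothetical.

def offset_map(pairs):
    """pairs: list of (guest_len, native_len) per converted instruction.
    Returns {guest_offset: native_offset} for each instruction start."""
    mapping, g_off, n_off = {}, 0, 0
    for guest_len, native_len in pairs:
        mapping[g_off] = n_off
        g_off += guest_len
        n_off += native_len
    return mapping

# Three guest instructions of lengths 2, 3, and 5 bytes, translated
# into native instructions of lengths 4, 4, and 8 bytes.
m = offset_map([(2, 4), (3, 4), (5, 8)])
print(m)  # {0: 0, 2: 4, 5: 8}
```

A branch whose guest target offset is 5 would thus be re-targeted to native offset 8 in the converted block.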
It should be noted that the branches that have targets residing within the guest instruction block (referred to as near branches) are not predicted, and therefore do not alter the instruction sequence flow.

Figure 4 shows a diagram illustrating the manner in which native conversion blocks are processed to handle far branches, in accordance with one embodiment of the present invention. As illustrated in Figure 4, the guest instructions are depicted as a guest instruction sequence 401 in memory. Similarly, the native instructions are depicted as a native instruction sequence 402 in memory.

In one embodiment, every instruction block (both guest instruction blocks and native instruction blocks) concludes with a far branch (e.g., even though a native block can contain more than one guest far branch). As described above, a block will include enough instructions to fill a fixed size (e.g., 64 bytes, 128 bytes, 256 bytes, or the like), or until an exit condition is encountered, such as the last guest far branch instruction. If a number of guest instructions have been processed to assemble a guest instruction block and a far branch has not been encountered, a guest far branch is inserted to conclude the block. This far branch is merely a jump to the next subsequent block. This ensures that instruction blocks conclude with a branch that leads to either another native instruction block, or another sequence of guest instructions in memory. Additionally, as shown in Figure 4, a block can include a guest far branch that does not reside at the end of its instruction sequence. This is shown by the guest instruction far branch 411 and the corresponding native instruction guest far branch 412.

In the Figure 4 embodiment, the far branch 411 is predicted taken. Thus, the instruction sequence jumps to the target of the far branch 411, which is the guest instruction F.
Similarly, in the corresponding native instructions, the far branch 412 is followed by the native instruction F. The near branches are not predicted. Thus, they do not alter the instruction sequence in the same manner as the far branches.

In this manner, embodiments of the present invention generate a trace of conversion blocks, where each block comprises a number (e.g., 3 to 4) of far branches. This trace is based upon guest far branch predictions.

In one embodiment, the far branches within the native conversion block include a guest address for the opposing address of the branch path. As described above, a sequence of instructions is generated based upon the prediction of far branches. The true outcome of the prediction will not be known until the corresponding native conversion block is executed. Thus, once a false prediction is detected, the false far branch is examined to obtain the opposing guest address for the opposing branch path. The conversion process then continues from the opposing guest address, which is now the true branch path. In this manner, embodiments of the present invention use the included opposing guest address to recover from occasions where the outcome of a far branch is predicted falsely. Hence, if a far branch prediction outcome is false, the process knows where to find the correct guest instruction. Similarly, when a far branch in a native block is taken as predicted, no new entry point for the target block needs to be inserted into the CLB. However, when a misprediction occurs, a new entry for the target block needs to be inserted. This functionality is performed with the goal of preserving CLB capacity.

Figure 5 shows a diagram of an exemplary hardware accelerated conversion system 500, illustrating the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache, in accordance with one embodiment of the present invention.
As illustrated in Figure 5, a conversion look aside buffer 506 is used to cache the address mappings between guest and native blocks, such that the most frequently encountered native conversion blocks are accessed by the processor 508 through low-latency availability.

The Figure 5 diagram illustrates the manner in which frequently encountered native conversion blocks are maintained within a high-speed, low-latency cache: the conversion look aside buffer 506. The components depicted in Figure 5 implement hardware accelerated conversion processing to deliver a much higher level of performance.

The guest fetch logic unit 502 functions as a hardware-based guest instruction fetch unit that fetches guest instructions from the system memory 501. The guest instructions of a given application reside within the system memory 501. Upon initiation of a program, the hardware-based guest fetch logic unit 502 starts prefetching guest instructions into the guest fetch buffer 503. The guest fetch buffer 503 accumulates the guest instructions and assembles them into guest instruction blocks. These guest instruction blocks are converted into corresponding native conversion blocks by using the conversion tables 504. The converted native instructions are accumulated within the native conversion buffer 505 until the native conversion block is complete. The native conversion block is then transferred to the native cache 507, and the mappings are stored in the conversion look aside buffer 506. The native cache 507 is then used to feed native instructions to the processor 508 for execution. In one embodiment, the functionality implemented by the guest fetch logic unit 502 is produced by a guest fetch logic state machine.

As this process continues, the conversion look aside buffer 506 is filled with address mappings of guest blocks to native blocks.
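The dataflow just described — accumulate, convert, cache, and record the mapping — can be sketched at a high level. This sketch is illustrative only and is not the patented hardware; the conversion function and all names are hypothetical stand-ins for the conversion tables 504, the native cache 507, and the CLB 506.

```python
# Illustrative sketch (not from the patent) of the conversion dataflow:
# guest instructions are converted block by block, the result is placed
# in a native code cache, and the guest->native mapping is recorded in
# a CLB-like dictionary. All names are hypothetical.

def convert_instruction(guest_insn):
    # Stand-in for the conversion tables (504): one guest instruction
    # becomes one or more native instructions.
    return ["native_" + guest_insn]

def convert_block(guest_block_addr, guest_block, native_cache, clb):
    native_buffer = []                       # native conversion buffer (505)
    for insn in guest_block:
        native_buffer.extend(convert_instruction(insn))
    native_cache[guest_block_addr] = native_buffer  # native cache (507)
    clb[guest_block_addr] = guest_block_addr        # CLB mapping (506)
    return native_buffer

native_cache, clb = {}, {}
convert_block(0x1000, ["add", "cmp", "jmp_far"], native_cache, clb)
print(clb[0x1000] in native_cache)  # True
```

On a subsequent request for guest address 0x1000, the CLB hit would let the processor read the converted block directly from the native cache without re-running the conversion.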
The conversion look aside buffer 506 uses one or more algorithms (e.g., least recently used, etc.) to ensure that the block mappings that are encountered more frequently are kept within the buffer, while block mappings that are rarely encountered are evicted from the buffer. In this manner, hot native conversion block mappings are stored within the conversion look aside buffer 506. In addition, it should be noted that the well-predicted far guest branches within a native block do not need to insert new mappings into the CLB, because their target blocks are stitched within a single mapped native block, thus preserving a small capacity efficiency for the CLB structure. Furthermore, in one embodiment, the CLB is structured to store only the ending guest-to-native address mappings. This aspect also preserves the small capacity efficiency of the CLB.

The guest fetch logic 502 looks to the conversion look aside buffer 506 to determine whether the addresses of a guest instruction block have already been converted into a native conversion block. As described above, embodiments of the present invention provide hardware acceleration for conversion processing. Hence, before fetching a guest address from the system memory 501 for a new conversion, the guest fetch logic 502 will look to the conversion look aside buffer 506 for a pre-existing native conversion block mapping.

In one embodiment, the conversion look aside buffer is indexed by guest address ranges, or by individual guest addresses. The guest address ranges are the ranges of addresses of the guest instruction blocks that have been converted into native conversion blocks. The native conversion block mappings stored by the conversion look aside buffer are indexed via the corresponding guest address ranges of their corresponding guest instruction blocks.
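The range-indexed, LRU-managed lookup described above can be sketched in software. This is an illustrative aside, not the hardware CLB of the patent; the class name and linear range scan are hypothetical simplifications (a real CLB would use set-associative tag compares).

```python
# Illustrative sketch (not from the patent): a CLB-like structure
# indexed by guest address ranges, evicting the least recently used
# mapping when full. All names are hypothetical.

from collections import OrderedDict

class ConversionLookasideBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (start, end) -> native block id

    def insert(self, guest_range, native_block):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)      # evict the LRU mapping
        self.entries[guest_range] = native_block

    def lookup(self, guest_addr):
        for (start, end), native_block in self.entries.items():
            if start <= guest_addr < end:
                self.entries.move_to_end((start, end))  # mark as hot
                return native_block
        return None  # miss: fetch from guest code and convert

clb = ConversionLookasideBuffer(capacity=2)
clb.insert((0x1000, 0x1080), "block_A")
clb.insert((0x2000, 0x2080), "block_B")
print(clb.lookup(0x1010))                # block_A (hit)
clb.insert((0x3000, 0x3080), "block_C")  # evicts block_B (LRU)
print(clb.lookup(0x2010))                # None (miss)
```

The `move_to_end` call models the "as blocks are touched" refresh that keeps hot mappings resident while cold mappings fall out.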
Hence, the guest fetch logic can compare the guest addresses against the guest address ranges, or the individual guest addresses, of the converted blocks stored in the conversion look aside buffer, to determine whether a pre-existing native conversion block resides within what is stored in the native cache 507 or in the code cache of Figure 6. If a pre-existing native conversion block is in either the native cache or the code cache, the corresponding converted native instructions are forwarded from those caches directly to the processor.

In this manner, hot guest instruction blocks (e.g., frequently executed guest instruction blocks) have their corresponding hot native conversion block mappings maintained within the high-speed, low-latency conversion look aside buffer 506. As blocks are touched, an appropriate replacement policy ensures that the hot block mappings remain within the conversion look aside buffer. Hence, the guest fetch logic 502 can quickly identify whether requested guest addresses have been previously converted, and can forward the previously converted native instructions directly to the native cache 507 for execution by the processor 508. These aspects save a large number of cycles, since trips to system memory can take 40 to 50 cycles or more. These attributes (e.g., the CLB, guest branch sequence prediction, the guest and native branch buffers, and the prior native caching) allow the hardware acceleration functionality of embodiments of the present invention to achieve application performance for a guest application to within 80% to 100% of the application performance of a comparable native application.

In one embodiment, the guest fetch logic 502 continually prefetches guest instructions for conversion, independent of the guest instruction requests from the processor 508. Native conversion blocks can be accumulated within a conversion buffer "code cache" in the system memory 501 for the less frequently used blocks. The conversion look aside buffer 506 also keeps the most frequently used mappings.
Thus, if a requested guest address does not map to a guest address in the conversion look aside buffer, the guest fetch logic can check the system memory 501 to determine whether the guest address corresponds to a native conversion block stored therein.

In one embodiment, the conversion look aside buffer 506 is implemented as a cache, and utilizes cache coherency protocols to maintain coherency with a much larger conversion buffer stored in the higher levels of cache and in the system memory 501. The native instruction mappings that are stored within the conversion look aside buffer 506 are also written back to the higher levels of cache and to the system memory 501. The write backs to system memory maintain coherency. Hence, cache management protocols can be used to ensure that the hot native conversion block mappings are stored within the conversion look aside buffer 506, while the cold native conversion block mappings are stored in the system memory 501. Hence, a much larger form of the conversion buffer 506 resides in the system memory 501.

It should be noted that, in one embodiment, the exemplary hardware accelerated conversion system 500 can be used to implement a number of different virtual storage schemes. For example, the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache can be used to support a virtual storage scheme. Similarly, a conversion look aside buffer 506 that is used to cache the address mappings between guest and native blocks can be used to support a virtual storage scheme (e.g., the management of virtual-to-physical memory mappings).

In one embodiment, the Figure 5 architecture implements a virtual instruction set processor/computer, using a flexible conversion process that can receive a number of different instruction architectures as inputs. In such a virtual instruction set processor, the front end of the processor is implemented such that it can be software controlled, while taking advantage of hardware accelerated conversion processing to achieve a much higher level of performance. With such an implementation, different guest architectures can be processed and converted, while each receives the benefits of the hardware acceleration and enjoys a much higher level of performance.
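The two-level arrangement described above — a small, fast CLB backed by a much larger conversion buffer in system memory — can be sketched as follows. This is an illustrative aside, not the patented coherency mechanism; the write-through policy and all names below are hypothetical simplifications of the cache-coherent write-back scheme the text describes.

```python
# Illustrative sketch (not from the patent): a small CLB backed by a
# larger conversion buffer in system memory. Mappings evicted from the
# CLB remain available in system memory, so no translation is lost.
# All names are hypothetical.

class TwoLevelMappingStore:
    def __init__(self, clb_capacity):
        self.clb_capacity = clb_capacity
        self.clb = {}            # small, fast: hot mappings
        self.system_memory = {}  # large, slow: all mappings

    def record(self, guest_addr, native_block):
        if len(self.clb) >= self.clb_capacity:
            # Evict a victim; its mapping stays coherent in system memory.
            victim, block = self.clb.popitem()
            self.system_memory[victim] = block
        self.clb[guest_addr] = native_block
        self.system_memory[guest_addr] = native_block  # write-through copy

    def lookup(self, guest_addr):
        if guest_addr in self.clb:                     # fast path
            return self.clb[guest_addr]
        return self.system_memory.get(guest_addr)      # slow path

store = TwoLevelMappingStore(clb_capacity=1)
store.record(0x1000, "block_A")
store.record(0x2000, "block_B")   # evicts 0x1000 from the CLB
print(store.lookup(0x1000))       # block_A (served from system memory)
```

The design point the patent emphasizes is that a miss in the small structure is not a re-translation: the slow path still returns a previously converted block.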
Example guest architectures include Java or JavaScript, x86, MIPS, SPARC, and the like. In one embodiment, the "guest architecture" can be native instructions (e.g., from a native application/macro-operation), and the conversion process produces optimized native instructions (e.g., optimized native instructions/micro-operations). The software-controlled front end can provide a great degree of flexibility for applications executing on the processor. As described above, the hardware acceleration can achieve execution of the guest instructions of a guest application at speeds approaching the native hardware speed.

Figure 6 shows a more detailed example of a hardware accelerated conversion system 600 in accordance with one embodiment of the present invention. The system 600 performs in substantially the same manner as the system 500 described above. However, the system 600 shows additional details that describe the functionality of an exemplary hardware acceleration process. The system memory 601 includes the data structures comprising: the guest code 602, the conversion look aside buffer 603, the optimizer code 604, the converter code 605, and the native code cache 606. The system 600 also shows a shared hardware cache 607, in which guest instructions and native instructions can both be interleaved and shared. The guest hardware cache 610 caches the most frequently touched guest instructions from the shared hardware cache 607.

The guest fetch logic 620 prefetches guest instructions from the guest code 602. The guest fetch logic 620 interfaces with a TLB 609, which functions as a translation look aside buffer that translates virtual guest addresses into corresponding physical guest addresses. The TLB 609 can forward hits directly to the guest hardware cache 610. The guest instructions fetched by the guest fetch logic 620 are stored in the guest fetch buffer 611.
The conversion tables 612 and 613 include substitute fields and control fields, and function as multilevel conversion tables for translating the guest instructions received from the guest fetch buffer 611 into native instructions.

The multiplexers 614 and 615 transfer the converted native instructions to a native conversion buffer 616. The native conversion buffer 616 accumulates the converted native instructions to assemble native conversion blocks. These native conversion blocks are then transferred to the native hardware cache 608, and the mappings are kept in the conversion look aside buffer 630.

The conversion look aside buffer 630 includes the following data structures: a guest branch address 631, a native address 632, a converted block range 633, code cache and conversion look aside buffer management bits 634, and dynamic branch bias bits 635. The guest branch address 631 and the native address 632 comprise a guest address range that indicates which corresponding native conversion blocks reside within the converted block range 633. Cache management protocols and replacement policies ensure that the hot native conversion block mappings reside within the conversion look aside buffer 630, while the cold native conversion block mappings reside within the conversion look aside buffer data structure 603 in the system memory 601.
If the requested client address does not reside in the buffer 003 or 608, or the buffer M0, the client bit is extracted from the client code 6〇2 The address and a number of subsequent client instructions, and the conversion process are implemented via conversion tables 612 and 613. Figure 7 shows an auxiliary software-based accelerated conversion pipeline (secondary soft) in accordance with an embodiment of the present invention. An example of a hardware-accelerated conversion system 7A of the ware-based accelerated conversion pipeline. Components 711-716 include a software-implemented load store P^th which is, for example, in a special high-speed memory 760. As depicted in FIG. 7, the client mentions that the buffers 7, the conversion tables 712-713, and the local conversion buffer 716 contain special idle speeds. The allocation portion of the body 760 (aii〇cate(j p0rti〇n). In many respects, the 'special high-speed memory 76' is used as a very low-order (1〇w_level) fast cache (eg, L0 cache). The first header 761 illustrates the attributes that are accelerated by the loading of the storage path as opposed to the instruction fetch path (eg, from the october logic). The high-speed memory 760 includes the 201245976 special logic for comparison. Because of this, the software can implement the conversion acceleration. For example, in another embodiment, the towel is sewed and the pipeline is w executi〇n pipdine. The standard record of the manipulation storage component 711_716 is that the software executes the comparison instruction of the value of the component 711_716 or the plurality of SIMD registers and processes the comparison between the fields in the SiMD register in the processing n execution pipeline, And perform the mask operation (mask 〇perati〇n) and the result scan operation as needed. The load storage path can be implemented using a general purpose microprocessor hardware (such as, for example, using a comparison command that compares one to many). 
It should be noted that the memory 760 is accessed by an instruction having a special attribute or address range. As in the specific embodiment, the client fetch buffer has a ro for the gUest instruction entry. An ID is created for each customer order. This ID allows easy mapping from the client buffer to the native conversion buffer. Melon allows for easy calculation of customer offset to native offset, regardless of the different lengths of the customer's native command. This is done in Figure 3 above. In a specific embodiment, the ID is calculated from the hardware using a length decoder that calculates the length of the extracted client instruction. However, in the second, the functionality can be fulfilled by hardware or software. Heart', once the ID is assigned, the native instruction buffer can be accessed via the ID. The offset conversion of the customer offset to the native offset is still allowed. FIG. 8 shows a CLB combined code 20 201245976 in accordance with an embodiment of the present invention. The CLB is used to store a client address having a corresponding converted local address (stored in the code cache). Mapping (eg, client-to-native address mapping) In a particular embodiment, clb is indexed as part of the client address. Customer address can be divided into #财!丨, 赖 (tag), and 娜f (eg, memory contribution small (10). This customer address contains a standard weaving to identify the matching of the CLB item cap. If the life is on the label, then the dragon's item The pointer irrr) will be stored, and the reference code in the code cache (four) 杂 will be mixed into the corresponding conversion memory unit (for example, the corresponding block of the native age). _ should be the idea, the term "memory block" is used herein to refer to the size of the U-memory of the native instruction block. The size of the memory block can vary depending on the size of the hard local instruction block. 
+ In the specific embodiment, 'about the code cache memory 8G6, with a set of fixed l, 6 £fe blocks (such as 'each memory block type has different size) assigned code faster in the system memory and all Lower-order hardware cache (for example, native hardware cache ^ hardware cache 6〇7) +, can be divided into several groups of code cache fast CLB can use the customer address for indexing and labeling Compare the way tag for the process "', and take the code cache chunk. 1 party ^ 8 description of the town cafe hardware cache 804 to depict the way x and mode y The two store customer address tags are stored. It should be noted that, in a specific embodiment, the indicator can be stored in the local code memory block through the i, ', and ° configuration, and the client address of the CLB node 21 201245976 a is completed. Local address mapping (eg, from client to native address mapping). Each mode is associated with a tag. CLB is indexed by client address 802 (including tags). When a lifetime is sent in CLB, the corresponding is returned. The indicator of the label. Use this "to display the memory of the index program. This is shown in Figure 8 with a line of text "Local address of the code memory block = Seg#+F(pt)", which represents the fact that the code's local address with the indicator and section The number changes. In this particular embodiment, a segment refers to the basis of the point in the memory for virtual mapping of the pointer scope (e.g., allowing the index array to be mapped to any region in the entity's memory). Or 'in a specific embodiment' can be via a line of text as shown in Figure 8 "The local address of the code memory block = seg # + index * (64 memory block size) + way # * (memory block size) The second method of display, the index code caches the memory. 
In this embodiment, the code cache can be organized into a way-structure matching CLB mode structuring so that there is a way between the CLB mode and the way the code caches the memory block. ι:1 mapping. When the life is in the specific CLB mode, the corresponding code memory block in the corresponding mode of the code cache has the local code. Still referring to Figure 8, if the index of the CLB is missed, it is possible to check if the higher level of the memory has a hit (e.g., 'L1 cache, L2 cache, etc.). If there is no life in these higher cache levels, the address in system memory 8〇1 is checked. In a specific embodiment, the 'Customer Index points to an item containing, for example, 64 memory blocks', the tags of each of the 64 memory blocks are read and compared to the customer tags to determine whether or not to be in life. This procedure is shown in Figure 8 by the dashed box 8〇5. If there is no hit after comparison with the tag in the system memory, 22 201245976 there is no conversion at any level of the memory, and the client instruction must be converted. In accordance with a specific embodiment of the present invention, the hierarchical levels of the memory of the client-to-machine instruction map are stored in a cache-like manner. This was originally from cache to cache_based _〇ry (for example, clb hardware cache, native fast ^ U and U cache, etc.). However, CLB also includes "code cache + CLB processing bit", which is used to implement client-to-native instruction mapping in system memory, and least, least (LRU) replacement management strategy. In a particular embodiment, the 'CLB management bit (e.g., (10) bit) is managed by the software. In this manner, all hierarchy levels of the memory are used to store the most recently used client-to-native instruction map. Correspondingly, this results in the same level of storage all of the most commonly encountered conversion native instructions. 
"Figure 8 also shows the dynamic branch offset value bits and/or branch history bits stored in the CLB. These dynamic branch bits are used to track the branches of the sequence for 5 and 4 families. Predicted behavior. Use these bits to track which branch predictions are most often correctly predicted, and which branch predictions are most often incorrectly predicted. CLB also stores conversion block range data. This data enables programs to make code faster. The conversion block range of the corresponding client instruction in the memory has been modified (eg, in the self-modifying code) becomes invalid. Figure 9 shows a physical storage stack cache implementation (physical storage) in accordance with an embodiment of the present invention. An exemplary flowchart of the stack cache implementation and client address to local address mapping. As depicted in Figure 9, the cache can be implemented as a physical storage stack 901. 23 201245976 Figure 9 illustrates the embodiment of the code Take the implementation of the variable fabric cache (wiabie structure cache). Depending on the requirements of different implementations, the variable structure cache can be completely hard. Implementation and control, full software implementation and control, or software forcing and control mixing with basic hardware enabled. Figure 9 illustrates a specific embodiment of managing allocation and replacement of customer-to-native address mapping and its actual physical storage towel The optimal balance between the translations of the corresponding translations is achieved. In this embodiment, this can be done by using a combination of indicators and a structure of a variable size memory block. ° Using a multi-mode tag array (ie lti_way tag array) Storing metrics for physical storage of different sized groups. 
Each time a particular storage size needs to be allocated (e.g., where the storage size corresponds to an address), a group of storage blocks each corresponding to that size is allocated accordingly. This allows an embodiment of the present invention to precisely allocate storage to store variable-size traces of instructions. Figure 9 shows how groups can be of different sizes; two exemplary group sizes are shown: a "replacement candidate for group size 4" and a "replacement candidate for group size 2". In addition to the tag that corresponds to the address, a pointer is stored in the tag array that maps the address into the physical storage address. A tag can comprise two or more sub-tags. For example, the top three entries in the tag structure 902 comprise the sub-tags A1 B1, A2 B2 C2 D2, and A3 B3, respectively, as shown. Accordingly, the tag A2 B2 C2 D2 comprises a group size of 4, while the tag A1 B1 comprises a group size of 2. The group size mask also indicates the size of the group.

The physical storage can then be managed like a stack, such that each time a new group is allocated, the new group is placed on top of the physical storage stack. Entries are invalidated by overwriting their tags, thereby recovering the allocated space.

Figure 9 also shows an extended way tag structure 903. In some circumstances, an entry in the tag structure 902 will have a corresponding entry in the extended way tag structure 903. This depends upon whether the entry and tag structure have an extended way bit set (e.g., set to one). For example, the extended way bit set to one indicates that there is a corresponding entry in the extended way tag structure. The extended way tag structure allows the processor to extend locality of reference in a different manner from the standard tag structure.
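The stack-style allocation and tag-overwrite invalidation described above can be sketched as follows. The capacity, field names, and chunk size are assumptions for illustration only; they are not the patent's data layout.

```c
#include <stdint.h>

#define STACK_CAPACITY 1024u  /* bytes of physical storage (assumed) */

unsigned stack_top = 0;       /* next free offset on the storage stack */

typedef struct {
    unsigned offset;          /* pointer into the physical storage     */
    unsigned group_size;      /* number of chunks in the group         */
    int      valid;
} TagEntry;

/* A new group is always placed on top of the physical storage stack. */
int alloc_group(TagEntry *tag, unsigned group_size_chunks,
                unsigned chunk_bytes)
{
    unsigned bytes = group_size_chunks * chunk_bytes;
    if (stack_top + bytes > STACK_CAPACITY)
        return -1;            /* would need replacement/reclamation   */
    tag->offset = stack_top;
    tag->group_size = group_size_chunks;
    tag->valid = 1;
    stack_top += bytes;
    return 0;
}

/* An entry is invalidated simply by overwriting its tag, which
 * recovers the allocated space for later reclamation. */
void invalidate(TagEntry *tag)
{
    tag->valid = 0;
}
```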
Thus, although the tag structure 902 is indexed in one manner (e.g., index (j)), the extended way tag structure 903 is indexed in a different manner (e.g., index (k)). In a typical implementation, the index (j) can cover many more entries than the index (k). This is because, in most cases, the primary tag structure 902 is much larger than the extended way tag structure 903, where, for example, (j) can cover 1024 entries (e.g., 10 bits) while (k) can cover 256 entries (e.g., 8 bits). This enables embodiments of the present invention to incorporate additional ways for matching traces that have become very hot (e.g., very frequently encountered). If a match within a hot set is not found within the standard tag structure, the extended way tag structure can be used to store additional ways for the hot traces.

It should be noted that this variable cache structure uses storage only as needed for the cached code/data stored on the stack, in contrast with a fixed structure cache in which each cache set has a fixed physical data storage. A set (e.g., a group of ways) can also be flagged with an indicator (e.g., one bit) signifying that the set is cold (e.g., that the entries of the set have not been accessed in a long while). In this case, the stack storage of those cold sets appears as bubbles within the allocated stack storage. At that time, their allocation pointers can be reclaimed for other hot sets. This process is a storage reclamation process, wherein, after a chunk has been allocated within the stack, the entire set to which the chunk belongs later becomes cold. The mechanisms and structures that facilitate this reclamation (not shown in Figure 9 so as not to clutter or obscure the aspects shown) are a cold set indicator for every set, and a reclamation process whereby the pointers for the ways of cold sets are reused for the ways of other hot sets. This allows the reclamation of those stack storage bubbles. When not in reclamation mode, a new chunk is allocated on top of the stack; when the stack has cold sets (e.g., sets whose ways/pointers have not been accessed in a long time), the reclamation action allows a new chunk allocation pointer to reuse a reclaimed pointer and its associated chunk storage within the stack. It should be noted that the Figure 9 embodiment is well suited for use with standard memory, as opposed to specialized cache memory.
This attribute comes from the fact that the variable structure cache is managed by reading pointers, reading indexes, and allocating address ranges. In this embodiment, no specialized cache-based circuit structure is required. It should be noted that, in one embodiment, the Figure 9 architecture can also be used to implement data caches and caching schemes that do not involve conversion or code transformation. Accordingly, the Figure 9 architecture can be used to implement more standardized caches, providing an increased effective capacity in comparison to a conventional fixed structure cache.

Figure 10 shows a diagram depicting additional exemplary details of a hardware accelerated conversion system 1000 in accordance with one embodiment of the present invention. The line 1001 illustrates the manner in which incoming guest instructions are compared against a plurality of group masks and tags. The group masks and tags function by matching subfields of a guest instruction in order to identify the particular group to which the guest instruction belongs. The mask obscures the irrelevant bits of the guest instruction pattern in order to look specifically at the relevant bits. The tables, such as the table 1002, store the mask-tag pairs in a prioritized manner (e.g., depicted here as from top to bottom). A pattern is matched by reading in the priority direction. In this manner, the different masks are checked in accordance with the priority of the mask-tag storage order, and the pattern matching functionality is correspondingly applied in that priority order. Upon a hit, the corresponding mapping of the pattern is read from the table that stores the mappings (e.g., the table 1003). The second level tables 1004 illustrate the hierarchical manner in which multiple conversion tables can be accessed in a cascading sequential manner until a full conversion of the guest instruction is achieved. As described above, the conversion tables include substitute fields and control fields, and function as multilevel conversion tables for translating the guest instructions received from the guest fetch buffer into native instructions. In this manner, each byte stream in the buffer is sent to the conversion tables, where each level of conversion table serially detects bit fields. As the relevant bit fields are detected, the table substitutes the native equivalents of those fields. The table also produces a control field that helps the substitution process for this level as well as for the next level table (e.g., the second level table 1004). The next table uses the previous table's control field to identify the next relevant bit field, which is substituted with the native equivalent field. The second level table can then produce a control field to help the first level table, and so on. Once all guest bit fields have been substituted with native bit fields, the instruction is fully translated and is transmitted to the native conversion buffer.
The native conversion buffer is then written into the code cache, and its guest-to-native address mappings are logged in the CLB, as described above.

Figure 11A shows a diagram of an exemplary pattern matching process implemented by embodiments of the present invention. As depicted in Figure 11A, the destination is determined by the tag, the pattern, and the mask. The functionality of the pattern decoding comprises performing a bit compare (e.g., bitwise XOR), performing a bit AND (e.g., bitwise AND), and subsequently checking all zero bits (e.g., NOR of all bits).

Figure 11B shows a diagram of a SIMD register based pattern matching process in accordance with one embodiment of the present invention. As depicted in Figure 11B, four SIMD registers 1102-1105 are shown. These registers implement the functionality of the pattern decoding process as shown. An incoming pattern 1101 is used to perform a parallel bit compare (e.g., bitwise XOR) on each of the tags, and the result performs a bitwise AND with the mask. The match indicator results are each stored in their respective SIMD locations as shown. A scan is then performed as shown, and the first true encountered by the scan among the SIMD elements is the element for which the equation (Pi XOR Ti) AND Mi = 0 holds for all i bits, where Pi is the respective pattern, Ti is the respective tag, and Mi is the respective mask.

Figure 12 shows a diagram of a unified register file 1201 in accordance with one embodiment of the present invention. As depicted in Figure 12, the unified register file 1201 includes two portions 1202-1203 and an entry selector 1205. The unified register file 1201 implements support for architecture speculation for hardware state updates, and enables the implementation of an optimized shadow register and committed register state management process.
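The pattern decode of Figures 11A and 11B reduces to testing (P XOR T) AND M = 0, with the candidates scanned in priority order. A minimal scalar sketch follows; the table entries stand in for the four SIMD lanes, and the names are illustrative assumptions.

```c
#include <stdint.h>

/* Pattern decode from Figure 11A: bitwise XOR of pattern and tag,
 * bitwise AND with the mask, then check that all result bits are
 * zero (the NOR-of-all-bits step). */
int pattern_matches(uint32_t pattern, uint32_t tag, uint32_t mask)
{
    return ((pattern ^ tag) & mask) == 0;
}

/* Figure 11B style scan: compare the pattern against every lane and
 * return the index of the first true match, or -1 if none matches.
 * The scan order gives the priority of the mask-tag storage. */
int first_match(uint32_t pattern, const uint32_t *tags,
                const uint32_t *masks, int n)
{
    for (int i = 0; i < n; i++)
        if (pattern_matches(pattern, tags[i], masks[i]))
            return i;
    return -1;
}
```

A mask bit of one marks a relevant bit; masked-off bits of the guest pattern never affect the match.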
This process supports architecture speculation without requiring any cross copying of register memory between the shadow register functionality and the committed register functionality. For example, in one embodiment, the functionality of the unified register file 1201 is largely provided by the entry selector 1205. In the Figure 12 embodiment, each register file entry is composed from two register pairs, R and R', which come from portion 1 and portion 2, respectively. At any given time, the register that is read from each entry is either R from portion 1 or R' from portion 2. There are four different combinations for each entry of the register file, based on the values of the x and y bits stored by the entry selector 1205 for each entry.

The values of the x and y bits are as follows: 00, R is not valid; 01, R is speculative; 10, R is committed; 11, R is committed, with the entry selector determining, according to these bits, which register is returned upon a read request. Each instruction/event has an impact on these bits. For example, upon a commit, 01 becomes 11 and 10 becomes 11; upon a rollback event, 01 becomes 00, discarding the speculative value.

In this manner, as events occur, simply changing the bit transitions stored in the register file entry selector 1205 allows execution to continue in the shadow register state without destroying the committed register state. When the shadow register state is ready for committing, the register file entry selector is updated so that the valid results are read from the appropriate portion in the manner described above. In the case of an exception, execution can roll back to the most recent commit point; similarly, the commit point can advance forward, thereby committing the speculative execution results, merely by updating the register file entry selectors. This functionality is provided without requiring any cross copying between register memories. In this manner, the unified register file can implement a plurality of speculative scratch shadow registers (SSSR) and a plurality of committed registers (CR) via the register file entry selector 1205. For example, on a commit, the SSSR registers become the CR registers.
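The selector-only commit and rollback described above can be sketched as follows. Because the exact x/y bit encodings are only partially recoverable here, this sketch uses an assumed two-field selector per entry; it illustrates only the key property that commit and rollback flip selector state without ever copying register values between the two portions.

```c
#include <stdint.h>

typedef struct {
    uint64_t r[2];    /* portion 1 and portion 2 (R and R')            */
    int committed;    /* which portion currently holds the CR value    */
    int speculative;  /* 1 if the other portion holds an SSSR value    */
} RegEntry;

/* A speculative write lands in the non-committed portion. */
void write_speculative(RegEntry *e, uint64_t v)
{
    e->r[1 - e->committed] = v;
    e->speculative = 1;
}

/* A read returns the latest valid value per the selector bits. */
uint64_t read_latest(const RegEntry *e)
{
    return e->speculative ? e->r[1 - e->committed] : e->r[e->committed];
}

/* Commit: the SSSR value becomes the CR value by flipping the
 * selector; no register contents are copied. */
void commit(RegEntry *e)
{
    if (e->speculative) {
        e->committed = 1 - e->committed;
        e->speculative = 0;
    }
}

/* Rollback: the speculative value is discarded; CR is untouched. */
void rollback(RegEntry *e)
{
    e->speculative = 0;
}
```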
Upon a rollback, the SSSR state is rolled back to the CR registers.

Figure 13 shows a diagram of a unified shadow register file and pipeline architecture 1300 that supports speculative architectural states and transient architectural states, in accordance with one embodiment of the present invention.

The Figure 13 embodiment depicts the components comprising the architecture 1300, which supports instructions and results comprising architecture speculation states and supports instructions and results comprising transient states. As used herein, a committed architecture state comprises visible registers and visible memory that can be accessed (e.g., read and written) by programs executing on the processor. In contrast, a speculative architecture state comprises registers and/or memory that is not committed and therefore not globally visible.

In one embodiment, there are four usage models that are enabled by the architecture 1300. The first usage model includes architecture speculation for hardware state updates, as described above in the discussion of Figure 12.

The second usage model includes dual scope usage. This usage model applies to the fetching of 2 threads into the processor, where one thread executes in a speculative state and the other thread executes in the non-speculative state. In this usage model, both scopes are fetched into the machine and are present in the machine at the same time.

The third usage model includes the JIT (just-in-time) translation or compilation of instructions from one form to another. In this usage model, the reordering of architectural states is accomplished via software, for example, the JIT. The third usage model can apply to, for example, guest-to-native instruction translation, virtual machine to native instruction translation, or the remapping/translating of native microinstructions into more optimized native microinstructions.

The fourth usage model includes transient context switching without the need to save and restore a prior context upon returning from the transient context.
This usage model applies to context switches that may occur for a number of reasons. One such reason can be, for example, the precise handling of exceptions via an exception handling context. The second, third, and fourth usage models are further described in the discussions of Figures 14-17 below.

Referring again to Figure 13, the architecture 1300 includes a number of components for implementing the four usage models described above. The unified shadow register file 1301 includes a first portion, the committed register file 1302; a second portion, the shadow register file 1303; and a third portion, the latest indicator array 1304. A speculative retirement memory buffer (SMB) 1342 and a latest indicator array 1341 are also included. The architecture 1300 comprises an out of order architecture; hence, the architecture 1300 further includes a reorder buffer and retirement window 1332. The reorder buffer and retirement window 1332 further includes a machine retirement pointer 1331, a ready bit array 1334, and a per-instruction latest indicator, such as the indicator 1333.

The first usage model, architecture speculation for hardware state updates, is now further described in accordance with one embodiment of the present invention. As described above, the architecture 1300 comprises an out of order architecture. The hardware of the architecture 1300 is able to commit out of order instruction results (e.g., out of order loads, out of order stores, and out of order register updates). The architecture 1300 utilizes the unified shadow register file, in the manner described in the discussion of Figure 12, to support speculative execution between committed registers and shadow registers. Additionally, the architecture 1300 utilizes the speculative load store buffer (e.g., the speculative retirement memory buffer 1342) to support speculative execution.
The architecture 1300 uses these components in conjunction with the reorder buffer and retirement window 1332 to allow its state to retire correctly to the committed register file 1302 and to the visible memory, even though the machine internally retires them in an out of order manner to the unified shadow register file and the retirement memory buffer. For example, the architecture uses the unified shadow register file 1301 and the speculative memory 1342 to implement rollback and commit events, based upon whether exceptions occur or do not occur. This functionality enables the register state to retire out of order to the unified shadow register file 1301, and enables the speculative retirement memory buffer 1342 to retire out of order to the visible memory. As speculative execution proceeds and out of order instruction execution proceeds, if no branch has been missed and no exception has occurred, the machine retirement pointer 1331 advances until a commit event is triggered. The commit event causes the unified shadow register file to commit its contents by advancing its commit point, and causes the speculative retirement memory buffer to commit its contents to memory in accordance with the machine retirement pointer 1331.

For example, considering the instructions shown within the reorder buffer and retirement window 1332, the ready bit array 1334 indicates which instructions are ready to execute and which are not, and ready instructions are accordingly allowed to proceed out of order. Subsequently, if an exception occurs, such as the branch at instruction 6 being mispredicted, the instructions that occur subsequent to instruction 6 can be rolled back.

Alternatively, if no exceptions occur, all of the instructions can be committed by moving the machine retirement pointer 1331 accordingly.

The latest indicator array 1341, the latest indicator array 1304, and the latest indicator 1333 are used to allow out of order execution. For example, even though instruction 2 loads register R4 before instruction 5 does, the load from instruction 2 will be ignored once the load from instruction 5 is ready to occur. The latest load will supersede the earlier load in accordance with the latest indicator.

In the event of a branch prediction miss or an exception occurring within the reorder buffer and retirement window 1332, a rollback event is triggered. As described above, upon a rollback, the unified shadow register file 1301 rolls back to its last commit point, and the speculative retirement memory buffer 1342 is flushed.

Figure 14 shows a diagram 1400 of the second usage model, including dual scope usage, in accordance with one embodiment of the present invention. As described above, this usage model applies to the fetching of 2 threads into the processor, where one thread executes in a speculative state and the other thread executes in the non-speculative state. In this usage model, both scopes are fetched into the machine and are present in the machine at the same time. As shown in the diagram 1400, 2 scopes/traces 1401 and 1402 have been fetched into the machine: the scope/trace 1401 is the current, non-speculative scope/trace, and the scope/trace 1402 is the new, speculative scope/trace. The architecture 1300 enables a speculative state and a scratch state that allow the two threads to use those states for execution.

One scope/trace executes in the CR/CM mode, and the other executes in the SR/SM mode. In the CR/CM mode, committed registers are read and written, and memory writes go to memory. In the SR/SM mode, register writes go to the SSSR, register reads come from the latest writes, and memory writes go to the speculative retirement memory buffer (SMB).

One example would be an ordered current scope (e.g., 1401) and a speculative next scope (e.g., 1402). Since the next scope is fetched after the current scope, the dependencies will be honored, and thus both scopes can execute in the machine.
【圖式簡單說明】 本發明藉由舉例而非限制,以附圖的各個圖式進行解說, 圖中相似參考數字代表相似元件。 圖1顯不本發明之一具體實施例所操作的例示性指令序列。 圖2顯不根據本發明之一具體實施例描繪基於區塊的轉譯 程序的示意圖,其中將客戶指令區塊轉換為本機轉換區塊。 圖3顯示根據本發明之一具體實施例圖解將客戶指令區塊 的各個指令轉換為本機轉換區塊之職的本機指令之方式的示 意圖。 圖4顯示根據本發明之一具體實施例圖解以處置本機轉換 區塊處理較遠分支之方式的示意圖。 圖5顯示根據本發明之一具體實施例之例示性硬體加速轉 換系統的不意圖’其圖解將客戶指令區塊及其對應的本機轉換 區塊儲存在快取中的方式。 圖6顯示根據本發明之一具體實施例之硬體加速轉換系統 的詳細實例。 μ圖7顯不根據本發明之一具體實施例具有輔助軟體加速轉 換官線之硬體加速轉換系統的實例。 圖8顯示根據本發明之一具體實施例圖解CLB結合程式碼 快取及儲存於記憶體中的客戶指令至本機指令映射而發揮功能 之方式的例示性流程圖。 38 201245976 圖9顯示根據本發明之一具體實施例圖解實體儲存堆疊程 式碼快取實施方案及客戶指令至本機指令映射的例示性流程 圖。 圖10顯示根據本發明之一具體實施例描繪硬體加速轉換系 統之額外例示性細節的示意圖。 圖11A顯示本發明之具體實施例所實施之例示性樣式匹配 程序的示意圖。 圖11B顯示根據本發明之一具體實施例之基於SIMD暫存 器之樣式匹配程序的示意圖。 圖12齡根據本發明之—具體實施例之統―暫存器槽案 圖式。 、 一具體實施例包括雙範_使用之第 圖Μ顯示根據本發明之 二使用模型的示意圖。 —具體實施例之第三使用模型的示 ’在從暫態環境返回後,不必存標 圖15顯示根據本發明之 意圖’其包括暫態環境切換 及復原先前環境。 外是因為後續程式% _具體貫關騎指令序列中的例 圖之案例的示意圖° 意圖,其包括暫態環&amp; #具體貫施例之第四使用模型的示 及復原先前環境Γ 、’在從㈣環境返回後,不必存稽 圖18顯示根據本發 一— 一 線的圖式。 一具體實施例之例示性微處理器管 39 201245976 【主要元件符號說明】 100 指令序列 101〜104 分支指令 201 客戶指令區塊 202 本機轉換區塊 301 客戶指令緩衝器 302 本機指令緩衝器 401 記憶體中的客戶指令序列 402 記憶體中的本機指令序列 411 客戶指令較遠分支 412 對應的本機指令客戶較遠分支 500 硬體加速轉換系統 501 糸統記憶體 502 客戶提取邏輯單元 503 客戶提取緩衝器 504 轉換表 505 本機轉換緩衝器 506 轉換後備緩衝器 507 本機快取 508 處理器 600 硬體加速轉換系統 601 糸統記憶體 602 客戶碼 603 轉換後備緩衝器 604 最佳化器碼 201245976 605 606 607 608 609 610 611 612 613 614 615 616 620 630 631 632 633 634 635 640 650 700 711 712 713 716The mode is executed, and the other is in the SR/SM 34 201245976 Recalling the second (10) mode, reading and writing the approval register, and recording S; SR: memory. In SR/SM mode, the scratchpad writes into the punch (Lion 7 reads from the latest writes) and the memory writes back to the memory. The instance will be sorted by the current age (such as 'just) and pushed The post is ridiculous (eg, 1&gt;4〇2). Because the next category is extracted after the current category, 22: the text is attached to the importance of 'so both categories can be implemented in the machine. 
For example, in the scope 1401, at "commit SSSR to CR", the registers and memory up to this point are in the CR mode, while the code executes in the CR/CM mode. In the scope 1402, the code executes in the SR and SM modes, and can be rolled back if an exception occurs. In this manner, the two scopes execute simultaneously in the machine, but each executes in a different mode and reads and writes registers accordingly.

Figure 15 shows a diagram 1500 of the third usage model, in accordance with one embodiment of the present invention, which includes transient context switching without the need to save and restore a prior context upon returning from the transient context. As described above, this usage model applies to context switches that may occur for a number of reasons. One such reason can be, for example, the precise handling of exceptions via an exception handling context.

The third usage model occurs when the machine is executing translated code and it encounters a context switch (e.g., an exception inside the translated code, or when a translation of subsequent code is needed). In the current scope (e.g., prior to the exception), the SSSR and the SMB have not yet committed their speculative state to the guest architecture state. The current state is running in the SR/SM mode. When the exception occurs, the machine switches to an exception handler (e.g., a convertor) to take care of the exception precisely. A rollback is inserted, which causes the register state to roll back to CR, and the SMB is flushed. The convertor code will run in the SR/CM mode. During execution of the convertor code, the SMB retires its context to memory without waiting for a commit event, and the registers are written to the SSSR without updating CR. Subsequently, when the convertor is finished, and before switching back to executing the converted code, the convertor rolls back the SSSR (e.g., the SSSR is rolled back to CR). During this process, the last committed register state is in CR.
This is shown in the diagram 1500, where the previous scope/trace 1501 has committed from the SSSR into CR. The current scope/trace 1502 is speculative: its registers and memory are speculative, and execution occurs under the SR/SM mode. In this example, an exception occurs in the scope 1502, and the code needs to be re-executed in the original order, prior to translation. At this point, the SSSR is rolled back and the SMB is flushed. The JIT code 1503 then executes. The JIT code rolls the SSSR back to the end of the previous scope and flushes the SMB. Execution of the JIT is under the SR/CM mode. When the JIT finishes, the SSSR is rolled back to CR, and the current scope/trace 1504 then re-executes, in the original translation order, in the CR/CM mode. In this manner, the exception is handled precisely at the exact current order.

Figure 16 shows a diagram 1600 depicting a case where the exception in an instruction sequence arises because a translation for subsequent code is needed, in accordance with one embodiment of the present invention. As shown in the diagram 1600, the previous scope/trace 1601 concludes with a far jump to a destination that is not translated. Before jumping to the far jump destination, the SSSR is committed to CR. The JIT code 1602 then executes to translate the guest instructions at the far jump destination (e.g., to build a new trace of native instructions). Execution of the JIT is under the SR/CM mode. At the conclusion of the JIT execution, the register state is rolled back from the SSSR to CR, and the new scope/trace 1603 that was translated by the JIT begins execution. The new scope/trace continues executing, in the SR/CM mode, from the last commit point of the previous scope/trace 1601.

Figure 17 shows a diagram 1700 of the fourth usage model, in accordance with one embodiment of the present invention, which includes transient context switching without the need to save and restore a prior context upon returning from the transient context. As described above, this usage model applies to context switches that may occur for a number of reasons. One such reason can be, for example, the processing of inputs or outputs via an exception handling context.
The diagram 1700 shows a case where the previous scope/trace 1701, executing in the CR/CM mode, concludes with a call of function F1. The register state up to that point is committed from the SSSR to CR. The function F1 scope/trace 1702 then begins executing speculatively in the SR/CM mode. The function F1 then concludes with a return to the main scope/trace 1703. At this point, the register state is rolled back from the SSSR to CR. The main scope/trace 1703 resumes executing in the CR/CM mode.

Figure 18 shows a diagram of an exemplary microprocessor pipeline 1800 in accordance with one embodiment of the present invention. The microprocessor pipeline 1800 includes a hardware conversion accelerator that implements the functionality of the hardware acceleration conversion process described above. In the Figure 18 embodiment, the hardware conversion accelerator is coupled to a fetch module 1801, which is followed by a decode module 1802, an allocation module 1803, a dispatch module 1804, an execution module 1805, and a retirement module 1806. It should be noted that the microprocessor pipeline 1800 is just one example of a pipeline that implements the functionality of embodiments of the present invention described above. Those skilled in the art will recognize that other microprocessor pipelines can be implemented that include the functionality of the decode module described above.

For purposes of explanation, the foregoing description has been made with reference to specific embodiments. However, the discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed; many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention and its various embodiments with various modifications as may be suited to the particular use contemplated.
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.

Figure 1 shows an exemplary sequence of instructions operated on by one embodiment of the present invention.

Figure 2 shows a diagram depicting a block-based translation process, where guest instruction blocks are converted to native conversion blocks, in accordance with one embodiment of the present invention.

Figure 3 shows a diagram illustrating the manner in which each instruction of a guest instruction block is converted to a corresponding native instruction of a native conversion block, in accordance with one embodiment of the present invention.

Figure 4 shows a diagram illustrating the manner in which far branches are processed with the handling of native conversion blocks, in accordance with one embodiment of the present invention.

Figure 5 shows a diagram of an exemplary hardware accelerated conversion system illustrating the manner in which guest instruction blocks and their corresponding native conversion blocks are stored within a cache, in accordance with one embodiment of the present invention.

Figure 6 shows a detailed example of a hardware accelerated conversion system in accordance with one embodiment of the present invention.

Figure 7 shows an example of a hardware accelerated conversion system having a secondary software-based accelerated conversion pipeline, in accordance with one embodiment of the present invention.

Figure 8 shows an exemplary flow diagram illustrating the manner in which the CLB functions in conjunction with the code cache and the guest-to-native instruction mappings stored within memory, in accordance with one embodiment of the present invention.

Figure 9 shows an exemplary flow diagram illustrating a physical storage stack code cache implementation and the guest-to-native instruction mappings, in accordance with one embodiment of the present invention.
Figure 10 shows a diagram depicting additional exemplary details of a hardware accelerated conversion system in accordance with one embodiment of the present invention.

Figure 11A shows a diagram of an exemplary pattern matching process implemented by embodiments of the present invention.

Figure 11B shows a diagram of a SIMD register based pattern matching process in accordance with one embodiment of the present invention.

Figure 12 shows a diagram of a unified register file in accordance with one embodiment of the present invention.

Figure 13 shows a diagram of a unified shadow register file and pipeline architecture that supports speculative architectural states and transient architectural states, in accordance with one embodiment of the present invention.

Figure 14 shows a diagram of the second usage model, including dual scope usage, in accordance with one embodiment of the present invention.

Figure 15 shows a diagram of the third usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context, in accordance with one embodiment of the present invention.

Figure 16 shows a diagram depicting a case where the exception in an instruction sequence arises because a translation for subsequent code is needed, in accordance with one embodiment of the present invention.

Figure 17 shows a diagram of the fourth usage model, including transient context switching without the need to save and restore a prior context upon returning from the transient context, in accordance with one embodiment of the present invention.

Figure 18 shows a diagram of an exemplary microprocessor pipeline in accordance with one embodiment of the present invention.
[Main element symbol description]
100 instruction sequence
101~104 branch instructions
201 guest instruction block
202 native conversion block
301 guest instruction buffer
302 native instruction buffer
401 guest instruction sequence in memory
402 native instruction sequence in memory
411 guest instruction far branch
412 corresponding native instruction guest far branch
500 hardware accelerated conversion system
501 system memory
502 guest fetch logic unit
503 guest fetch buffer
504 conversion tables
505 native conversion buffer
506 conversion look aside buffer
507 native cache
508 processor
600 hardware accelerated conversion system
601 system memory
602 guest code
603 conversion look aside buffer
604 optimizer code

605 convertor code
606 native code cache
607 shared hardware cache
608 buffer
609 TLB
610 guest hardware cache
611 guest fetch buffer
612 conversion table
613 conversion table
614 multiplexer
615 multiplexer
616 native conversion buffer
620 guest fetch logic
630 conversion look aside buffer
631 converted block entry point address
632 native address
633 converted address range
634 code cache and conversion look aside buffer management bits
635 dynamic branch bias bits
640 fetch logic
650 processor
700 hardware accelerated conversion system
711 guest fetch buffer
712 conversion table
713 conversion table
716 native conversion buffer
760 special high-speed memory
801 system memory
802 guest address
804 CLB hardware cache
805 dashed box
806 code cache memory
901 physical storage stack
902 tag structure
903 extended way tag structure
1000 hardware accelerated conversion system
1001 line
1002 table
1003 table
1004 second level tables
1100 diagram
1101 incoming pattern
1102~1105 SIMD registers
1201 unified register file
1202-1203 portions
1205 entry selector
1300 unified shadow register file and pipeline architecture
1301 Unified Shadow Register File 1302 Approval Scratchpad file 1303 Shadow register file 1304 Latest indicator array 1331 Machine retraction indicator 42 201245976 1332 Reorder buffer and retract window 1333 Latest indicator 1334 Ready bit array 1341 Latest indicator array 1342 Presumably recall memory buffer 1350 Visible memory 1400 Unintentional @ 1401 Current non-speculative category / Persecution 1402 New speculative category / Obstruction 1500 Not intended 1501 Previous category / Obstruction 1502 Current Fan Threat / Trajectory 1503 JIT Code 1504 Current Fan Ming /Track 1600 Schematic 1601 Previous Category / Execution 1602 JIT Code 1603 New Category / Track 1700 Schematic 1701 Previous Category / Track 1702 Function F1 Fan Hip / Obstruction 1703 Main Category / Execution 1800 Microprocessor Pipeline 1801 extraction module 1802 decoding module 1803 distribution module 43 201245976 1804 dispatch module 1805 execution module 1806 withdrawal module 1807 unified shadow register file 1808 speculative memory buffer 1809 global visible memory / cache 44

Claims (1)

VII. Claims:

1. A hardware based translation accelerator, comprising:
a guest fetch logic component for accessing a plurality of guest instructions;
a guest fetch buffer, coupled to the guest fetch logic component and a branch prediction component, for assembling the plurality of guest instructions into a guest instruction block;
a plurality of conversion tables, coupled to the guest fetch buffer, for translating the guest instruction block into a corresponding native conversion block;
a native cache, coupled to the conversion tables, for storing the corresponding native conversion block; and
a conversion look aside buffer, coupled to the native cache, for storing a mapping of the guest instruction block to the corresponding native conversion block;
wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates that the guest instruction has a corresponding converted native instruction in the native cache; and
in response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

2. The hardware based translation accelerator of claim 1, wherein the guest fetch logic component accesses the plurality of guest instructions independent of a processor.

3. The hardware based translation accelerator of claim 1, wherein the conversion look aside buffer uses a replacement policy to maintain a cache of the most frequently encountered native conversion blocks stored therein.

4. The hardware based translation accelerator of claim 1, wherein a conversion buffer is maintained in system memory, and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.

5. The hardware based translation accelerator of claim 4, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.

6. The hardware based translation accelerator of claim 1, wherein the conversion look aside buffer is implemented as a high-speed low-latency cache memory coupled to a pipeline of a processor.

7. A system for a processor for accelerating translation of guest instructions into native instructions, the system comprising:
a guest fetch logic component for accessing a plurality of guest instructions;
a guest fetch buffer, coupled to the guest fetch logic component and a branch prediction component, for assembling the plurality of guest instructions into a guest instruction block;
a plurality of conversion tables, coupled to the guest fetch buffer, for translating the guest instruction block into a corresponding native conversion block;
a native cache, coupled to the conversion tables, for storing the corresponding native conversion block; and
a conversion look aside buffer, coupled to the native cache, for storing a mapping of the guest instruction block to the corresponding native conversion block;
wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates that the guest instruction has a corresponding converted native instruction in the native cache; and
in response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

8. The system of claim 7, wherein the guest fetch logic component accesses the plurality of guest instructions independent of the processor.

9. The system of claim 7, wherein the conversion look aside buffer uses a replacement policy to maintain a cache of the most frequently encountered native conversion blocks stored therein.

10. The system of claim 7, wherein a conversion buffer is maintained in system memory, and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.

11. The system of claim 10, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.

12. The system of claim 7, wherein the conversion look aside buffer is implemented as a high-speed low-latency cache memory coupled to a pipeline of the processor.

13. A microprocessor that implements a method of translating instructions, the microprocessor comprising:
a microprocessor pipeline; and
a hardware accelerator module coupled to the microprocessor pipeline, wherein the hardware accelerator module further comprises:
a guest fetch logic component for accessing a plurality of guest instructions;
a guest fetch buffer, coupled to the guest fetch logic component and a branch prediction component, for assembling the plurality of guest instructions into a guest instruction block;
a plurality of conversion tables, coupled to the guest fetch buffer, for translating the guest instruction block into a corresponding native conversion block;
a native cache, coupled to the conversion tables, for storing the corresponding native conversion block; and
a conversion look aside buffer, coupled to the native cache, for storing a mapping of the guest instruction block to the corresponding native conversion block;
wherein upon a subsequent request for a guest instruction, the conversion look aside buffer is indexed to determine whether a hit occurred, wherein the mapping indicates that the guest instruction has a corresponding converted native instruction in the native cache; and
in response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

14. The microprocessor of claim 13, wherein the guest fetch logic component accesses the plurality of guest instructions independent of the microprocessor.

15. The microprocessor of claim 13, wherein the conversion look aside buffer uses a replacement policy to maintain a cache of the most frequently encountered native conversion blocks stored therein.

16. The microprocessor of claim 13, wherein a conversion buffer is maintained in system memory, and cache coherency is maintained between the conversion look aside buffer and the conversion buffer.

17. The microprocessor of claim 16, wherein the conversion buffer is larger than the conversion look aside buffer, and a write back policy is used to maintain coherency between the conversion buffer and the conversion look aside buffer.

18. The microprocessor of claim 13, wherein the conversion look aside buffer is implemented as a high-speed low-latency cache memory coupled to the microprocessor pipeline.

19. The microprocessor of claim 13, wherein the hardware accelerator module is capable of functioning as a parallel guest instruction fetch pipeline that functions in parallel with a native microprocessor fetch pipeline.

20. A microprocessor that implements a method of translating instructions, the microprocessor comprising:
a microprocessor pipeline; and
an accelerator module comprising high-speed memory coupled to the microprocessor pipeline, wherein the accelerator module further comprises:
a guest fetch logic for accessing a plurality of guest instructions;
a guest fetch memory, coupled to the guest fetch logic, for assembling the plurality of guest instructions into a guest instruction block;
a plurality of conversion tables for translating the guest instruction block into a corresponding native conversion block; and
a native conversion buffer for storing the corresponding native conversion block;
wherein upon a subsequent request for a guest instruction, a conversion look aside buffer that stores a mapping of the guest instruction block to the corresponding native conversion block is indexed to determine whether a hit occurred, wherein the mapping indicates that the guest instruction has a corresponding converted native instruction in the native conversion buffer; and
in response to the hit, the conversion look aside buffer forwards the translated native instruction for execution.

21. The microprocessor of claim 20, wherein the high-speed memory comprises an L0 cache of the microprocessor.

22. The microprocessor of claim 20, wherein the accelerator module further comprises a load store instruction fetch path of the microprocessor.
TW101102835A 2011-01-27 2012-01-30 Hardware acceleration components for translating guest instructions to native instructions TWI512498B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US201161436966P 2011-01-27 2011-01-27

Publications (2)

Publication Number Publication Date
TW201245976A true TW201245976A (en) 2012-11-16
TWI512498B TWI512498B (en) 2015-12-11

Family

ID=46581405

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101102835A TWI512498B (en) 2011-01-27 2012-01-30 Hardware acceleration components for translating guest instructions to native instructions

Country Status (3)

Country Link
US (5) US9733942B2 (en)
TW (1) TWI512498B (en)
WO (1) WO2012103359A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105659285A (en) * 2013-08-08 2016-06-08 Arm有限公司 Data processing systems
CN107077368A (en) * 2014-07-25 2017-08-18 英特尔公司 The system of framework during unknowable for instruction set operation
CN107077369A (en) * 2014-07-25 2017-08-18 英特尔公司 Framework during using multiple conversion tables to realize operation that instruction set is unknowable
US11354242B2 (en) 2017-05-05 2022-06-07 Samsung Electronics Co., Ltd. Efficient early ordering mechanism

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012103359A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Hardware acceleration components for translating guest instructions to native instructions
WO2012103367A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Guest to native block address mappings and management of native code storage
WO2012103245A2 (en) 2011-01-27 2012-08-02 Soft Machines Inc. Guest instruction block with near branching and far branching sequence construction to native instruction block
WO2012103253A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Multilevel conversion table cache for translating guest instructions to native instructions
KR101612594B1 (en) 2011-01-27 2016-04-14 소프트 머신즈, 인크. Guest instruction to native instruction range based mapping using a conversion look aside buffer of a processor
WO2012103373A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Variable caching structure for managing physical storage
US9171178B1 (en) * 2012-05-14 2015-10-27 Symantec Corporation Systems and methods for optimizing security controls for virtual data centers
CN107220032B (en) 2012-06-15 2020-12-15 英特尔公司 Disambiguation-free out-of-order load store queue
KR20170102576A (en) 2012-06-15 2017-09-11 인텔 코포레이션 A virtual load store queue having a dynamic dispatch window with a distributed structure
EP2862068B1 (en) 2012-06-15 2022-07-06 Intel Corporation Reordered speculative instruction sequences with a disambiguation-free out of order load store queue
WO2013188696A2 (en) 2012-06-15 2013-12-19 Soft Machines, Inc. An instruction definition to implement load store reordering and optimization
KR101667167B1 (en) 2012-06-15 2016-10-17 소프트 머신즈, 인크. A method and system for implementing recovery from speculative forwarding miss-predictions/errors resulting from load store reordering and optimization
CN107748673B (en) 2012-06-15 2022-03-25 英特尔公司 Processor and system including virtual load store queue
WO2014151652A1 (en) 2013-03-15 2014-09-25 Soft Machines Inc Method and apparatus to allow early dependency resolution and data forwarding in a microprocessor
CN109358948B (en) 2013-03-15 2022-03-25 英特尔公司 Method and apparatus for guest return address stack emulation to support speculation
CN104424128B (en) * 2013-08-19 2019-12-13 上海芯豪微电子有限公司 Variable length instruction word processor system and method
GB2527567A (en) * 2014-06-26 2015-12-30 Ibm Optimising software code
US20160026486A1 (en) * 2014-07-25 2016-01-28 Soft Machines, Inc. An allocation and issue stage for reordering a microinstruction sequence into an optimized microinstruction sequence to implement an instruction set agnostic runtime architecture
US20160026484A1 (en) * 2014-07-25 2016-01-28 Soft Machines, Inc. System converter that executes a just in time optimizer for executing code from a guest image
US9733909B2 (en) 2014-07-25 2017-08-15 Intel Corporation System converter that implements a reordering process through JIT (just in time) optimization that ensures loads do not dispatch ahead of other loads that are to the same address
US10353680B2 (en) * 2014-07-25 2019-07-16 Intel Corporation System converter that implements a run ahead run time guest instruction conversion/decoding process and a prefetching process where guest code is pre-fetched from the target of guest branches in an instruction sequence
US9639370B1 (en) 2015-12-15 2017-05-02 International Business Machines Corporation Software instructed dynamic branch history pattern adjustment
US10042691B2 (en) 2016-04-26 2018-08-07 International Business Machines Corporation Operation of a multi-slice processor implementing exception handling in a nested translation environment
US10191745B2 (en) * 2017-03-31 2019-01-29 Intel Corporation Optimized call-return and binary translation
JP2019095952A (en) * 2017-11-21 2019-06-20 ソニーセミコンダクタソリューションズ株式会社 Processor, information processing device and processing method
US11422815B2 (en) * 2018-03-01 2022-08-23 Dell Products L.P. System and method for field programmable gate array-assisted binary translation
US11119766B2 (en) * 2018-12-06 2021-09-14 International Business Machines Corporation Hardware accelerator with locally stored macros
US11119928B2 (en) 2019-02-27 2021-09-14 International Business Machines Corporation Instant quiescing of an accelerator
US10673460B1 (en) 2019-02-27 2020-06-02 International Business Machines Corporation Spilling temporary results for accommodation of memory boundaries
US10963388B2 (en) 2019-06-24 2021-03-30 Samsung Electronics Co., Ltd. Prefetching in a lower level exclusive cache hierarchy
US11392386B2 (en) 2020-08-14 2022-07-19 International Business Machines Corporation Program counter (PC)-relative load and store addressing for fused instructions

Family Cites Families (153)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5053952A (en) 1987-06-05 1991-10-01 Wisc Technologies, Inc. Stack-memory-based writable instruction set computer having a single data bus
JP3015493B2 (en) 1991-04-26 2000-03-06 株式会社東芝 Semiconductor associative memory
US5438668A (en) 1992-03-31 1995-08-01 Seiko Epson Corporation System and method for extraction, alignment and decoding of CISC instructions into a nano-instruction bucket for execution by a RISC computer
WO1994027214A1 (en) 1993-05-07 1994-11-24 Apple Computer, Inc. Method for decoding sequences of guest instructions for a host computer
US5761467A (en) 1993-09-28 1998-06-02 Mitsubishi Denki Kabushiki Kaisha System for committing execution results when branch conditions coincide with predetermined commit conditions specified in the instruction field
JPH07271672A (en) 1994-03-30 1995-10-20 Toshiba Corp Multi-way set associative cache system
JPH10509819A (en) 1994-10-14 1998-09-22 シリコン グラフィックス インク. Indecking and multiplexing of interleaved cache memory arrays
US5649136A (en) 1995-02-14 1997-07-15 Hal Computer Systems, Inc. Processor structure and method for maintaining and restoring precise state at any instruction boundary
US5742802A (en) 1996-02-16 1998-04-21 International Business Machines Corporation Method and system for efficiently mapping guest instruction in an emulation assist unit
US5784638A (en) 1996-02-22 1998-07-21 International Business Machines Corporation Computer system supporting control transfers between two architectures
US5892934A (en) 1996-04-02 1999-04-06 Advanced Micro Devices, Inc. Microprocessor configured to detect a branch to a DSP routine and to direct a DSP to execute said routine
US5961639A (en) * 1996-12-16 1999-10-05 International Business Machines Corporation Processor and method for dynamically inserting auxiliary instructions within an instruction stream during execution
US5893121A (en) 1997-04-23 1999-04-06 Sun Microsystems, Inc. System and method for swapping blocks of tagged stack entries between a tagged stack cache and an untagged main memory storage
US6142682A (en) * 1997-06-13 2000-11-07 Telefonaktiebolaget Lm Ericsson Simulation of computer processor
US5956495A (en) 1997-09-22 1999-09-21 International Business Machines Corporation Method and system for processing branch instructions during emulation in a data processing system
US5953520A (en) 1997-09-22 1999-09-14 International Business Machines Corporation Address translation buffer for data processing system emulation mode
US5995743A (en) 1997-09-22 1999-11-30 International Business Machines Corporation Method and system for interrupt handling during emulation in a data processing system
US5870575A (en) 1997-09-22 1999-02-09 International Business Machines Corporation Indirect unconditional branches in data processing system emulation mode
US6202127B1 (en) 1997-11-26 2001-03-13 Compaq Computer Corporation Apparatus for spatial and temporal sampling in a computer memory system
US5974525A (en) 1997-12-05 1999-10-26 Intel Corporation System for allowing multiple instructions to use the same logical registers by remapping them to separate physical segment registers when the first is being utilized
US6138225A (en) 1997-12-24 2000-10-24 Intel Corporation Address translation system having first and second translation look aside buffers
JP3246438B2 (en) 1998-04-01 2002-01-15 日本電気株式会社 Variable target compilation processing method, its processing device, storage medium for its program, and storage medium for conversion table
JPH11296381A (en) * 1998-04-08 1999-10-29 Matsushita Electric Ind Co Ltd Virtual machine and compiler
US6205545B1 (en) * 1998-04-30 2001-03-20 Hewlett-Packard Company Method and apparatus for using static branch predictions hints with dynamically translated code traces to improve performance
US6463582B1 (en) 1998-10-21 2002-10-08 Fujitsu Limited Dynamic optimizing object code translator for architecture emulation and dynamic optimizing object code translation method
GB9825102D0 (en) 1998-11-16 1999-01-13 Insignia Solutions Plc Computer system
US6332215B1 (en) 1998-12-08 2001-12-18 Nazomi Communications, Inc. Java virtual machine hardware for RISC and CISC processors
US7225436B1 (en) 1998-12-08 2007-05-29 Nazomi Communications Inc. Java hardware accelerator using microcode engine
US6327650B1 (en) 1999-02-12 2001-12-04 Vsli Technology, Inc. Pipelined multiprocessing with upstream processor concurrently writing to local register and to register of downstream processor
US6460114B1 (en) 1999-07-29 2002-10-01 Micron Technology, Inc. Storing a flushed cache line in a memory buffer of a controller
US6928641B1 (en) 1999-10-08 2005-08-09 Texas Instruments Incorporated Method and system for far branch and call instructions
US6609194B1 (en) 1999-11-12 2003-08-19 Ip-First, Llc Apparatus for performing branch target address calculation based on branch type
US7418580B1 (en) 1999-12-02 2008-08-26 International Business Machines Corporation Dynamic object-level code transaction for improved performance of a computer
JP3556556B2 (en) 2000-02-08 2004-08-18 株式会社東芝 Instruction code conversion device and information processing system
US20020066081A1 (en) * 2000-02-09 2002-05-30 Evelyn Duesterwald Speculative caching scheme for fast emulation through statically predicted execution traces in a caching dynamic translator
US20010037492A1 (en) * 2000-03-16 2001-11-01 Holzmann Gerard J. Method and apparatus for automatically extracting verification models
JP2001273138A (en) * 2000-03-24 2001-10-05 Fujitsu Ltd Device and method for converting program
US20020100022A1 (en) * 2000-05-08 2002-07-25 Holzmann Gerard J. Method and apparatus for automatic verification of properties of a concurrent software system
US6647489B1 (en) 2000-06-08 2003-11-11 Ip-First, Llc Compare branch instruction pairing within a single integer pipeline
US6615300B1 (en) 2000-06-19 2003-09-02 Transmeta Corporation Fast look-up of indirect branch destination in a dynamic translation system
US7107437B1 (en) 2000-06-30 2006-09-12 Intel Corporation Branch target buffer (BTB) including a speculative BTB (SBTB) and an architectural BTB (ABTB)
US6711672B1 (en) 2000-09-22 2004-03-23 Vmware, Inc. Method and system for implementing subroutine calls and returns in binary translation sub-systems of computers
GB2367651B (en) 2000-10-05 2004-12-29 Advanced Risc Mach Ltd Hardware instruction translation within a processor pipeline
JP2002215387A (en) 2001-01-22 2002-08-02 Mitsubishi Electric Corp Data processor provided with instruction translator, and memory interface device
US7487330B2 (en) 2001-05-02 2009-02-03 International Business Machines Corporations Method and apparatus for transferring control in a computer system with dynamic compilation capability
US7200740B2 (en) 2001-05-04 2007-04-03 Ip-First, Llc Apparatus and method for speculatively performing a return instruction in a microprocessor
US7165169B2 (en) 2001-05-04 2007-01-16 Ip-First, Llc Speculative branch target address cache with selective override by secondary predictor based on branch instruction type
US6826681B2 (en) 2001-06-18 2004-11-30 Mips Technologies, Inc. Instruction specified register value saving in allocated caller stack or not yet allocated callee stack
JP4027620B2 (en) 2001-06-20 2007-12-26 富士通株式会社 Branch prediction apparatus, processor, and branch prediction method
US6832307B2 (en) 2001-07-19 2004-12-14 Stmicroelectronics, Inc. Instruction fetch buffer stack fold decoder for generating foldable instruction status information
US6898699B2 (en) 2001-12-21 2005-05-24 Intel Corporation Return address stack including speculative return address buffer with back pointers
US7577944B2 (en) 2002-03-18 2009-08-18 Hewlett-Packard Development Company, L.P. Unbundling, translation and rebundling of instruction bundles in an instruction stream
US20060117308A1 (en) 2002-08-30 2006-06-01 Renesas Technology Corp. Data processing apparatus and ic card
WO2004059472A2 (en) * 2002-12-24 2004-07-15 Sun Microsystems, Inc. Method and apparatus for generating prefetches
US20040128658A1 (en) 2002-12-27 2004-07-01 Guei-Yuan Lueh Exception handling with stack trace cache
US7203932B1 (en) 2002-12-30 2007-04-10 Transmeta Corporation Method and system for using idiom recognition during a software translation process
US6810473B2 (en) 2003-01-06 2004-10-26 Sun Microsystems, Inc. Replacement algorithm for a replicated fully associative translation look-aside buffer
US7191291B2 (en) 2003-01-16 2007-03-13 Ip-First, Llc Microprocessor with variable latency stack cache
JP3896087B2 (en) * 2003-01-28 2007-03-22 松下電器産業株式会社 Compiler device and compiling method
US7069413B1 (en) 2003-01-29 2006-06-27 Vmware, Inc. Method and system for performing virtual to physical address translations in a virtual machine monitor
US7278030B1 (en) 2003-03-03 2007-10-02 Vmware, Inc. Virtualization system for computers having multiple protection mechanisms
US7111145B1 (en) 2003-03-25 2006-09-19 Vmware, Inc. TLB miss fault handler and method for accessing multiple page tables
GB0316531D0 (en) 2003-07-15 2003-08-20 Transitive Ltd Method and apparatus for performing native binding
US7290253B1 (en) 2003-09-30 2007-10-30 Vmware, Inc. Prediction mechanism for subroutine returns in binary translation sub-systems of computers
US7391933B2 (en) 2003-10-30 2008-06-24 Samsung Electronics Co., Ltd. Method and apparatus for image interpolation based on adaptive polyphase filters
US7590982B1 (en) 2003-12-17 2009-09-15 Vmware, Inc. System and method for virtualizing processor and interrupt priorities
US7237067B2 (en) 2004-04-22 2007-06-26 Hewlett-Packard Development Company, L.P. Managing a multi-way associative cache
JP4520790B2 (en) 2004-07-30 2010-08-11 富士通株式会社 Information processing apparatus and software prefetch control method
US8443171B2 (en) 2004-07-30 2013-05-14 Hewlett-Packard Development Company, L.P. Run-time updating of prediction hint instructions
US7571090B2 (en) 2004-09-30 2009-08-04 Intel Corporation Emulating a host architecture in guest firmware
KR100630702B1 (en) * 2004-10-05 2006-10-02 삼성전자주식회사 Controller for instruction cache and instruction translation look-aside buffer, and method of controlling the same
US7453904B2 (en) * 2004-10-29 2008-11-18 Intel Corporation Cut-through communication protocol translation bridge
US7496735B2 (en) 2004-11-22 2009-02-24 Strandera Corporation Method and apparatus for incremental commitment to architectural state in a microprocessor
US7653132B2 (en) 2004-12-21 2010-01-26 Stmicroelectronics, Inc. Method and system for fast implementation of subpixel interpolation
US7428626B2 (en) 2005-03-08 2008-09-23 Microsoft Corporation Method and system for a second level address translation in a virtual machine environment
US8370819B2 (en) 2005-03-25 2013-02-05 Microsoft Corporation Mechanism to store information describing a virtual machine in a virtual disk image
US7383374B2 (en) 2005-03-31 2008-06-03 Intel Corporation Method and apparatus for managing virtual addresses
US7734895B1 (en) 2005-04-28 2010-06-08 Massachusetts Institute Of Technology Configuring sets of processor cores for processing instructions
TWI306215B (en) 2005-04-29 2009-02-11 Ind Tech Res Inst Method and corresponding apparatus for compiling high-level languages into specific processor architectures
US20070006178A1 (en) 2005-05-12 2007-01-04 Microsoft Corporation Function-level just-in-time translation engine with multiple pass optimization
US20060282821A1 (en) 2005-06-10 2006-12-14 Renno Erik K Efficient subprogram return in microprocessors
US7703088B2 (en) 2005-09-30 2010-04-20 Intel Corporation Compressing “warm” code in a dynamic binary translation environment
US9003421B2 (en) 2005-11-28 2015-04-07 Intel Corporation Acceleration threads on idle OS-visible thread execution units
US20070174717A1 (en) 2006-01-10 2007-07-26 Sun Microsystems, Inc. Approach for testing instruction TLB using user/application level techniques
TWI287801B (en) 2006-01-13 2007-10-01 Optimum Care Int Tech Inc Memory module having address transforming function and method for controlling memory
US8099730B2 (en) 2006-03-30 2012-01-17 Intel Corporation Heterogeneous virtualization of host and guest OS having different register sizes using translation layer to extract device port numbers for host OS system memory addresses
JP5010164B2 (en) 2006-03-31 2012-08-29 株式会社日立製作所 Server apparatus and virtual machine control program
US7568189B2 (en) 2006-05-03 2009-07-28 Sony Computer Entertainment Inc. Code translation and pipeline optimization
US7752417B2 (en) 2006-06-05 2010-07-06 Oracle America, Inc. Dynamic selection of memory virtualization techniques
US7478228B2 (en) 2006-08-31 2009-01-13 Qualcomm Incorporated Apparatus for generating return address predictions for implicit and explicit subroutine calls
US20080127142A1 (en) 2006-11-28 2008-05-29 Microsoft Corporation Compiling executable code into a less-trusted address space
US7617493B2 (en) 2007-01-23 2009-11-10 International Business Machines Corporation Defining memory indifferent trace handles
US8190664B2 (en) 2007-04-26 2012-05-29 International Business Machines Corporation Employing a mask field of an instruction to encode a sign of a result of the instruction
US8688920B2 (en) 2007-05-14 2014-04-01 International Business Machines Corporation Computing system with guest code support of transactional memory
JP2008299795A (en) 2007-06-04 2008-12-11 Nec Electronics Corp Branch prediction controller and method thereof
US8205194B2 (en) 2007-06-29 2012-06-19 Microsoft Corporation Updating offline virtual machines or VM images
KR101498673B1 (en) 2007-08-14 2015-03-09 삼성전자주식회사 Solid state drive, data storing method thereof, and computing system including the same
CN100478915C (en) 2007-09-20 2009-04-15 福建星网锐捷网络有限公司 CPU abnormal point positioning diagnosis method based MIPS structure
JP2009087028A (en) 2007-09-28 2009-04-23 Toshiba Corp Memory system and memory read method, and program
US7971044B2 (en) 2007-10-05 2011-06-28 Qualcomm Incorporated Link stack repair of erroneous speculative update
US7844954B2 (en) 2007-11-06 2010-11-30 Vmware, Inc. Using branch instruction counts to facilitate replay of virtual machine instruction execution
JP5091655B2 (en) 2007-12-21 2012-12-05 株式会社日立製作所 Computer virtualization apparatus, program thereof, and method thereof
CN101470661B (en) 2007-12-28 2012-03-14 鸿富锦精密工业(深圳)有限公司 Computer program debugging system and method
US8041922B2 (en) 2008-01-11 2011-10-18 International Business Machines Corporation Enhanced dynamic address translation with load real address function
US20090182985A1 (en) 2008-01-11 2009-07-16 International Business Machines Corporation Move Facility and Instructions Therefore
US8151085B2 (en) 2008-01-17 2012-04-03 International Business Machines Corporation Method for address translation in virtual machines
US8078792B2 (en) 2008-01-22 2011-12-13 Advanced Micro Devices, Inc. Separate page table base address for minivisor
US8819647B2 (en) 2008-01-25 2014-08-26 International Business Machines Corporation Performance improvements for nested virtual machines
US8832682B2 (en) 2008-03-28 2014-09-09 Vmware, Inc. Trace collection for a virtual machine
US8245227B2 (en) 2008-05-30 2012-08-14 Vmware, Inc. Virtual machine execution using virtualization software with shadow page tables and address space interspersed among guest operating system address space
US8275971B2 (en) 2008-08-27 2012-09-25 International Business Machines Corporation Method and apparatus for managing software controlled cache of translating the physical memory access of a virtual machine between different levels of translation entities
US8473930B2 (en) 2008-11-05 2013-06-25 Oracle America, Inc. Handling signals and exceptions in a dynamic translation environment
CN101751345B (en) 2008-12-10 2012-04-11 国际商业机器公司 Simulator and simulation method for running programs of client in host computer
US8959277B2 (en) 2008-12-12 2015-02-17 Oracle America, Inc. Facilitating gated stores without data bypass
US8078854B2 (en) 2008-12-12 2011-12-13 Oracle America, Inc. Using register rename maps to facilitate precise exception semantics
US20100161950A1 (en) 2008-12-24 2010-06-24 Sun Microsystems, Inc. Semi-absolute branch instructions for efficient computers
US8561040B2 (en) 2009-03-10 2013-10-15 Oracle America, Inc. One-pass compilation of virtual instructions
US8832354B2 (en) 2009-03-25 2014-09-09 Apple Inc. Use of host system resources by memory controller
US8103894B2 (en) 2009-04-24 2012-01-24 International Business Machines Corporation Power conservation in vertically-striped NUCA caches
US8140758B2 (en) 2009-04-24 2012-03-20 International Business Machines Corporation Data reorganization in non-uniform cache access caches
US8799879B2 (en) 2009-06-30 2014-08-05 Oracle America, Inc. Method and apparatus for protecting translated code in a virtual machine
US8386745B2 (en) 2009-07-24 2013-02-26 Advanced Micro Devices, Inc. I/O memory management unit including multilevel address translation for I/O and computation offload
US9158566B2 (en) 2009-09-18 2015-10-13 International Business Machines Corporation Page mapped spatially aware emulation of computer instruction set
US8447583B2 (en) 2009-09-18 2013-05-21 International Business Machines Corporation Self initialized host cell spatially aware emulation of a computer instruction set
US8428930B2 (en) 2009-09-18 2013-04-23 International Business Machines Corporation Page mapped spatially aware emulation of a computer instruction set
US8301434B2 (en) 2009-09-18 2012-10-30 International Business Machines Corporation Host cell spatially aware emulation of a guest wild branch
US8612731B2 (en) 2009-11-06 2013-12-17 International Business Machines Corporation Branch target buffer for emulation environments
US8364461B2 (en) 2009-11-09 2013-01-29 International Business Machines Corporation Reusing invalidated traces in a system emulator
JP5984118B2 (en) 2009-12-04 2016-09-06 マーベル ワールド トレード リミテッド Storage device virtualization
KR101247259B1 (en) * 2009-12-17 2013-04-01 한국전자통신연구원 Virtualization apparatus and its processing method
US8438334B2 (en) 2009-12-22 2013-05-07 International Business Machines Corporation Hybrid storage subsystem with mixed placement of file contents
US8775153B2 (en) 2009-12-23 2014-07-08 Intel Corporation Transitioning from source instruction set architecture (ISA) code to translated code in a partial emulation environment
JP2011198091A (en) 2010-03-19 2011-10-06 Toshiba Corp Virtual address cache memory, processor, and multiprocessor system
US20110238962A1 (en) * 2010-03-23 2011-09-29 International Business Machines Corporation Register Checkpointing for Speculative Modes of Execution in Out-of-Order Processors
US8572635B2 (en) 2010-06-23 2013-10-29 International Business Machines Corporation Converting a message signaled interruption into an I/O adapter event notification
US8650337B2 (en) 2010-06-23 2014-02-11 International Business Machines Corporation Runtime determination of translation formats for adapter functions
WO2012103253A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Multilevel conversion table cache for translating guest instructions to native instructions
WO2012103373A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Variable caching structure for managing physical storage
KR101612594B1 (en) 2011-01-27 2016-04-14 소프트 머신즈, 인크. Guest instruction to native instruction range based mapping using a conversion look aside buffer of a processor
WO2012103367A2 (en) 2011-01-27 2012-08-02 Soft Machines, Inc. Guest to native block address mappings and management of native code storage
WO2012103359A2 (en) * 2011-01-27 2012-08-02 Soft Machines, Inc. Hardware acceleration components for translating guest instructions to native instructions
WO2012103245A2 (en) 2011-01-27 2012-08-02 Soft Machines Inc. Guest instruction block with near branching and far branching sequence construction to native instruction block
US9495183B2 (en) 2011-05-16 2016-11-15 Microsoft Technology Licensing, Llc Instruction set emulation for guest operating systems
US8645633B2 (en) 2011-05-17 2014-02-04 International Business Machines Corporation Facilitating data coherency using in-memory tag bits and faulting stores
US9354886B2 (en) 2011-11-28 2016-05-31 Apple Inc. Maintaining the integrity of an execution return address stack
US9280347B2 (en) 2012-03-15 2016-03-08 International Business Machines Corporation Transforming non-contiguous instruction specifiers to contiguous instruction specifiers
US10656945B2 (en) 2012-06-15 2020-05-19 International Business Machines Corporation Next instruction access intent instruction for indicating usage of a storage operand by one or more instructions subsequent to a next sequential instruction
US8819648B2 (en) 2012-07-20 2014-08-26 International Business Machines Corporation Control flow management for execution of dynamically translated non-native code in a virtual hosting environment
US20140258696A1 (en) 2013-03-05 2014-09-11 Qualcomm Incorporated Strided target address predictor (stap) for indirect branches
CN109358948B (en) 2013-03-15 2022-03-25 英特尔公司 Method and apparatus for guest return address stack emulation to support speculation
WO2014151652A1 (en) 2013-03-15 2014-09-25 Soft Machines Inc Method and apparatus to allow early dependency resolution and data forwarding in a microprocessor
CN104679480A (en) 2013-11-27 2015-06-03 上海芯豪微电子有限公司 Instruction set transition system and method
US9477453B1 (en) 2015-06-24 2016-10-25 Intel Corporation Technologies for shadow stack manipulation for binary translation systems

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105659285A (en) * 2013-08-08 2016-06-08 Arm有限公司 Data processing systems
CN105659285B (en) * 2013-08-08 2019-04-12 Arm有限公司 Data processing system, the method for operating data processing system and computer-readable medium
CN107077368A (en) * 2014-07-25 2017-08-18 英特尔公司 The system of framework during unknowable for instruction set operation
CN107077369A (en) * 2014-07-25 2017-08-18 英特尔公司 Framework during using multiple conversion tables to realize operation that instruction set is unknowable
CN107077368B (en) * 2014-07-25 2020-09-29 英特尔公司 System for instruction set agnostic runtime architecture
US11281481B2 (en) 2014-07-25 2022-03-22 Intel Corporation Using a plurality of conversion tables to implement an instruction set agnostic runtime architecture
US11354242B2 (en) 2017-05-05 2022-06-07 Samsung Electronics Co., Ltd. Efficient early ordering mechanism
TWI768039B (en) * 2017-05-05 2022-06-21 南韓商三星電子股份有限公司 Multiprocessor system, data management method and non-transitory computer readable medium

Also Published As

Publication number Publication date
US20170115991A1 (en) 2017-04-27
US20170068540A1 (en) 2017-03-09
TWI512498B (en) 2015-12-11
US10394563B2 (en) 2019-08-27
WO2012103359A3 (en) 2012-09-20
US20170235575A1 (en) 2017-08-17
US9733942B2 (en) 2017-08-15
US20130024661A1 (en) 2013-01-24
US11467839B2 (en) 2022-10-11
WO2012103359A2 (en) 2012-08-02
US20170024212A1 (en) 2017-01-26

Similar Documents

Publication Publication Date Title
TW201245976A (en) Hardware acceleration components for translating guest instructions to native instructions
TWI512465B (en) Guest to native block address mappings and management of native code storage
TWI534706B (en) Variable caching structure for managing physical storage
US10042643B2 (en) Guest instruction to native instruction range based mapping using a conversion look aside buffer of a processor
US9921842B2 (en) Guest instruction block with near branching and far branching sequence construction to native instruction block
US10185567B2 (en) Multilevel conversion table cache for translating guest instructions to native instructions

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees