TW201005633A

TW201005633A - Recovery apparatus for solving branch mis-prediction and method and central processing unit thereof

Info

Publication number: TW201005633A
Application number: TW97127014A
Authority: TW
Inventors: Chih-Yung Chiu
Original assignee: Faraday Tech Corp
Priority date: 2008-07-16
Filing date: 2008-07-16
Publication date: 2010-02-01
Also published as: TWI362001B

Abstract

According to exemplary embodiments of the present invention, a recovery apparatus for solving branch mis-prediction, a method thereof, and a central processing unit thereof are provided. The recovery apparatus includes an instruction buffer, at least one circulation instruction buffer, and a decoding and pairing circuit, wherein the decoding and pairing circuit is coupled to the instruction buffer and the circulation instruction buffer. The instruction buffer stores a plurality of instructions, and the circulation instruction buffer stores a recovery instruction queue corresponding to the instructions, wherein the recovery instruction queue includes a plurality of recovery instructions. The decoding and pairing circuit decodes and pairs both the instructions and the recovery instructions. When a branch mis-prediction occurs, the decoding and pairing circuit outputs the recovery instructions to an instruction execution processing circuit that is externally connected to the decoding and pairing circuit.

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a branch prediction (Branch Prediction) apparatus for a central processing unit (Central Processing Unit, CPU), and more particularly to a recovery apparatus for solving branch mis-prediction (Branch Mis-Prediction), a method thereof, and a central processing unit thereof.

[Prior Art]

With the advancement of semiconductor technology, computers have become a necessity of daily life, and people write computer programs to handle many kinds of tasks. The most important core of a computer is the central processing unit, and the central processing units currently on the market all include a branch prediction apparatus to handle branch instructions. In general, roughly one in every four to five instructions of a program is a branch instruction, so a central processing unit with a branch prediction apparatus can achieve better performance.

However, when the central processing unit processes a branch instruction, the branch prediction apparatus cannot correctly predict the next instruction to execute every time. The branch prediction apparatus may therefore mispredict a branch and cost the central processing unit some performance; this loss is commonly known as the branch penalty (Branch Penalty).
To address this problem, many research institutions and CPU manufacturers have tried to lower the probability of branch mis-prediction in order to raise CPU performance. As a result, many prediction algorithms (Prediction Algorithm) and branch prediction architectures have been proposed.

However, these software algorithms and hardware architectures only reduce the probability of branch mis-prediction. For a conditional branch (Conditional Branch), or for the final loop (Final Loop) of a loop operation, branch mis-prediction is unavoidable. To deal with branch mis-prediction, researchers have therefore proposed the multi-path execution CPU architecture (Multi-Path Execution CPU Architecture).
However, a multi-path execution CPU can handle only one branch mis-prediction, so it needs a confidence estimator (Confidence Estimator) to help the CPU fetch (Fetch) the instructions of multiple paths at the same time. In addition, a multi-path execution CPU executes the instructions of both paths to completion at the same time, so it also needs a register renaming mechanism (Register Renaming Mechanism) to handle data dependency (Data Dependency) and register commitment (Register Commitment). At present, because of its excessive complexity, the multi-path execution CPU remains mostly at the stage of theory and research and is rarely realized in hardware. Moreover, in the related papers and studies, a multi-path execution CPU improves performance by only about 10%.

Besides a multi-path execution CPU, another approach is to use a multi-threading CPU (Multi-Threading CPU) and let the compiler use two threads (Threads) to handle multi-path execution.

However, in a superscalar CPU (Superscalar CPU) with a deep pipeline, multi-path execution methods and architectures are too complex to implement as circuits. Furthermore, because the instructions of multiple paths must be executed, these architectures and methods need a confidence estimator, or extra effort on the compiler side, to gain performance.

[Summary of the Invention]

The present invention provides a recovery apparatus for solving the branch mis-prediction of a conditional branch (Conditional Branch) or of the final loop (Final Loop) of a loop operation, a method thereof, and a central processing unit thereof. This recovery apparatus can effectively reduce the performance loss caused by branch mis-prediction.
In addition, because the recovery apparatus and method are of low complexity, they can be applied to a deep-pipeline (Deep-Pipeline) superscalar central processing unit.

An example of the present invention provides a recovery apparatus for solving branch mis-prediction. The recovery apparatus includes an instruction buffer, at least one circulation instruction buffer, and a decoding and pairing circuit, wherein the decoding and pairing circuit is coupled to the instruction buffer and the circulation instruction buffer. The instruction buffer stores a plurality of instructions, and the circulation instruction buffer stores a recovery instruction queue corresponding to the instructions; the recovery instruction queue includes a plurality of recovery instructions. The decoding and pairing circuit decodes and pairs the instructions and the recovery instructions. When a branch mis-prediction occurs, the decoding and pairing circuit outputs the recovery instructions to the instruction execution processing circuit externally connected to it.

According to an example of the present invention, the recovery apparatus further includes a branch target buffer, a cache controller, and an instruction cache memory circuit. The branch target buffer is coupled to the cache controller, the cache controller is coupled to the instruction cache memory circuit, and the instruction cache memory circuit is coupled to the circulation instruction buffer and the instruction buffer. The branch target buffer detects whether a branch prediction occurs and, when it does, sends the target program counter to the cache controller. The cache controller uses the target program counter to control the instruction cache memory circuit and fetch the instructions and recovery instructions stored in it. The instruction cache memory circuit stores the instructions and the recovery instructions.

An example of the present invention provides a recovery method for solving branch mis-prediction. The method includes the following steps: (a) receiving a plurality of instructions into an instruction buffer memory, and receiving a plurality of recovery instructions corresponding to the instructions into at least one circulation instruction buffer memory; (b) decoding and pairing the instructions and the recovery instructions; and (c) when a branch mis-prediction occurs, outputting the recovery instructions to an instruction execution processing circuit.

According to an example of the present invention, the recovery method further includes the following step: (d) setting the branch mis-prediction bit of the recovery instructions to 1, and recording the branch program counter of the recovery instructions.

According to an example of the present invention, the recovery method further includes the following step: (e) if no branch mis-prediction occurs, outputting the instructions to the instruction execution processing circuit.

According to an example of the present invention, the number of recovery instructions that the circulation instruction buffer can store equals the number of instructions that the instruction execution processing circuit can execute.

An example of the present invention provides a central processing unit having a recovery apparatus. The central processing unit includes an instruction buffer, at least one circulation instruction buffer, a decoding and pairing circuit, and an instruction execution processing circuit. The decoding and pairing circuit is coupled to the instruction buffer and the circulation instruction buffer, and the instruction execution processing circuit is coupled to the decoding and pairing circuit. The instruction buffer stores a plurality of instructions, and the circulation instruction buffer stores a recovery instruction queue corresponding to the instructions; the recovery instruction queue includes a plurality of recovery instructions. The decoding and pairing circuit decodes and pairs the instructions and the recovery instructions.
When a branch mis-prediction occurs, the decoding and pairing circuit outputs the recovery instructions to the instruction execution processing circuit, and the instruction execution processing circuit executes the recovery instructions and the instructions.

According to an example of the present invention, the central processing unit further includes a branch target buffer, a cache controller, and an instruction cache memory circuit. The branch target buffer is coupled to the cache controller, the cache controller is coupled to the instruction cache memory circuit, and the instruction cache memory circuit is coupled to the circulation instruction buffer and the instruction buffer. The branch target buffer detects whether a branch prediction occurs and, when it does, sends the target program counter to the cache controller. The cache controller uses the target program counter to control the instruction cache memory circuit and fetch the instructions and recovery instructions stored in it. The instruction cache memory circuit stores the instructions and the recovery instructions.

In summary, the recovery apparatus, the method thereof, and the central processing unit thereof provided by the examples of the present invention reduce the loss incurred on a branch mis-prediction. The recovery apparatus only requires adding at least one circulation instruction buffer to the original CPU architecture and a few logic gates to the original decoding and pairing circuit, so its complexity is low and it is easy to implement as a circuit. Furthermore, assuming a 12-stage pipeline architecture in which the cache hit rate and the branch target buffer hit rate are both 95%, a central processing unit with this recovery apparatus gains about 9 to 10% in performance compared with a central processing unit without the instruction buffer.
To make the above features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings.

[Embodiment]

The examples of the present invention provide a recovery apparatus for solving branch mis-prediction, a method thereof, and a central processing unit thereof; the recovery apparatus reduces the performance loss incurred on a branch mis-prediction. As noted above, roughly one in every 4 to 5 instructions of a program is a branch instruction, and branch mis-prediction is unavoidable at a conditional branch or a final loop. Therefore, when a branch mis-prediction occurs, the recovery apparatus provided by the examples of the present invention lets the back-end instruction execution processing circuit continue by executing the corresponding recovery instructions, which reduces the overall performance loss.

Referring to FIG. 1, FIG. 1 shows a central processing unit 100 according to an example of the present invention. The central processing unit 100 includes a recovery apparatus 10 for use on a branch mis-prediction, an instruction execution processing circuit 20, an instruction cache memory circuit 30, a cache controller 40, and a branch target buffer (Branch Target Buffer, BTB) 50. The branch target buffer 50 is coupled to the cache controller 40, the cache controller 40 is coupled to the instruction cache memory circuit 30, and the recovery apparatus 10 is coupled between the instruction execution processing circuit 20 and the instruction cache memory circuit 30.
Note that the instruction execution processing circuit 20 may be a superscalar instruction execution processing circuit with a deep pipeline; however, the type of the instruction execution processing circuit 20 does not limit the present invention. In addition, the instruction cache memory circuit 30, the cache controller 40, and the branch target buffer 50 are separate from the recovery apparatus 10 in this example, but this example does not limit the present invention. In other words, in another implementation the instruction cache memory circuit 30, the cache controller 40, or the branch target buffer 50 is designed inside the recovery apparatus 10.

When the branch target buffer 50 detects a branch instruction and that branch instruction hits (Hit) the branch target buffer 50, the branch target buffer 50 sends the target program counter (Target Program Counter, Target PC) to the cache controller 40. The cache controller 40 uses the target program counter to control the instruction cache memory circuit 30, fetches (Fetch) the instructions stored in the instruction cache memory circuit 30, and sends these instructions to the recovery apparatus 10.
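The hit-and-fetch behavior described above can be illustrated with a minimal Python sketch. It is not the patented circuit: the class `BranchTargetBuffer`, the function `fetch_addresses`, and the constant `INSTR_BYTES` are hypothetical names introduced only for illustration, and the 4-byte instruction size is an assumption taken from the `n+4` naming used in the worked example of this description.

```python
from typing import Dict, List, Optional, Tuple

INSTR_BYTES = 4  # assumption: fixed 4-byte instructions, as the n+4 naming suggests


class BranchTargetBuffer:
    """Toy BTB (hypothetical): maps a branch PC to its predicted target PC."""

    def __init__(self) -> None:
        self._targets: Dict[int, int] = {}

    def record(self, branch_pc: int, target_pc: int) -> None:
        """Remember the target that a branch at branch_pc jumped to."""
        self._targets[branch_pc] = target_pc

    def lookup(self, pc: int) -> Optional[int]:
        """Return the target PC on a hit, or None on a miss."""
        return self._targets.get(pc)


def fetch_addresses(
    btb: BranchTargetBuffer, branch_pc: int, count: int
) -> Optional[Tuple[List[int], List[int]]]:
    """On a BTB hit, return (predicted-path PCs, fall-through PCs).

    The predicted-path PCs start at the target PC; the fall-through PCs
    start at branch_pc + 4 and stand in for the recovery instructions.
    """
    target_pc = btb.lookup(branch_pc)
    if target_pc is None:
        return None  # miss: no target PC is sent to the cache controller
    predicted = [target_pc + i * INSTR_BYTES for i in range(count)]
    recovery = [branch_pc + (i + 1) * INSTR_BYTES for i in range(count)]
    return predicted, recovery
```

In this sketch a hit yields two address lists, one for the predicted path and one for the fall-through path, which mirrors the idea that both the instructions and the recovery instructions are fetched from the instruction cache.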

The recovery apparatus 10 temporarily stores the instructions fetched from the instruction cache memory circuit 30 in its buffers. These instructions may include branch-predicted branch instructions, ordinary instructions, and recovery instructions for use on a mis-prediction. The recovery apparatus 10 then decodes and pairs (Pairing) a predicted branch instruction together with its recovery instructions, or decodes and pairs ordinary instructions alone. When no branch mis-prediction occurs, the recovery apparatus 10 sends the ordinary instructions or the predicted branch instructions to the back-end instruction execution processing circuit 20 for execution. If a branch mis-prediction occurs, the recovery apparatus 10 instead sends the recovery instructions to the back-end instruction execution processing circuit 20 so that execution continues.

When a branch mis-prediction occurs, the instruction execution processing circuit 20 clears all instructions in its pipeline. At the same time, however, the recovery apparatus 10 sends the recovery instructions to the instruction execution processing circuit 20 for continued execution, so the recovery apparatus 10 can use this period to fetch and buffer the instructions to be executed afterwards.
In this way, the performance loss caused by having to stall (Stall) the pipeline of the instruction execution processing circuit 20 and to stall the recovery apparatus 10 on a branch mis-prediction can be effectively reduced.

Next, the structure and implementation of the recovery apparatus 10 are described further. As shown in FIG. 1, the recovery apparatus 10 includes at least one circulation instruction buffer (Circulation Instruction Queue) 60 to 62, an instruction buffer 63, and a decoding and pairing circuit 64. The circulation instruction buffers 60 to 62 are coupled between the decoding and pairing circuit 64 and the instruction cache memory circuit 30, the decoding and pairing circuit 64 is coupled to the instruction execution processing circuit 20, and the instruction buffer 63 is coupled between the decoding and pairing circuit 64 and the instruction cache memory circuit 30. In the pipelined central processing unit 100, the decoding and pairing circuit 64 can be integrated into the decode stage (Decode Stage) at the front of the pipeline, while the instruction execution processing circuit 20 can include the later stages of the pipeline, such as the execution stage (Execution Stage).

Note that although this example has three circulation instruction buffers 60 to 62, this number does not limit the present invention. In other words, the number of circulation instruction buffers 60 to 62 can be chosen according to the user's needs. In this example, the instruction buffer 63 is assumed to store 12 instructions, and on average one in every 4 instructions is a branch instruction, so the number of circulation instruction buffers 60 to 62 is set to 12/4 = 3.
Put differently, the general design rule for the number of circulation instruction buffers 60 to 62 is ⌈x/n⌉, where x is the number of instructions the instruction buffer 63 can store, n means that on average one in every n instructions is a branch instruction, and ⌈x/n⌉ denotes the smallest integer greater than or equal to x/n.

In addition, in this example each of the circulation instruction buffers 60 to 62 can store 6 recovery instructions. This number likewise does not limit the present invention; the number of recovery instructions that each circulation instruction buffer can store can be chosen according to the user's needs. This example assumes that the instruction execution processing circuit 20 can process at most 6 instructions, so each of the circulation instruction buffers 60 to 62 is set to store 6 recovery instructions. Simply put, the number of recovery instructions that each circulation instruction buffer can store is generally set to the maximum number of instructions that the back-end instruction execution processing circuit can process.

The cache controller 40 uses the target program counter to control the instruction cache memory circuit 30 and fetch (Fetch) the instructions stored in it; the ordinary instructions and branch instructions are sent to the instruction buffer 63 for storage. At this point the circulation instruction buffers 60 to 62 send a request (Request) to the cache controller 40, and the cache controller 40 sends the recovery instruction queues (Recovery Instruction Queue) corresponding to these instructions from the instruction cache memory circuit 30 to the circulation instruction buffers 60 to 62 for storage. Each recovery instruction queue includes a plurality of recovery instructions.
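The sizing rules given above for the circulation instruction buffers can be checked with a short Python sketch. The function names are hypothetical and chosen only for illustration; the ceiling is computed with integer arithmetic, which is an implementation choice of this sketch rather than something stated in the description.

```python
def circulation_buffer_count(buffer_capacity: int, branch_interval: int) -> int:
    """Number of circulation buffers: ceil(x / n), using integer arithmetic.

    buffer_capacity (x): instructions the instruction buffer can hold.
    branch_interval (n): on average, one branch per n instructions.
    """
    if buffer_capacity <= 0 or branch_interval <= 0:
        raise ValueError("both arguments must be positive")
    return (buffer_capacity + branch_interval - 1) // branch_interval


def recovery_queue_depth(backend_max_instructions: int) -> int:
    """Depth of each circulation buffer: the most instructions the back end can process."""
    return backend_max_instructions


# Values from this example: x = 12, n = 4, back end handles at most 6 instructions.
buffers = circulation_buffer_count(12, 4)   # 3 circulation buffers
depth = recovery_queue_depth(6)             # 6 recovery instructions each
total_entries = buffers * depth             # 18 recovery-instruction entries in total
```

With the example's figures this gives three buffers of six entries each; a buffer of 13 instructions with the same branch density would round up to four buffers.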
Because the back-end instruction execution processing circuit 20 is a deep-pipeline superscalar instruction execution circuit, this example uses three circulation instruction buffers 60 to 62 to hold the recovery instructions. Hence, when a branch mis-prediction occurs, the central processing unit 100 using the recovery apparatus 10 can effectively reduce the branch penalty.

Next, when a branch instruction enters the decoding and pairing circuit 64, its corresponding recovery instructions enter the decoding and pairing circuit 64 at the same time. The decoding and pairing circuit 64 then decodes and pairs the branch instruction and the recovery instructions together. The decoding and pairing circuit 64 next sets the mis-prediction valid bit (Mis-Prediction Valid Bit) of the branch instruction to 1 and records the branch program counter (Branch Program Counter) inherited by the branch instruction. If no branch mis-prediction occurs, the decoding and pairing circuit 64 outputs the branch instruction to the back-end instruction execution processing circuit 20. If an ordinary instruction rather than a branch instruction enters the decoding and pairing circuit 64, the circuit receives only the ordinary instruction, decodes and pairs it, and sends it to the back-end instruction execution processing circuit 20.

When a branch mis-prediction occurs, the instructions in the entire pipeline of the instruction execution processing circuit 20 and in the instruction buffer 63 are cleared. The decoding and pairing circuit 64 then sends the recovery instructions corresponding to the mispredicted branch instruction to the instruction execution processing circuit 20, so that the instruction execution processing circuit 20 goes on to execute these recovery instructions.
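The pairing and selection behavior just described can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the circuit itself: the names `PairedBranch` and `DecodePairSketch` are hypothetical, instructions are represented as plain strings, and the model keeps paired branches in a dictionary keyed by the recorded branch program counter.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PairedBranch:
    """A branch paired with its recovery queue (illustrative model only)."""
    name: str
    branch_pc: int                 # the recorded branch program counter
    recovery_queue: List[str]
    mispredict_valid: bool = True  # the misprediction valid bit, set to 1 at pairing


class DecodePairSketch:
    """Toy model of the decoding and pairing step (hypothetical names)."""

    def __init__(self) -> None:
        self.paired: Dict[int, PairedBranch] = {}

    def decode(self, name: str, pc: int,
               recovery_queue: Optional[List[str]] = None) -> List[str]:
        """Decode one instruction and return what is sent to the back end."""
        if recovery_queue is not None:
            # Branch instruction: pair it with its recovery queue and record its PC.
            self.paired[pc] = PairedBranch(name, pc, list(recovery_queue))
        # Without a misprediction, only the instruction itself is sent on.
        return [name]

    def mispredict(self, branch_pc: int) -> List[str]:
        """On a misprediction, return the recovery queue paired with that branch."""
        entry = self.paired.pop(branch_pc, None)
        if entry is None or not entry.mispredict_valid:
            return []
        return entry.recovery_queue
```

The sketch separates the common case, in which each decoded instruction is simply forwarded, from the misprediction case, in which the stored recovery queue is handed to the back end instead of newly fetched instructions.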
Because the recovery instructions are then in the pipeline of the instruction execution processing circuit 20 and are executed next, the instruction buffer 63 can use this time to fetch new instructions. This reduces the performance loss caused by having to stall (Stall) the pipeline of the instruction execution processing circuit 20 and to hold up the decoding and pairing circuit 64 on a branch mis-prediction.

In addition, because multiple instructions must be fetched from the instruction cache memory circuit 30 at the same time for the instruction buffer 63 and the circulation instruction buffers 60 to 62, the instruction cache memory circuit 30 can be designed with a skew cache (Skew Cache) architecture. In that case, 8 characters can be read from the instruction cache memory circuit 30 simultaneously in each clock period. The conditional-branch and final-loop cases are both cases in which the branch prediction mechanism predicts that the branch will be taken but it actually is not, so with the skew cache architecture the bandwidth of the instruction cache memory circuit 30 incurs no additional consumption. Of course, combining the instruction cache memory circuit 30 with a skew cache architecture does not limit the present invention.

This design is used only so that the bandwidth of the instruction cache memory circuit 30 incurs no additional consumption.

Still referring to FIG. 1, the instructions stored in the instruction buffer 63 in FIG. 1 are used here as an example together with the description above. The example assumes that a branch mis-prediction occurs at a final loop or a conditional branch. As shown in FIG. 1, the instruction buffer 63 holds 12 instructions: n-8, n-4, n (branch instruction), m-c, m-8, m-4, m (branch instruction), t-c, t-8, t-4, t (branch instruction), s, and s+4.

The circulation instruction buffer 62 stores the recovery instruction queue of 6 recovery instructions corresponding to the first branch instruction n; its first recovery instruction is n+4, that is, the instruction that sequentially follows n. The instruction m, on the other hand, can be the branch target (Branch Target) that the branch target buffer 50 predicts for the branch instruction n; that is, the branch prediction mechanism predicts that execution jumps to the instruction m after the instruction n. Likewise, the circulation instruction buffer 61 stores the recovery instruction queue of 6 recovery instructions corresponding to the second branch instruction m; its first recovery instruction is m+4.
The circulation instruction buffer 60 stores the recovery instruction queue of 6 recovery instructions corresponding to the third branch instruction t; its first recovery instruction is t+4, that is, the instruction that sequentially follows t. The instruction s is the branch target that the branch prediction mechanism predicts for the branch instruction t.

Assume here that the branch mis-prediction occurs at the third predicted branch instruction t. When the instructions n-8 and n-4 enter the decoding and pairing circuit 64, the circuit simply decodes and pairs them and then sends them directly to the instruction execution processing circuit 20. Next, when the first branch instruction n enters the decoding and pairing circuit 64, the circuit also receives the recovery instruction n+4 from the circulation instruction buffer 62. Because no branch mis-prediction has occurred, the decoding and pairing circuit 64 then sends the instructions n, m-c, m-8, and m-4 to the instruction execution processing circuit 20.
Next, when the second branch instruction m enters the decoding and pairing circuit 64, the decoding and pairing circuit 64 also receives the recovery instruction m+4 from the circulation instruction buffer 61 at the same time. Because no branch mis-prediction has occurred, the decoding and pairing circuit 64 then sends instructions m, t-c, t-8 and t-4 to the instruction execution processing circuit 20. When the third branch instruction t enters the decoding and pairing circuit 64, the decoding and pairing circuit 64 also receives the recovery instruction t+4 from the corresponding circulation instruction buffer at the same time. Assume that instruction t actually branches to the next instruction t+4 rather than to the predicted instruction s. Because a branch mis-prediction has now occurred, the decoding and pairing circuit 64 decodes and pairs the recovery instructions held in that circulation instruction buffer and sends them to the instruction execution processing circuit 20. This reduces the performance loss that would otherwise result from stalling the pipeline of the instruction execution processing circuit 20 and stalling the decoding and pairing circuit 64 when a branch mis-prediction occurs.

This situation often arises in iterating program loops. For example, a loop may need to jump back from instruction t to instruction s several times before it can exit to the next instruction t+4. While the loop keeps iterating, the branch prediction mechanism learns that "instruction t jumps to instruction s" and uses this as the basis for its prediction. When the loop ends and execution must continue from instruction t to the next instruction t+4, however, the prediction is wrong, which creates a hazard during processor execution. The present invention can recover from this kind of branch mis-prediction efficiently.

Finally, please refer to FIG. 2, which is a flowchart of a recovery method for solving branch mis-prediction according to an example of the present invention.
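The loop-exit case described above can be illustrated with a small behavioral sketch. It is not the patented circuit: the function name `run_loop`, the addresses 0x40 and 0x20, and the assumption that the predictor has already learned "t jumps to s" are all hypothetical choices made for illustration. Only the queue depth of six recovery instructions and the fall-through address t+4 come from the description.

```python
from collections import deque


def run_loop(iterations, t=0x40, s=0x20, depth=6):
    """Model a backward branch at address t that loops to s, then exits.

    Hypothetical sketch: addresses and names are illustrative only.
    """
    # Assumption: the predictor has already learned "t jumps to s".
    predicted_target = s
    # Circulation buffer holding the fall-through path t+4, t+8, ...
    recovery_queue = deque((t + 4 * k for k in range(1, depth + 1)), maxlen=depth)

    issued = []
    for i in range(iterations):
        # The branch is taken on every pass except the last, which falls through.
        actual_target = s if i < iterations - 1 else t + 4
        if actual_target == predicted_target:
            issued.append(("predicted", predicted_target))
        else:
            # Mis-prediction: issue the buffered recovery instructions instead.
            issued.extend(("recovery", pc) for pc in recovery_queue)
    return issued
```

Calling `run_loop(3)` yields two correctly predicted passes followed by the six recovery addresses starting at t+4, which mirrors the point of the description: the only mis-prediction is at loop exit, and the fall-through instructions are already available when it happens.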
The recovery method of FIG. 2 can be applied to a central processing unit, in particular a superscalar central processing unit with a deep pipeline. First, in step S80, a plurality of instructions are received into an instruction buffer memory, and a plurality of recovery instructions corresponding to these instructions are received into at least one circulation instruction buffer memory. Next, in step S81, the instructions and the recovery instructions are encoded and paired, the branch mis-prediction bit of the recovery instructions is set to 1, and the branch program counter of the recovery instructions is recorded. Finally, in step S82, when a branch mis-prediction occurs, the recovery instructions are output to the instruction execution processing circuit; if no branch mis-prediction occurs, the instructions are output to the instruction execution processing circuit.

In summary, the recovery apparatus, the method and the central processing unit provided by the examples of the present invention reduce the loss incurred when a branch mis-prediction occurs. The recovery apparatus only requires adding at least one circulation instruction buffer to the original central processing unit architecture and a few logic gates to the original decoding and pairing circuit, so its complexity is low and it is easy to implement in circuitry. In addition, assuming a 12-stage pipeline architecture in which the cache hit rate and the branch target buffer hit rate are both 95%, a central processing unit with this recovery apparatus can achieve better performance than a central processing unit without such an instruction buffer.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the technical field may make some changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.

[Brief Description of the Drawings]
FIG. 1 shows a central processing unit 100 provided according to an example of the present invention.
FIG. 2 is a flowchart of a recovery method for solving branch mis-prediction according to an example of the present invention.

[Description of Main Element Symbols]
100: central processing unit
10: recovery apparatus
20: instruction execution processing circuit
30: instruction cache memory circuit
40: cache controller
50: branch target buffer
60~62: circulation instruction buffers
63: instruction buffer
64: decoding and pairing circuit
S80~S82: steps
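Steps S80 to S82 of the recovery method can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the circuit itself: the `Decoded` record, the function `recover`, and its parameters are hypothetical names introduced here, and the processing and pairing of step S81 is reduced to tagging each recovery instruction with a mis-prediction bit and the program counter of its branch.

```python
from dataclasses import dataclass


@dataclass
class Decoded:
    """Hypothetical record for one processed instruction."""
    pc: int                  # program counter of the instruction
    mispredict_bit: int = 0  # 1 marks an instruction from the recovery queue
    branch_pc: int = -1      # program counter of the branch it belongs to


def recover(instr_pcs, recovery_pcs, branch_pc, mispredicted):
    """Return the stream that is passed on to the execution circuit."""
    # S80: both streams have been received (modeled as the two input lists).
    # S81: process both streams; tag every recovery instruction with
    # mis-prediction bit 1 and record the branch program counter.
    normal = [Decoded(pc) for pc in instr_pcs]
    recovery = [Decoded(pc, 1, branch_pc) for pc in recovery_pcs]
    # S82: output the recovery instructions only on a mis-prediction.
    return recovery if mispredicted else normal
```

For example, `recover([0x20, 0x24], [0x44, 0x48], 0x40, True)` returns the two recovery entries tagged with bit 1 and branch program counter 0x40, while passing `False` returns the untagged predicted-path entries.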

Claims

X. Claims:

1. A recovery apparatus for solving branch mis-prediction, comprising:
an instruction buffer for storing a plurality of instructions;
at least one circulation instruction buffer for storing a recovery instruction queue corresponding to the instructions, the recovery instruction queue comprising a plurality of recovery instructions; and
a decoding and pairing circuit, coupled to the instruction buffer and the circulation instruction buffer, for decoding and pairing the instructions and the recovery instructions, wherein when a branch mis-prediction occurs, the decoding and pairing circuit outputs the recovery instructions to an instruction execution processing circuit connected thereto.

2. The recovery apparatus of claim 1, wherein if no branch mis-prediction occurs, the decoding and pairing circuit outputs the instructions to the instruction execution processing circuit.

3. The recovery apparatus of claim 1, further comprising:
a branch target buffer for detecting whether a branch prediction occurs and, when the branch prediction occurs, sending a target program counter to a cache controller;
the cache controller, coupled to the branch target buffer, for controlling, according to the target program counter, an instruction cache memory circuit to fetch the instructions and the recovery instructions stored therein; and
the instruction cache memory circuit, coupled to the circulation instruction buffer and the instruction buffer, for storing the instructions and the recovery instructions.

4. The recovery apparatus of claim 1, wherein the number of recovery instructions that the circulation instruction buffer can store is equal to the number of instructions that the instruction execution processing circuit can execute.

5. The recovery apparatus of claim 1, wherein the decoding and pairing circuit further sets the branch mis-prediction bit of the recovery instructions to 1 and records the branch program counter of the recovery instructions.

6. A recovery method for solving branch mis-prediction, comprising:
receiving a plurality of instructions into an instruction buffer memory, and receiving a plurality of recovery instructions corresponding to the instructions into at least one circulation instruction buffer memory;
encoding and pairing the instructions and the recovery instructions; and
outputting the recovery instructions to an instruction execution processing circuit when a branch mis-prediction occurs.

7. The recovery method of claim 6, further comprising:
setting the branch mis-prediction bit of the recovery instructions to 1, and recording the branch program counter of the recovery instructions.

8. The recovery method of claim 6, further comprising:
if no branch mis-prediction occurs, outputting the instructions to the instruction execution processing circuit.

9. The recovery method of claim 6, wherein the number of recovery instructions that the circulation instruction buffer memory can store is equal to the number of instructions that the instruction execution processing circuit can execute.

10. A central processing unit having a recovery apparatus, comprising:
an instruction buffer for storing a plurality of instructions;
at least one circulation instruction buffer for storing a recovery instruction queue corresponding to the instructions, the recovery instruction queue comprising a plurality of recovery instructions;
a decoding and pairing circuit, coupled to the instruction buffer and the circulation instruction buffer, for decoding and pairing the instructions and the recovery instructions, wherein when a branch mis-prediction occurs, the decoding and pairing circuit outputs the recovery instructions; and
an instruction execution processing circuit, coupled to the decoding and pairing circuit, for executing the recovery instructions and the instructions.

11. The central processing unit of claim 10, wherein if no branch mis-prediction occurs, the decoding and pairing circuit outputs the instructions to the instruction execution processing circuit.

12. The central processing unit of claim 10, further comprising:
a branch target buffer for detecting whether a branch prediction occurs and, when the branch prediction occurs, sending a target program counter to a cache controller;
the cache controller, coupled to the branch target buffer, for controlling, according to the target program counter, an instruction cache memory circuit to fetch the instructions and the recovery instructions stored therein; and
the instruction cache memory circuit, coupled to the circulation instruction buffer and the instruction buffer, for storing the instructions and the recovery instructions.

13. The central processing unit of claim 10, wherein the number of recovery instructions that the circulation instruction buffer can store is equal to the number of instructions that the instruction execution processing circuit can execute.

14. The central processing unit of claim 10, wherein the decoding and pairing circuit further sets the branch mis-prediction bit of the recovery instructions to 1 and records the branch program counter of the recovery instructions.