TW536685B - Group rename allocation for multiple register instructions

Publication number: TW536685B
Authority: TW (Taiwan)
Prior art keywords: instruction, register, registers, allocation, pool
Application number: TW90123583A
Other languages: Chinese (zh)
Inventor: James Allan Kahle
Original assignee: IBM
Application filed by IBM; application granted

Landscapes: Advance Control (AREA)

Abstract

A microprocessor and a method of processing instructions therein are disclosed. Initially, the microprocessor determines whether a dispatched instruction requires multiple register allocations. If the instruction requires only a single register allocation, the microprocessor allocates a register for the instruction in a first portion of a physical register pool. If the instruction requires multiple register allocations, a set of physical registers in a second portion of the register pool is allocated as a block after the microprocessor determines that the second portion is available. In one embodiment, block allocation is performed only if the instruction requiring multiple register allocations is an eligible instruction, such as a load multiple word (LMW) instruction in an instruction set such as the PowerPC(R) instruction set. The set of block-allocated registers may correspond to a fixed set of architected registers to simplify the allocation logic. In this embodiment, the registers allocated in response to a complex instruction (one requiring multiple register allocations) are independent of the architected registers affected by the instruction. If the number of architected registers affected by the instruction exceeds the number of registers in the second portion of the register pool, registers for some of them may be allocated in the first portion of the register pool. The portion of the physical register pool selected for block assignments may be moved following a block allocation.
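The allocation policy in the abstract can be sketched in Python as follows. This is a minimal model under several assumptions that are not taken from the source:

- The pool is a flat array of register indices.
- The second (reserve) portion is a contiguous window that wraps around the pool.
- Registers are handed out lowest index first.
- After each block allocation, the window simply advances by its own size.

`PhysicalRegisterPool` and its methods are hypothetical names. Deciding whether an instruction is eligible for block allocation (for example, LMW) is left to the caller.

```python
from __future__ import annotations


class PhysicalRegisterPool:
    """Hypothetical model: a physical register pool with a main portion for
    simple instructions and a movable reserve window for block allocation."""

    def __init__(self, size: int, reserve_start: int, reserve_size: int):
        self.size = size
        self.reserve_start = reserve_start
        self.reserve_size = reserve_size
        self.free = set(range(size))  # all physical registers start available

    def reserve_window(self) -> list[int]:
        # Indices wrap modulo the pool size so the window can move anywhere.
        return [(self.reserve_start + i) % self.size
                for i in range(self.reserve_size)]

    def allocate_simple(self) -> int | None:
        """One result register for a simple instruction."""
        main_free = sorted(self.free - set(self.reserve_window()))
        # Prefer the main portion; fall back to the reserve only if main is empty.
        candidates = main_free or sorted(self.free)
        if not candidates:
            return None  # nothing free: dispatch would stall
        reg = candidates[0]
        self.free.remove(reg)
        return reg

    def allocate_block(self, count: int) -> list[int] | None:
        """Registers for an eligible complex instruction needing `count` results."""
        window = self.reserve_window()
        if not all(reg in self.free for reg in window):
            return None  # reserve portion not (fully) available
        block = window[:count]
        # Results beyond the reserve size spill into the main portion.
        spill = count - len(block)
        main_free = sorted(self.free - set(window))
        if len(main_free) < spill:
            return None
        allocated = block + main_free[:spill]
        self.free.difference_update(allocated)
        # Move the reserve window to the registers just past the old window.
        self.reserve_start = (self.reserve_start + self.reserve_size) % self.size
        return allocated

    def release(self, reg: int) -> None:
        self.free.add(reg)


demo = PhysicalRegisterPool(size=16, reserve_start=8, reserve_size=4)
print(demo.allocate_simple())   # 0
print(demo.allocate_block(6))   # [8, 9, 10, 11, 1, 2]
print(demo.reserve_window())    # [12, 13, 14, 15]
```

In the demo, a six-result instruction (such as LMW R26, which writes R26 through R31) receives the whole four-register reserve window plus two registers spilled from the main portion. The window then moves on.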

Description

536685 A7 B7 V. Description of the Invention (1)

Background of the Invention

1. Field of the Invention

The present invention relates generally to the field of microprocessors and, more particularly, to a microprocessor having a register allocation mechanism that can allocate registers simultaneously to an instruction requiring multiple register allocations.

2. Description of the Related Art

In the field of microprocessor architecture, register renaming is widely used to facilitate out-of-order and speculative instruction execution. In one register renaming methodology, referred to herein as table-mapped register renaming, registers from a physical register pool are allocated as needed to architected registers (the registers visible to a program). A mapping table records the association between each architected register and its corresponding physical register. When an instruction requires a particular architected register, that architected register is assigned to one of the registers in the physical register pool, and the assignment is recorded in the mapping table. For example, an instruction that loads architected register R5 from memory causes a physical register to be allocated to R5. If the seventh physical register is allocated, the R5 entry in the mapping table indicates that R5 is currently assigned to physical register 7. When the contents are retrieved from memory, the data is stored in physical register 7. If a subsequent instruction updates the contents of R5, R5 is logically assigned to a new physical register and the mapping table is updated.

The contents of the mapping table change continually to reflect current register assignments. In addition, each register in the physical register file may be committed to a particular architected register, allocated, or available. Consequently, complex logic is generally required to determine which physical registers are available for allocation and to select one register for allocation when an instruction is issued.

-4- This paper size applies to the Chinese National Standard (CNS) A4 specification (210X 297 mm)
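The following is a minimal Python sketch of table-mapped renaming, using the R5 example above. It rests on two assumptions:

- The free list is a plain FIFO supplied by the caller. It is chosen here so that the first allocation returns physical register 7.
- The committed/allocated/available bookkeeping described above is collapsed into that single list.

`RenameMapTable` and its methods are hypothetical names.

```python
from __future__ import annotations

from collections import deque


class RenameMapTable:
    """Hypothetical table-mapped renaming: architected register -> physical register."""

    def __init__(self, free_physical: list[int]):
        self.free = deque(free_physical)  # available physical registers, in allocation order
        self.table = {}                   # current architected-to-physical mapping

    def lookup(self, arch_reg: str) -> int | None:
        return self.table.get(arch_reg)

    def rename_destination(self, arch_reg: str) -> tuple[int, int | None]:
        """Give `arch_reg` a new physical register; return (new, previous)."""
        if not self.free:
            raise RuntimeError("no free physical register; dispatch must stall")
        phys = self.free.popleft()
        previous = self.table.get(arch_reg)
        self.table[arch_reg] = phys
        return phys, previous


table = RenameMapTable([7, 12, 3])
print(table.rename_destination("R5"))  # (7, None): the load into R5 lands in physical register 7
print(table.rename_destination("R5"))  # (12, 7): a later update of R5 gets a new register
```

Returning the previous mapping lets a caller release that physical register later. Deciding when that is safe is outside the scope of this sketch.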

The complexity of the allocation logic limits the number of registers that can be allocated in a single cycle.

In most microprocessor instruction sets, the majority of instructions affect the contents of only a single register. In that case, the number of instructions issued in a cycle equals the number of registers that must be allocated. In some instruction sets, however, a single instruction (referred to herein as a complex instruction) can affect multiple registers and therefore requires multiple register allocations. If the allocation logic is a limiting factor on the processor cycle time, the processor may be unable to allocate the additional registers required by a complex instruction within a single cycle. It must then issue fewer than the maximum number of instructions in any cycle in which a complex instruction is issued. Because it is desirable to maximize the number of instructions issued per cycle, it would be desirable to implement a mechanism that simplifies the allocation processing of complex instructions so that the maximum number of instructions can be issued in every cycle.
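The throughput problem described above can be illustrated with a small Python model. It packs instructions, in program order, into dispatch cycles. It assumes that the allocation logic can hand out at most `width` registers per cycle, where `width` is also the dispatch width. For comparison, it also assumes that block allocation lets a complex instruction consume a single allocation slot. Both are modeling assumptions, and `dispatch_cycles` is a hypothetical name. An empty list in the result marks a cycle spent only on allocation.

```python
from __future__ import annotations


def dispatch_cycles(reg_needs: list[int], width: int,
                    block_alloc: bool = False) -> list[list[int]]:
    """Pack instructions (by index, in program order) into dispatch cycles.

    reg_needs[i] is how many result registers instruction i needs.
    width is both the dispatch width and, by assumption, the number of
    registers the allocation logic can hand out per cycle.
    With block_alloc=True, a multi-register instruction costs one slot.
    """
    cycles: list[list[int]] = []
    current: list[int] = []
    budget = width
    for i, need in enumerate(reg_needs):
        cost = 1 if (block_alloc and need > 1) else need
        # Close the cycle if it is full or cannot fit this instruction's registers.
        if len(current) == width or (current and cost > budget):
            cycles.append(current)
            current, budget = [], width
        # An instruction needing more than `width` registers burns whole
        # cycles in which nothing else is dispatched.
        while cost > budget:
            cost -= budget
            cycles.append([])
            budget = width
        current.append(i)
        budget -= cost
    if current:
        cycles.append(current)
    return cycles


# Four single-result instructions around one LMW needing six registers.
print(dispatch_cycles([1, 1, 6, 1, 1], width=4))                    # [[0, 1], [], [2, 3, 4]]
print(dispatch_cycles([1, 1, 6, 1, 1], width=4, block_alloc=True))  # [[0, 1, 2, 3], [4]]
```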
Summary of the Invention

A microprocessor and a method of processing instructions therein are disclosed. Initially, the microprocessor determines whether a dispatched instruction requires multiple register allocations. If the instruction requires only a single register allocation, the microprocessor allocates a register for the instruction in a first portion of a physical register pool. If the instruction requires multiple register allocations, a set of physical registers in a second portion of the register pool is allocated as a block after the microprocessor determines that the second portion is available. In one embodiment, block allocation is performed only if the instruction requiring multiple register allocations is an eligible instruction, such as a load multiple word (LMW) instruction in an instruction set such as the PowerPC® instruction set. The set of block-allocated registers may correspond to a fixed set of architected registers to simplify the allocation logic. In this embodiment, the registers allocated in response to a complex instruction are independent of the architected registers affected by the instruction. If the number of architected registers affected by the instruction exceeds the number of registers in the second portion of the register pool, some of the registers affected by the instruction may be allocated in the first portion of the register pool. The portion of the physical register pool selected for block assignments may be moved following a block allocation.

Brief Description of the Drawings

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a data processing system;
FIG. 2 is a block diagram of a processor suitable for use in the data processing system of FIG. 1;
FIG. 3 is a block diagram illustrating details of an issue unit according to one embodiment of the present invention;
FIG. 4 illustrates the physical register pool prior to a block register allocation as contemplated by the present invention; and
FIG. 5 illustrates the physical register pool following a block register allocation according to one embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the drawings and detailed description presented herein are not intended to limit the invention to the particular forms disclosed. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Detailed Description of a Preferred Embodiment of the Invention

Referring now to FIG. 1, an embodiment of a data processing system 100 according to the present invention is depicted. System 100 has one or more central processing units (processors) 101a, 101b, 101c, etc. (collectively referred to as processors 101). In one embodiment, each processor 101 comprises a reduced instruction set computer (RISC) microprocessor. Additional information concerning RISC processors in general is available in C. May et al., PowerPC Architecture: A Specification for a New Family of RISC Processors (Morgan Kaufmann, 2nd ed., 1994). Processors 101 are coupled to system memory 250 and to various other components via system bus 113. Read-only memory (ROM) 102 is coupled to system bus 113 and includes a basic input/output system (BIOS), which controls certain basic functions of system 100. FIG. 1 further depicts an I/O adapter 107 and a network adapter 106 coupled to system bus 113.
I/O adapter 107 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 103 and/or a tape storage device 105. I/O adapter 107, hard disk 103, and tape storage device 105 are collectively referred to herein as main storage 104. Network adapter 106 interconnects bus 113 with an external network, enabling data processing system 100 to communicate with other such systems. Display monitor 136 is connected to system bus 113 by display adapter 112, which may include a video controller and a graphics adapter to improve the performance of graphics-intensive applications. In one embodiment, adapters 107, 106, and 112 may be connected to one or more I/O buses that are connected to system bus 113 via an intermediate bus bridge (not shown). Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters include the Peripheral Component Interconnect (PCI) bus according to PCI Local Bus Specification Rev. 2.2, available from the PCI Special Interest Group, Hillsboro, OR, and incorporated by reference herein. Additional input/output devices are shown as connected to system bus 113 via user interface adapter 108 and display adapter 112. A keyboard 109, mouse 110, and speaker 111 are all interconnected with bus 113 via user interface adapter 108, which may include a SuperI/O chip integrating multiple device adapters into a single integrated circuit.
For additional information concerning such chips, the reader is referred to the PC87338/PC97338 ACPI 1.0 and PC98/99 Compliant SuperI/O data sheet from National Semiconductor Corporation (November 1998) at www.national.com. Thus, as configured in FIG. 1, system 100 includes processing means in the form of processors 101, storage means including system memory 250 and main storage 104, input means such as keyboard 109 and mouse 110, and output means including speaker 111 and display 136. In one embodiment, a portion of system memory 250 and main storage 104 collectively store an operating system, such as the AIX® operating system from IBM Corporation, to coordinate the functions of the various components shown in FIG. 1. Additional detail concerning the AIX operating system is available in the following IBM Corporation publications, available at www.ibm.com and incorporated by reference herein: AIX Version 4.3 Technical Reference: Base Operating System and Extensions, Volumes 1 and 2 (order numbers SC23-4159 and SC23-4160); AIX Version 4.3 System User's Guide: Communications and Networks (order number SC23-4122); and AIX Version 4.3 System User's Guide: Operating System and Devices (order number SC23-4121).

Turning now to FIG. 2, a simplified block diagram of an embodiment of a processor 101 suitable for use in system 100 is presented. In the depicted embodiment, processor 101 comprises an integrated circuit superscalar microprocessor fabricated on a single semiconductor substrate. Processor 101 includes various execution units, registers, buffers, memories, and other functional units, as discussed in greater detail below. As illustrated in FIG. 2, processor 101 is coupled to system bus 113 via bus interface unit (BIU) 212 and processor bus 213, which, like system bus 113, includes address, data, and control buses. BIU 212 controls the transfer of information between processor 101 and other devices coupled to system bus 113, such as system memory 250 and main storage 104. It will be appreciated that processor 101 may include other devices coupled to system bus 113 that are not necessary for an understanding of the following description and are accordingly omitted for the sake of simplicity.

BIU 212 is connected to instruction cache and memory management unit 214 and to data cache and memory management unit 216 within processor 101. High-speed caches, such as those within instruction cache 214 and data cache 216, enable processor 101 to achieve relatively fast access to a subset of data or instructions previously transferred from system memory 250, thus improving the operating speed of data processing system 100. Data and instructions stored within data cache 216 and instruction cache 214, respectively, are identified and accessed by address tags. Each tag comprises a selected number of bits (typically the high-order bits) of the physical system memory address at which the data or instructions reside. Sequential fetch unit 217 retrieves instructions for execution from instruction cache 214 during each clock cycle. In one embodiment, if sequential fetch unit 217 retrieves a branch instruction from instruction cache 214, the branch instruction is forwarded to branch processing unit (BPU) 218 for execution.

Sequential fetch unit 217 forwards non-branch instructions to instruction queue 219, where the instructions are stored temporarily pending execution by other functional units of processor 101. A dispatch unit 220 retrieves stored instructions from queue 219 and forwards them to issue unit (ISU) 221. Dispatch unit 220 schedules the dispatch of instructions to issue unit 221 based on instruction completion information received from completion unit 240. The depicted embodiment of ISU 221 includes one or more issue queues 222a, 222b, 222c, etc. (collectively referred to as issue queues 222). ISU 221 is responsible for keeping the pipelines fully loaded by issuing new instructions to the execution units in each cycle whenever possible. In one embodiment, instructions are issued from ISU 221 out of order.

In the depicted embodiment, the execution circuitry of processor 101 includes, in addition to BPU 218, multiple execution units: a general-purpose fixed-point execution unit (FXU) 223, a load/store unit (LSU) 228, and a floating-point execution unit (FPU) 230. FXU 223 may represent a dedicated fixed-point arithmetic and logic unit capable of performing fixed-point addition, subtraction, AND, OR, and XOR operations using source operands received from specified general-purpose registers (GPRs) 232. In other embodiments, execution unit 223 may include a load/store unit and an arithmetic/logic unit. Following the execution of a fixed-point instruction, fixed-point execution unit 223 outputs the result of the instruction to GPR buffers 232, which provide storage for results received on result bus 262. FPU 230 may perform single- and double-precision floating-point arithmetic and logical operations, such as floating-point multiplication and division, on source operands received from floating-point registers (FPRs) 236.


FPU 230 outputs the data results of floating-point instruction execution to selected FPR buffers 236.
LSU 228 typically executes two kinds of instructions. Floating-point and fixed-point load instructions load data from data cache 216, a lower-level cache memory (not depicted), or system memory 250 into selected GPRs 232 or FPRs 236. Floating-point and fixed-point store instructions store data from a selected one of GPRs 232 or FPRs 236 to data cache 216 and, ultimately, to system memory 250.

In the preferred embodiment, processor 101 employs out-of-order instruction execution to further improve the performance of its superscalar architecture. Accordingly, FXU 223, LSU 228, and FPU 230 can execute instructions in an order that differs from the original program order, as long as data dependencies are observed. As noted previously, FXU 223, LSU 228, and FPU 230 process instructions in a sequence of pipeline stages. In one embodiment, processor 101 has five distinct pipeline stages: fetch, decode/dispatch, execute, finish, and completion.

During the fetch stage, sequential fetch unit 217 retrieves one or more non-branch instructions from instruction cache 214 and stores the fetched instructions in instruction queue 219. In contrast, sequential fetch unit 217 forwards branch instructions from the instruction stream to BPU 218 for execution. BPU 218 includes a branch prediction mechanism that, in one embodiment, comprises a dynamic prediction mechanism such as a branch history table. This enables BPU 218 to speculatively execute unresolved conditional branch instructions by predicting whether or not each branch will be taken.

During the decode/dispatch stage, dispatch unit 220 and ISU 221 decode one or more instructions and issue them from issue queues 222 to execution units 223, 228, and 230, typically in program order.
In one embodiment, ISU 221 may allocate one or more physical registers in general-purpose register (GPR) pool 232 and floating-point register (FPR) pool 236 to hold instruction results, recording each allocation in the mapping table. In addition, instructions (or instruction identifiers or tags representing the instructions) may be stored in the multiple-slot completion buffer (completion table) of completion unit 240 as a means of tracking which instructions have completed.

During the execute stage, execution units 223, 228, and 230 execute instructions issued from ISU 221 opportunistically as operands and execution resources for the indicated operations become available. Execution units 223, 228, and 230 include reservation stations that store instructions dispatched to each unit until operands or execution resources become available. After execution of an instruction terminates, execution units 223, 228, and 230 store the data results, if any, in either the GPRs or the FPRs, depending on the instruction type. In the depicted embodiment, execution units 223, 228, and 230 notify completion unit 240 which instructions have finished execution. Finally, instructions are completed in program order out of the completion table of completion unit 240. The data results of the instructions are transferred from GPR rename buffers 233 and FPR rename buffers 237 to GPRs 232 and FPRs 236, respectively.

Processor 101 preferably supports out-of-order speculative instruction execution. Instructions may be executed speculatively based on a predicted branch or beyond an instruction that may cause an interrupt condition. In the event of a branch misprediction or an interrupt, hardware automatically flushes unwanted instructions from the pipelines and discards unwanted results.
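The in-order completion step described above (instructions finish out of order but complete in program order from the completion table) can be sketched as a simple queue in Python. This is a minimal sketch under stated assumptions:

- Entries are keyed by an integer tag and stored in dispatch order.
- The transfer of results from rename buffers to the architected registers is not modeled.

`CompletionTable` and its methods are hypothetical names.

```python
from collections import OrderedDict


class CompletionTable:
    """Hypothetical completion table: entries kept in program (dispatch) order."""

    def __init__(self) -> None:
        self.entries = OrderedDict()  # itag -> finished flag, oldest first

    def dispatch(self, itag: int) -> None:
        self.entries[itag] = False  # appended at the tail: program order

    def finish(self, itag: int) -> None:
        self.entries[itag] = True  # execution units may finish out of order

    def complete(self) -> list:
        """Retire finished instructions from the head; stop at the first unfinished one."""
        done = []
        while self.entries:
            itag, finished = next(iter(self.entries.items()))
            if not finished:
                break
            self.entries.popitem(last=False)  # remove the oldest entry
            done.append(itag)
        return done


ct = CompletionTable()
for itag in (1, 2, 3):
    ct.dispatch(itag)
ct.finish(3)
print(ct.complete())  # [] because instruction 1 has not finished yet
ct.finish(1)
ct.finish(2)
print(ct.complete())  # [1, 2, 3]
```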
Incorrect speculative results are selectively flushed from all units in one clock cycle, and instruction issue can resume in the following clock cycle. As instructions are issued, ISU 221 tags them in a manner that makes it easy to determine the relative age of any two instructions.

In one embodiment, sequential instructions are tagged with integer values (ITAGs). In another embodiment, multiple instructions may be grouped together for tracking purposes and assigned a common identifier, referred to herein as a group tag (GTAG). In addition to providing a mechanism for determining the issue order and relative age of issued instructions, ITAGs and GTAGs provide a shorthand representation of their corresponding instructions. An instruction's tag value accompanies it through the queue entries and pipeline stages in which it resides. Tags facilitate the instruction flush mechanism that responds to a processor-generated flush instruction. A magnitude comparison is performed between the ITAG or GTAG associated with the flush instruction and the ITAG or GTAG associated with a particular queue entry or execution unit stage. The entry is invalidated if it holds an instruction that is as young as or younger than the flushed instruction (that is, issued at the same time or later). All remnants of the flushed instruction, and of all subsequent instructions, are flushed from the machine, and the fetch unit is redirected to begin fetching at the address of the flushed instruction.
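The tag-based flush described above amounts to a magnitude comparison per entry, which can be sketched in Python. This minimal sketch assumes that ITAGs increase monotonically with issue order and never wrap. Real hardware with finite-width tags would need a wrap-aware comparison, which is not shown here. `QueueEntry` and `flush` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueueEntry:
    itag: int          # assumed: increases with issue order and never wraps
    valid: bool = True


def flush(entries: list[QueueEntry], flush_itag: int) -> list[int]:
    """Invalidate entries as young as or younger than the flush point.

    Returns the ITAGs that survive, in queue order."""
    for entry in entries:
        # Magnitude comparison: a larger ITAG means issued later (younger).
        if entry.itag >= flush_itag:
            entry.valid = False
    return [e.itag for e in entries if e.valid]


queue = [QueueEntry(t) for t in (4, 9, 5, 12, 7)]
print(flush(queue, 7))  # [4, 5]
```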
Turning now to FIG. 3, additional detail of microprocessor 101 according to one embodiment of the present invention is depicted. In the illustrated embodiment, dispatch unit 220 is configured to receive instructions fetched from instruction cache 214 (shown in FIG. 2). Dispatch unit 220 includes allocation logic 302, which determines whether a received instruction requires a single register allocation or multiple register allocations and allocates the appropriate number of entries in physical register pool 306. The number of register allocations an instruction requires can be determined from the number of architected general-purpose registers the instruction will affect. In addition, allocation logic 302 records the association between each allocated register and its architected register in mapping table 304. Mapping table 304 thus contains information indicating the relationship between architected registers and physical registers.


Although the illustrated embodiment shows allocation logic 302 as part of dispatch unit 220, those skilled in microprocessor architecture will appreciate that allocation logic 302 may reside in other functional units of processor 101 (for example, in unit 221, which contains the issue unit in some implementations).

To achieve maximum processor performance, allocation logic 302 is preferably able to allocate all required result registers in a single cycle. Because multiple instructions are typically dispatched and issued in a single cycle, allocation logic 302 should be able to allocate enough registers in register pool 306 to sustain the maximum dispatch and issue rate. As processor cycle times shrink, however, it becomes increasingly difficult to allocate a sufficient number of registers in the time available.

In most instruction sets, the great majority of microprocessor instructions affect the contents of only a single register. Allocation logic is therefore typically optimized to allocate N registers in a single cycle, where N is the maximum number of instructions that microprocessor 101 can dispatch and issue in a single cycle. With such logic, if one or more instructions affect the contents of multiple registers, the allocation logic generally cannot allocate enough registers in one cycle. Allocation logic 302, by contrast, is configured to accommodate instructions that require multiple result-register allocations without reducing the number of instructions that can be dispatched or issued in that cycle.

An example of an instruction that requires multiple result registers is the load multiple word (LMW) instruction of the PowerPC® instruction set. The LMW instruction affects the contents of the specified destination register and of every higher-numbered register. For example, the instruction LMW R26, EA loads six consecutive words (four-byte segments) from memory, starting at effective address EA, into registers R26 through R31 (the PowerPC® instruction set supports 32 architected general-purpose registers). Allocating that many registers for a single instruction within the available time is generally not achievable at target cycle times using conventional allocation methods.

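To make the scale of the problem concrete, the sketch below computes the destination set of a PowerPC `lmw RT, D(RA)` instruction (RT through R31). It also computes how many cycles a conventional allocator would need if it could allocate at most N registers per cycle. N = 4 is an arbitrary illustrative value, and the function names are hypothetical.

```python
import math

NUM_ARCH_GPRS = 32  # PowerPC architected general-purpose registers R0..R31


def lmw_destinations(rt: int) -> list:
    """Registers written by `lmw RT, D(RA)`: RT, RT+1, ..., R31."""
    if not 0 <= rt < NUM_ARCH_GPRS:
        raise ValueError("RT must be in the range 0..31")
    return list(range(rt, NUM_ARCH_GPRS))


def conventional_allocation_cycles(num_allocations: int, slots_per_cycle: int) -> int:
    """Cycles needed if at most `slots_per_cycle` registers can be allocated per cycle."""
    return math.ceil(num_allocations / slots_per_cycle)


N = 4  # assumed dispatch/issue width, for illustration only
for rt in (26, 16, 0):
    dests = lmw_destinations(rt)
    print(f"lmw r{rt}: {len(dests)} registers, "
          f"{conventional_allocation_cycles(len(dests), N)} cycles at N={N}")
```

For `lmw r26` this reports 6 registers and 2 cycles. An `lmw r0` would need 32 allocations, or 8 cycles at the assumed width.
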
Allocation logic 302 according to the present invention addresses the problems of allocating registers for instructions that have multiple result registers (complex instructions). It detects complex instructions at the time registers are allocated and, when it detects one, allocates registers in an alternative manner. In general, the alternative scheme speeds up allocation by allocating a block of registers large enough to hold at least some of the registers affected by the complex instruction. To ensure that enough registers are available for block allocation, allocation logic 302 divides the physical register pool into two pools. Registers for single-result-register instructions (simple instructions) are preferentially allocated from a main register pool. Blocks of registers are allocated to complex instructions from a reserved register pool.

Referring now to FIG. 4, physical register pool 306 is configured to include a first portion, identified as main register pool 404, and a second portion, identified as reserved register pool 402. In one embodiment, when allocating registers to simple instructions, allocation logic 302 preferentially allocates registers in main pool 404. This preference may be overridden, for example, when a simple instruction is dispatched and no register in main pool 404 is available. In general, however, result registers for simple instructions are allocated from main pool 404 using a conventional allocation algorithm. By allocating simple-instruction registers from main pool 404 whenever possible, allocation logic 302 keeps the registers in reserved pool 402 available for complex-instruction allocations.

Allocation logic 302 is configured to detect the dispatch of complex instructions. In one embodiment, a complex instruction is detected by comparing the opcode of the dispatched instruction against a predetermined set of complex-instruction opcodes. In this approach, the opcodes that trigger use of reserved pool 402 can be fixed by the designer. In other embodiments, allocation logic 302 includes a programmable table of opcodes that trigger block allocation and use of reserved pool 402. In one example, the LMW instruction can affect more registers than other instructions, so allocation logic 302 allocates registers in reserved pool 402 only when it detects an LMW instruction.

Besides detecting the dispatch of complex instructions, allocation logic 302 tracks the allocation of registers in physical register pool 306. When a complex instruction is detected, allocation logic 302 determines whether reserved pool 402 has enough registers to hold the registers affected by the complex instruction. If it does, allocation logic 302 allocates a block of physical registers in reserved pool 402 to the complex instruction. These physical registers are assigned to a block of logical registers. That block preferably includes at least some of the architected registers affected by the complex instruction.

In one embodiment, block allocation of registers in reserved pool 402 is initiated only if the complex instruction is an eligible complex instruction. Particular applications may justify restricting the block allocation mechanism to a subset of complex instructions. For example, conventional allocation logic may handle a complex instruction that affects the contents of only two registers more efficiently. In this embodiment, eligible complex instructions typically include instructions that can affect the contents of many registers, such as the LMW instruction of the PowerPC® instruction set. In one embodiment, the block allocation mechanism is simplified by allocating a predetermined, fixed block of architected registers each time an eligible complex instruction is detected. For example, when allocation logic 302 detects an LMW instruction, it block-allocates a predetermined set of architected registers regardless of which architected registers the instruction actually affects. In one implementation, physical register pool 306 may be partitioned so that 16 registers are in reserved pool 402 and the rest are in main pool 404. The "average" LMW instruction is assumed to require 16 or fewer register allocations. On that basis, when an LMW instruction is dispatched, allocation logic 302 automatically block-allocates architected registers R16 through R31 to reserved pool 402. In this embodiment, the LMW R26, EA instruction discussed above initially allocates architected registers R16 through R31 in reserved pool 402, even though registers R16 through R25 are not affected by the instruction. In some cases this allocates registers that are not needed. However, it greatly simplifies allocation because there is no need to determine precisely which registers a given instruction affects.

FIG. 4 illustrates the state of physical register pool 306 and mapping table 304 before an eligible complex instruction is dispatched.

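The dispatch-time decisions described above can be sketched as follows. They are: main-pool preference for simple instructions, opcode-based detection of eligible complex instructions, and the check that the reserved pool is free before a block allocation. The opcode table contents, the string opcodes, and all function names are hypothetical. The 16-register reserved pool at physical registers 0 through 15 follows the example in the text. `lswi` appears only as a multi-register opcode that this sketch's table happens not to list.

```python
RESERVED_BASE = 0          # reserved pool 402: physical registers 0..15 (per the example)
RESERVED_SIZE = 16
ELIGIBLE_OPCODES = {"lmw"}  # stand-in for a designer-fixed or programmable opcode table


def classify(opcode: str, num_dest_regs: int) -> str:
    """Choose an allocation path for a dispatched instruction."""
    if num_dest_regs <= 1:
        return "simple"        # one result register: conventional, main pool preferred
    if opcode.lower() in ELIGIBLE_OPCODES:
        return "block"         # eligible complex instruction: reserved-pool block
    return "conventional"      # complex but not eligible: allocate registers individually


def allocate_simple(main_free: list, reserved_free: list) -> int:
    """Prefer main pool 404; use reserved pool 402 only when the main pool is empty."""
    if main_free:
        return main_free.pop(0)
    if reserved_free:
        return reserved_free.pop(0)   # preference overridden
    raise RuntimeError("no free physical register; dispatch must stall")


def block_available(reserved_free: list) -> bool:
    """The fixed 16-register block needs every reserved-pool register to be free."""
    needed = set(range(RESERVED_BASE, RESERVED_BASE + RESERVED_SIZE))
    return needed.issubset(reserved_free)


print(classify("addi", 1), classify("lmw", 6), classify("lswi", 4))
# simple block conventional
```
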
In this figure, reserved pool 402 contains 16 physical registers, none of which is allocated when the complex instruction is dispatched. Main register pool 404 contains 64 registers (physical registers 16 through 79), one or more of which may be allocated. In the illustrated example, mapping table 304 indicates that architected register R17 is currently allocated to physical register 51 and that its value is stored in physical register 51.

Referring now to FIG. 5, physical register pool 306 and mapping table 304 are shown after dispatch of an eligible complex instruction, such as the LMW R26, EA instruction from the earlier example. In the illustrated example, allocation logic 302 detects the LMW instruction. It then determines that reserved pool 402 is available for block allocation (that is, the entire block is currently unallocated). Allocation logic 302 then automatically allocates architected registers R16 through R31 to contiguous physical registers in reserved pool 402 (physical registers 0 through 15 in the illustrated embodiment).

After allocating the physical registers in reserved pool 402, allocation logic 302 performs housekeeping on mapping table 304 and physical register pool 306. This accounts for registers that are included in the block but are not affected by the complex instruction. In the illustrated example, architected registers R26 through R31 are affected by the complex instruction. Once they are allocated in reserved pool 402, processor 101 therefore begins fetching values from memory (or cache) into the corresponding physical registers. Architected registers R16 through R25, however, are not affected by the complex instruction. Allocation logic 302 must determine whether any of these unaffected registers already has an allocation in main pool 404. In the illustrated example, architected register R17 was allocated to physical register 51 before the complex instruction was dispatched (as shown in FIG. 4).

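The transition from FIG. 4 to FIG. 5 can be modeled roughly as below. The fixed block R16 through R31 is mapped onto physical registers 0 through 15. Some architected registers in the block are not written by the instruction but already have a mapping elsewhere, such as R17 in physical register 51 in the example. The sketch reports each of these for housekeeping. It follows the remapping embodiment. The function name and return format are assumptions, and architected registers with no prior mapping are simply pointed at their reserved-pool register.

```python
def block_allocate(mapping: dict, affected: set,
                   fixed_block: tuple = tuple(range(16, 32)),
                   reserved_base: int = 0):
    """Map a fixed block of architected registers onto contiguous reserved registers.

    Returns the updated mapping and a housekeeping list of
    (architected_reg, old_physical, new_physical) for registers in the block
    that the instruction does not write but that were already mapped elsewhere.
    """
    new_mapping = dict(mapping)
    housekeeping = []
    for offset, arch in enumerate(fixed_block):
        phys = reserved_base + offset  # R16 -> P0, R17 -> P1, ..., R31 -> P15
        if arch not in affected and arch in mapping:
            housekeeping.append((arch, mapping[arch], phys))  # old value must be copied
        new_mapping[arch] = phys
    return new_mapping, housekeeping


# FIG. 4 state: R17 lives in physical register 51. Dispatch LMW R26, EA.
before = {17: 51}
after, todo = block_allocate(before, affected=set(range(26, 32)))
print(after[17], after[26], after[31])  # 1 10 15
print(todo)                             # [(17, 51, 1)]
```
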
In one embodiment, the block allocation performed when the complex instruction is dispatched updates mapping table 304 to map architected register R17 to physical register 1. In addition, the contents of physical register 51 are copied to physical register 1, assuming physical register 51 holds a valid value. If the result in physical register 51 is not yet valid, the copy to physical register 1 is deferred. Other embodiments suit applications in which timing constraints make remapping registers not affected by the complex instruction (such as architected register R17 in the example) too complex. In those embodiments, the unaffected registers are not remapped, so architected registers R16 through R25 from the previous example are not remapped into the reserved pool.

If a complex instruction requires more register allocations than the predetermined number, allocation logic 302 allocates the registers not covered by the block allocation in the conventional manner. After reserved pool 402 has been allocated, allocation logic 302 continues to allocate registers in main pool 404. Over time, the register allocations in reserved pool 402 are released or reclaimed, and the reserved pool again becomes available for subsequent block allocations. In one embodiment, allocation logic 302 may be configured to relocate reserved pool 402 so that it can better accommodate complex instructions when registers are scarce. In this embodiment, reserved pool 402 may comprise any contiguous block of physical registers. After a block allocation completes, the reserved-pool boundary is moved to any contiguous block of unallocated registers. If no such block exists, reserved pool 402 is moved to the contiguous block with the fewest allocated registers, and the registers already allocated within the newly defined reserved pool are reallocated.

Those skilled in the art having the benefit of this disclosure will appreciate that the present invention provides a simple way to allocate multiple registers for instructions that affect the contents of multiple registers, and is expected to improve microprocessor performance.

It is understood that the forms of the invention shown and described in the detailed description and drawings are to be taken merely as presently preferred examples. The following claims are intended to be interpreted broadly to embrace all variations of the preferred embodiments disclosed.
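
The overflow and reserved-pool relocation policies described earlier in this section can be sketched as follows. The first function splits an instruction's destinations into the registers covered by the fixed block and the overflow that is allocated conventionally from the main pool. The second function picks the next reserved window: the first fully free contiguous window, or failing that, the first window with the fewest allocated registers. The boolean allocation vector, the first-fit tie-breaking, and the function names are assumptions. Moving already-allocated registers out of the new window is not modeled.

```python
def split_for_block(dest_regs: list, fixed_block: range = range(16, 32)):
    """Return (block-covered registers, overflow registers for the main pool)."""
    covered = [r for r in dest_regs if r in fixed_block]
    overflow = [r for r in dest_regs if r not in fixed_block]
    return covered, overflow


def relocate_reserved_pool(allocated: list, size: int) -> int:
    """Pick the start index of the next reserved window of `size` contiguous registers.

    Prefers the first window with no allocated registers; otherwise returns the
    first window with the fewest allocated registers (those would then have to
    be reallocated elsewhere).
    """
    if size > len(allocated):
        raise ValueError("window larger than the physical register pool")
    best_start, best_count = 0, size + 1
    for start in range(len(allocated) - size + 1):
        count = sum(allocated[start:start + size])  # True counts as 1
        if count == 0:
            return start                            # fully free contiguous block
        if count < best_count:
            best_start, best_count = start, count
    return best_start


# LMW R10, EA writes R10..R31: R16..R31 come from the block, R10..R15 overflow.
covered, overflow = split_for_block(list(range(10, 32)))
print(overflow)  # [10, 11, 12, 13, 14, 15]

T, F = True, False
print(relocate_reserved_pool([T, T, T, T, F, T, F, F, T, F, T, T], size=4))  # 4
```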

Claims

1. A method of processing instructions in a microprocessor, comprising: determining whether a dispatched instruction is a complex instruction requiring multiple register allocations; if the instruction requires only a single register allocation, allocating a physical register for the dispatched instruction in a first portion of a register pool; and if the instruction requires multiple register allocations, allocating a set of physical registers in a second portion of the register pool as a block.
2. The method of claim 1, wherein the instruction requires multiple register allocations if the instruction affects the contents of multiple general-purpose registers.
3. The method of claim 1, wherein the block allocation is initiated only if the instruction requiring multiple register allocations is an eligible instruction.
4. The method of claim 3, wherein the eligible instructions include a load multiple word (LMW) instruction.
5. The method of claim 1, wherein the block-allocated set of physical registers corresponds to a set of architected registers, and further wherein the set of architected registers is independent of the architected registers affected by the instruction.
6. The method of claim 5, further comprising, if the architected registers affected by the instruction exceed the number of registers in the second portion of the register pool, allocating a portion of the architected registers affected by the instruction in the first portion of the register pool.
7. The method of claim 1, further comprising, after allocating the second portion of the register pool in response to dispatch of an instruction requiring multiple register allocations, selecting a third portion of the register pool as a reserved pool for allocating registers to subsequent complex instructions.
8. A microprocessor, comprising: a dispatch unit suitable for forwarding instructions received from an instruction memory to an issue unit; a physical register pool; and allocation logic configured to determine whether an instruction is a complex instruction requiring multiple register allocations, and further configured to preferentially allocate registers in a first portion of the register pool to the instruction if the instruction requires only a single register allocation, and to allocate a set of registers in a second portion of the register pool as a block if the instruction is a complex instruction.
9. The processor of claim 8, wherein the instruction requires multiple register allocations if the instruction affects the contents of multiple general-purpose registers.
10. The processor of claim 8, wherein the block allocation is initiated only if the complex instruction is an eligible instruction.
11. The processor of claim 8, wherein the eligible complex instructions include a load multiple word (LMW) instruction.
12. The processor of claim 8, wherein, if a complex instruction is detected, the allocation logic block-allocates a set of physical registers corresponding to a set of architected registers, and further wherein the set of architected registers is independent of the architected registers affected by the instruction.
13. The processor of claim 12, wherein the allocation logic is further configured, if the architected registers affected by the instruction exceed the number of registers in the second portion of the register pool, to allocate a portion of the architected registers affected by the instruction in the first portion of the register pool.
14. The processor of claim 8, wherein, after allocating the second portion of the register pool in response to dispatch of an instruction requiring multiple register allocations, the allocation logic selects a third portion of the register pool as a reserved pool for allocating registers to subsequent complex instructions.
15. A data processing system including a processor, a memory, an input device, and a display, wherein the microprocessor comprises: a dispatch unit suitable for forwarding instructions received from an instruction memory to an issue unit; a physical register pool; and allocation logic configured to determine whether an instruction is a complex instruction requiring multiple register allocations, and further configured to preferentially allocate registers in a first portion of the register pool to the instruction if the instruction requires only a single register allocation, and to allocate a set of registers in a second portion of the register pool as a block if the instruction is a complex instruction.
16. The data processing system of claim 15, wherein the instruction requires multiple register allocations if the instruction affects the contents of multiple general-purpose registers.
17. The data processing system of claim 15, wherein the block allocation is initiated only if the complex instruction is an eligible instruction.
18. The data processing system of claim 15, wherein the eligible complex instructions include a load multiple word (LMW) instruction.
19. The data processing system of claim 15, wherein, if a complex instruction is detected, the allocation logic block-allocates a set of physical registers corresponding to a set of architected registers, and further wherein the set of architected registers is independent of the architected registers affected by the instruction.
20. The data processing system of claim 19, wherein the allocation logic is further configured, if the architected registers affected by the instruction exceed the number of registers in the second portion of the register pool, to allocate a portion of the architected registers affected by the instruction in the first portion of the register pool.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US67186900A 2000-09-28 2000-09-28

Publications (1)

Publication Number Publication Date
TW536685B true TW536685B (en) 2003-06-11

Family

ID=29270969

Family Applications (1)

Application Number Title Priority Date Filing Date
TW90123583A TW536685B (en) 2000-09-28 2001-09-25 Group rename allocation for multiple register instructions

Country Status (1)

Country Link
TW (1) TW536685B (en)

Similar Documents

Publication Publication Date Title
US6728866B1 (en) Partitioned issue queue and allocation strategy
US5870582A (en) Method and apparatus for completion of non-interruptible instructions before the instruction is dispatched
US6826704B1 (en) Microprocessor employing a performance throttling mechanism for power management
US5721855A (en) Method for pipeline processing of instructions by controlling access to a reorder buffer using a register file outside the reorder buffer
US5463745A (en) Methods and apparatus for determining the next instruction pointer in an out-of-order execution computer system
US5751983A (en) Out-of-order processor with a memory subsystem which handles speculatively dispatched load operations
US8683180B2 (en) Intermediate register mapper
JP3093639B2 (en) Method and system for tracking resource allocation in a processor
EP0649085B1 (en) Microprocessor pipe control and register translation
US7093106B2 (en) Register rename array with individual thread bits set upon allocation and cleared upon instruction completion
US20120265971A1 (en) Allocation of counters from a pool of counters to track mappings of logical registers to physical registers for mapper based instruction executions
US6697939B1 (en) Basic block cache microprocessor with instruction history information
US5898854A (en) Apparatus for indicating an oldest non-retired load operation in an array
KR19980079702A (en) A method of delivering the result of a store instruction and a processor implementing the same
JPH0816870B2 (en) System for draining the instruction pipeline
TW475149B (en) Secondary reorder buffer microprocessor
US6928533B1 (en) Data processing system and method for implementing an efficient out-of-order issue mechanism
US5717882A (en) Method and apparatus for dispatching and executing a load operation to memory
US6378062B1 (en) Method and apparatus for performing a store operation
JP2682812B2 (en) Operation processing system and method
JP3600467B2 (en) Data processing system and method for out-of-order logic condition register processing
JPH10283179A (en) Data processing system and method for judging instruction order with instruction identifier
KR100310798B1 (en) Concurrent execution of machine context synchronization operations and non-interruptible instructions
US5694553A (en) Method and apparatus for determining the dispatch readiness of buffered load operations in a processor
US5875326A (en) Data processing system and method for completing out-of-order instructions

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees