TW200411565A - Processor with demand-driven clock throttling power reduction
- Publication number: TW200411565A
- Authority: TW (Taiwan)
Classifications
- G06F9/3869: Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
- G06F1/3203: Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3237: Power saving characterised by the action undertaken by disabling clock generation or distribution
- G06F1/3243: Power saving in microcontroller unit
- G06F1/3287: Power saving characterised by the action undertaken by switching off individual functional units in the computer system
- G06F9/3802: Instruction prefetching
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Description
[Technical Field]
The invention relates to reducing and controlling the power consumption of a microprocessor or system composed of a plurality of clocked components or units.
[Prior Art]
Advances in semiconductor technology and chip fabrication have led to steady growth in on-chip clock frequency, in the number of transistors on a single chip, and in the die size itself, accompanied by a corresponding decrease in chip supply voltage. In general, the power consumed by a clocked unit grows linearly with its switching frequency. Therefore, although supply voltage has dropped, chip power consumption continues to rise. As chip power grows, cooling and packaging costs rise as well. For low-end systems (for example, handheld and portable systems), where battery life matters, power consumption must be reduced without degrading performance to an unacceptable level. The increase in microprocessor power consumption has thus become a major obstacle to further performance improvement.
A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands, and each operand is a single data value. A scalar processor may be pipelined so that several instructions are in process at once while the single-issue paradigm is preserved.
A superscalar processor can fetch, issue, and execute multiple instructions in a given machine cycle. In addition, each instruction's fetch, issue, and execute path is usually pipelined to enable further concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium Pro (P6) processor family from Intel Corporation, the UltraSPARC processors from Sun Microsystems, the PA-RISC processors from Hewlett-Packard Company (HP), and the Alpha processor family from the former Compaq Corporation (now merged with HP).
A vector processor is typically pipelined and can perform an operation on an entire array of numbers in a single architectural step or instruction. For example, a single instruction can add each entry of array A to the corresponding entry of array B and store the result in the corresponding entry of array C. Vector instructions are usually supported as an extension of a base scalar instruction set. Only those code sections of a larger application that can be vectorized are executed on the vector engine. The vector engine can be a single, pipelined execution unit, or it can be organized as an array or single instruction multiple data (SIMD) machine, with multiple identical execution units executing the same instruction concurrently on different data. For example, Cray supercomputers are, in general, vector processors.
A synchronously clocked processor or system has a single global master clock that drives all of the units or components that make up the system. Ratioed derivatives of this clock may cycle particular sub-units faster or slower than the master clock frequency. Such clocking decisions are usually predetermined and preset statically by design. For example, the integer pipe of the Intel Pentium 4 processor is clocked twice as fast as the chip master clock, using the double-pumping or wave-pipelining techniques well known to those skilled in the art. Such clock-doubling techniques increase processor execution speed and performance. However, the speeds of buses and off-chip memory have not kept pace with the speed of the processor's computing logic core. Consequently, most state-of-the-art processors have off-chip buses and caches that operate at frequencies that are integral sub-multiples of the main processor clock frequency. These clock operating frequencies are usually fixed when the system is designed. This is why a current-generation processor complex may have multiple clock rates. Double pumping and wave pipelining are sometimes also used in higher-end machines to mitigate the performance mismatch between the processor and external buses or memory.
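The background's statement that power consumption rises linearly with switching frequency is consistent with the standard first-order CMOS dynamic power relation P = a * C * V^2 * f, which the patent text itself does not spell out. The sketch below uses purely hypothetical operating points (the activity factor, capacitance, voltages, and frequencies are all assumptions) to show how power can still rise when supply voltage falls: a fourfold frequency increase outweighs a drop from 2.5 V to 1.8 V.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS dynamic power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating points, chosen only for illustration.
old = dynamic_power(activity=0.5, capacitance_f=1e-9, vdd_v=2.5, freq_hz=100e6)
new = dynamic_power(activity=0.5, capacitance_f=1e-9, vdd_v=1.8, freq_hz=400e6)

# Doubling frequency alone doubles power (the relation is linear in f).
doubled = dynamic_power(activity=0.5, capacitance_f=1e-9, vdd_v=2.5, freq_hz=200e6)

print(f"old={old:.4f} W, new={new:.4f} W, ratio={new / old:.4f}")
```

With these assumed numbers the newer operating point draws roughly twice the power of the older one despite the lower supply voltage.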
"Low Power Design Methodologies," edited by Jan M. Rabaey and Massoud Pedram (Kluwer Academic Publishers, 1996), describes power reduction using synchronous clock gating, in which the clock can be stopped at a regeneration point, in a local clock buffer (LCB) feeding a particular chip region, component, or latch. At a coarse level of control, clocks are gated along functional boundaries. At a fine level of control, clocks are gated at individual latches. For example, Gerosa et al., "A 2.2 W, 80 MHz superscalar RISC microprocessor," IEEE Journal of Solid-State Circuits, December 1994, pp. 1440 to 1454, describe gating the clocks of different execution units according to the instructions dispatched and executed in each cycle.
Coarse-grain, unit-level clock gating is advantageous when the processor executes a sequence of instructions of a single functional class (for example, integer-only or floating-point-only instructions). When the input workload is such that the processor sees only integer code, the clock regenerators of the floating-point unit can be disabled. Likewise, during a floating-point-only phase, the clocks of the integer unit can be disabled. This can save a considerable amount of chip power. Coarse idle control is usually implemented locally, either in software through serializing instructions or in hardware that detects idle periods. Fine idle control is usually implemented locally by avoiding unnecessary propagation of invalid or inconsequential data, as determined during instruction decode. A causal flow of gating information, from the starting point to downstream stages or units, is called feed-forward flow. Such a flow may include loops with apparent backward flow, yet the cause-to-effect information flow is still regarded as a feed-forward process. Coarse idle control and fine idle control are therefore both self-triggered and feed-forward.
On the other hand, using downstream pipeline stall signals to regulate the feed-forward flow constitutes a feedback control system. Here, control information flows from the downstream "effect" to the upstream "cause." Coarse-grain and fine-grain stall control is used mainly to prevent valid, stalled data from being overwritten in a pipelined processor, but these mechanisms can also be used to save power. For example, Jacobson et al., "Synchronous interlocked pipelines," IEEE ASYNC-2002 conference, April 2002, propose a fine-grain stall propagation mechanism for reducing power in synchronous pipelines. That mechanism supplements the more traditional fine-grain, feed-forward clock gating mechanism that uses "valid" bits, as described by Gerosa et al. above; see also Gowan et al., "Power considerations in the design of the Alpha 21264 microprocessor," Proceedings of the 1998 ACM/IEEE Design Automation Conference, pp. 726 to 731 (June 1998). Unlike the present invention, the fine-grain stall gating (feedback) mechanism published by Jacobson et al. is not used to control the rate of information flow (through clock or bus bandwidth throttling).
Coarse idle control raises at least two problems that must be solved. First, large transient current drops and gains can produce unacceptable levels of inductive (L di/dt) noise on the on-chip supply voltage. Second, gating on and off requires extra cycles to maintain correct functional operation. For fine-grain phase changes in the workload, switching too frequently between gated and enabled modes can cause unacceptable performance degradation.
Moreover, state-of-the-art fine idle control relies on locally generated gating signals or conditions for pipeline-stage-level clock gating, for example based on data-invalid or inconsequential-operand conditions. These state-of-the-art methods do not generate gating signals based on prediction or anticipation. Timing requirements are therefore often critical, because the gating signal must be available ahead of assertion and must be asserted for a suitable duration for error-free clock gating operation. Gowan, M. K., Biro, L. L., and Jackson, D. B., "Power considerations in the design of the Alpha 21264 microprocessor," Proceedings of the 1998 ACM/IEEE Design Automation Conference, pp. 726 to 731 (June 1998), explain how such constraints can greatly complicate design timing analysis and can even degrade clock frequency performance.
Whether the underlying control mechanism is feed-forward (cause-effect flow) or feedback-based (effect-cause flow), state-of-the-art clock gating techniques, coarse or fine, provide spatial control only. This is because utilization information is used to eliminate redundant clocking in the affected regions without regard to the temporal activity or history of those regions or of other regions of the machine. Activity states and events in downstream (consumer) units and stages (such as execution pipes or issue queues) are not fed back to throttle the clock or information flow rate of upstream (producer), non-adjacent regions (such as instruction fetch or dispatch units). Likewise, activity states and events in upstream producer regions are not fed forward to throttle the clocks or information flow rates of downstream consumers. Moreover, gating a clock signal is generally all or nothing: the clock is either enabled or it is not.
Thus, there is a need for improved clock control of connected units that operates at fine spatial and temporal granularity without causing (additional) performance degradation or large current/voltage swings in the underlying circuits.
[Summary of the Invention]
An object of the invention is to reduce processor power consumption without appreciable performance loss.
The invention is a synchronous integrated circuit such as a scalar processor or a superscalar processor. Circuit components or units are clocked by, and synchronized to, a common system clock. At least two of the clocked units include multiple register stages, for example pipeline stages. A local clock generator in each clocked unit combines the common system clock with the stall status of one or more other units to adjust the register clock frequency up or down.
[Embodiments]
Referring now to the drawings, and more particularly, FIG. 1 shows a high-level block diagram of an example of a typical state-of-the-art pipelined scalar processor 100 and a corresponding instruction timing diagram. The main functional data path is divided into two major components or units, commonly called the instruction unit (I-UNIT) 102 and the execution unit (E-UNIT) 104. Many of the sub-units and functional logic within I-unit 102 and E-unit 104 (such as branch prediction logic) are not described here; they are omitted to keep the description of the overall clock control of processor 100 simple. Processor 100 of this example is representative of a pipelined scalar design that does not include clock gating to prevent redundant clocking of units, sub-units, or the storage resources they contain.
On-chip storage accessed by one or both of units 102 and 104 includes a register file (REGFILE) 106, an instruction cache (ICACHE) 108, and a data cache (DCACHE) 110. REGFILE 106 is generally a shared resource accessible from both units 102, 104 and is therefore treated as a separate entity. ICACHE 108 is the first (terminating) stage of the I-unit pipe and is therefore considered part of I-unit 102. DCACHE 110 is normally accessed only from E-unit 104 and is therefore considered part of E-unit 104. Two separate local clock buffers (LCBs) 112, 114 each amplify and distribute a common synchronous clock 115 to a corresponding one of the units 102, 104. Each unit 102, 104 includes an input queue 116I, 116E and a pipe 118I, 118E. As needed, I-unit 102 and E-unit 104 may contain further levels of LCBs 112, 114 for finer distribution, amplification, and control of the common system clock.
Computer program instructions are contained in ICACHE 108, the first stage of the I-unit pipe.
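The coarse-grain, unit-level gating described in the prior-art discussion above (disable the floating-point clocks during integer-only phases, and the reverse) can be illustrated with a toy cycle count. This is only a sketch under stated assumptions: the function name, the two-unit machine, and the one-instruction-per-cycle model are hypothetical, and the extra cycles that the text says are needed to switch gating on and off are deliberately ignored.

```python
def clocked_cycles(instr_classes, gate_idle_units):
    """Count clock cycles delivered to each unit, one instruction per cycle.

    Toy model: ignores the extra cycles needed to switch gating on and off.
    """
    delivered = {"int": 0, "fp": 0}
    for cls in instr_classes:
        for unit in delivered:
            # Without gating every unit is clocked every cycle;
            # with gating only the unit that matches the instruction is.
            if not gate_idle_units or unit == cls:
                delivered[unit] += 1
    return delivered

# Hypothetical integer-heavy program phase.
stream = ["int"] * 8 + ["fp"] * 2
ungated = clocked_cycles(stream, gate_idle_units=False)
gated = clocked_cycles(stream, gate_idle_units=True)
print("ungated:", ungated)
print("gated:  ", gated)
```

In this assumed stream the floating-point unit receives 2 clock cycles instead of 10 when idle units are gated, which is the kind of saving the coarse-grain schemes target.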
兩又而吕,在可能引起ICACHE ,的:種^件下,ICACHE 108可暫停可變數目的處理器 週期4暫停隱含適應未中對先前指令傳送級(即該指令記 憶體層級的較低級別)的影響。 外該同步時鐘連續驅動各單元通過lcb ιΐ2、ιΐ4,而不論 管線中的暫停。切換電容調變及圖案位元變化引起電力消 耗變化’但是變化範圍很小。因&,從程式執行的開始到 、.σ束各日守知週期大致消耗了同樣的能量(此處以標準能量 單位表示)。因此,藉由使用電力節省技術,尤其是根據本 發明之電力節省方法可節省相當的能量。 在編碼產生期間,藉由插入包括於指令組結構中的特別 指令的編譯器,可同步化粗閒置控制,或者,可由作業系 統動怨發出該等指令,例如當實施一特別中斷或在某背景 切換時。在最粗略控制級別,可能發出一特別睡眠型指令 或命令,該特別睡眠命令可產生一停用信號,將晶片選取 部分的時鐘停止一段時間。該同一睡眠命令也可用於停用 指令擷取程序。同樣地,當該停用信號變負後,或在睡眠 週期過後,一隱含喚醒開始,或可用一明示非同步中斷實 現喚醒。如技術人士所熟知,可藉由在LCB層級之各級別 選擇性地停用的時鐘分配樹來提供各種省電模式(如小睡 、淺睡或睡眠)。在下一級別更細的粒度,每當該編譯器能 86560 -13 - 夠靜態預計計算相 始閑閉特定單元的心 插入特別指令,開 、寸1里,如浮點單元的時鐘。 可包括一自我偵測機制,當一單元、· 許其停用自己的時鐘叶 x自己閒置時,允 測處理ϋ的局部:日硬體中,可設計邏輯偵 些或全部閒置區域此果可觸發停用某 # ^ ξι! ^ ^ — 、里。喚S生也疋根據停用或睡眠單亓 接收到的新任務以類似方式自我啟動的。 民早- 鐘:例二1 :控制’動態定義信號逐個週期地閘控局部時 決〜可,、左田作:超純量機器,該處理器在指令解碼期間 ’、疋σ ik後執行週期時鐘、^ ^ 在具有η戈床八山 刀恥早兀&逞。该方式 念、,一序出機制的處理11内運作良好,因此,可明 確並提前足夠時間『g 月 八“ 才間(即在解石馬或調度時)做出閉控決定。若於 於次序混亂的發出問心宁列’則即使對 也可在务出時間產生該等閘控信 祝0 在任何管線資料路徑,可動態偵測並選擇性地防止冗餘 計時’例如’沿該邏輯管線傳播一資料有效旗標,僅當產 生於:週期的f料有效時才設定該資料有效旗標。然後, ,邏輯級的資料有效旗標可用作—時鐘致動錢,用於設 定該級輸出鎖存哭。Ll_ 一 ^ 因此,在該等隨後管線級無須為無效 資料計時,該等管線級可稱為基於細粒、有效位元的管線 級級別時鐘閘控。 年月12日杈予Sproch等人的美國專利6,247,1342 B1號「節省電力運作管線電路中的管道級閘控方法與系統」 86560 200411565 說:了-種處理器,其邏輯將在由該邏輯之第_級實施的 一 w週期計算中不會在管線中變化的任何新收到的運蓄元 辨識為無關的。將此種不變條件信號偵測為無關可用:停 用該第-級的時鐘,然後連續停用隨後各級的時鐘。、τ ◦hnishi, M.,Yamada,a” N〇da,Η 與^^,丁在 dm年 低電力電子產品及設計國際研討會(Int· >ρ_ % L⑽ P〇wer Electronics and Design ; ISLpED)刊物上的論文「以 級別設計之冗餘計時偵測與電力減低之方法」(第HI至 136頁)中說明了其他更詳細的防止各種冗餘鎖存器計時的 4貞測機制。 如上所述,不論粗或細控制,該等最新技術閒置控制機 制均,自我觸發、空間或前饋的,即閘控條件或信號係根 據對單元閒置或無效狀態的㈣而局部產±,並在有該等 無效或閒置狀態位元時避免不必要的逐個週期時鐘和資料 傳播。一單元可為一整個區域或功能單元,或可為一管線 級鎖存器組。 相反地,在-第—項具體實施例中,—具有以要求驅動 的時鐘調節的純量管線處理器與一與一執行單元 ㈣ecution unit ; E_unit)可調節運作的一指令單元 (instruction unit ; ^仙⑴在該兩個單元之間建立了生產者與 消費者的關係。生產者I單元轉送現有資料致動指令至該執 行單元,以在其中以不快於該執行單元可接受的速率處理 。各單元保持具有至少1位元資訊的一活動狀態暫存器。在 該項具體實施例中,該PE單元對係由一共同同步時鐘計時 86560 -15- 200411565 。但是’各單元料鐘係根據在單元之間傳遞的局部單元 活動資訊進行局部修改及控制。各單元的局部時鐘控制可 為该局部單元及遠端單元f σ 早兀(即兩個早兀)之活動狀態資訊的 功能。 圖:顯示根據本發明之以要求或活動驅動電力控制純量 處理為120之較佳具體實施例的高位準範例,其元件與圖1八 相同標識的元件相同。在該項具體實施财,處理單元122 
124各包括活動監視及時鐘控制邏輯I〗。Kg,監視單元 活動級別。在一簡化具體實施例中,一單一活動狀態位元 13〇係一暫停位元,指示暫停/非暫停條件。當e_單元124感 應到暫ir條件時(當前的或迫近的)時,其即判定該暫停位 元130 。亥暫彳τ位元130係用於調低^單元時鐘clkj 132的 時鐘速度,以調低!-單元丨22,並因此減低(或切斷)至£_單 元124的指令速率。根據控制粒度,&單元活動狀態或暫停 位兀130可在E_單元124内調節其自己的時鐘,例如1^。當 單元124暫停結束後,CLK_i ^被調升至其正常時鐘速 率同樣地,當L單元122偵測到暫停條件時(如一 ICAcHE 1〇δ未中),數個週期後^管道1181為空,因而沒有給單元 124的貧訊,則一 ^管道空位元136係用於調低clk_e 138 乂節省電力。活動監視及時鐘控制邏輯126、128的該上/下 凋節此力可使用多種不同方法實施,下文將提供數個範例。 例如,各單元時鐘可一次退出一管線級。同樣地,當閘 拴條件、’’σ束日^,該單元時鐘又一次返回一管線級。該控制 邏輯允卉各單元時鐘及時退出或返回,而不失去有效資訊 86560 -16- 200411565 ,也不增加介面邏輯、緩衝,也沒有浪費能量的管線佔用 與再循環。或者,可放慢或逐步降低單元時鐘頻率以節省 龟力而且’若/根據需要,時鐘可隨後放慢至停止。 圖3顯示活動監視與時鐘控制邏輯懸置電路的一第一範 例’其中一閘控控制移位暫存器(gating c〇ntr〇1 shift register,GCSR) 150使系統時鐘115傳遞至ι_管道}} 81的個別 級152。GCSR 150係被計時1位元正反器154的一 “立元線性 移位暫存器,該正反器對應於各卜管道級152。系統時鐘 115係AND於各級,在AND閘極156之一中具有gcsr 15〇 之各位元,選擇性地將系統時鐘115當作一 AND閘極輸出傳 遞至一對應I-管道級152,共同構成1_(:]^〖158。在該範例中 ,通常沒有偵測E-單元暫停,暫停位元13〇保持為低,並由 反相器160反轉。藉由GCSR 15〇的所有i^管道1181處於 全調節。因此,AND閘極156將未改變的系統時鐘115當作 I-CLK 158傳遞至I-管道級152的各級,各等同於系統時鐘 11 5。只要E-單元暫停位元13〇保持為〇,一 i即移位至qcSR 150,藉此將I-管道1181保持於全調節。 在此範例中,當E-單元活動監視與時鐘控制邏輯(分別為 圖2的1 24與1 28)感應到一迫近的暫停條件時,其即判定 單兀暫停位兀130。該暫停位元13〇由反相器16〇反轉,向 GCSR 150提示一〇,該〇隨後在下一系統時鐘週期被移位至 GCSR 150。在同一週期,該第一 j•管道級(即自圖2的 ICACHE 116)被停用,暫停對I-管道丨181的資料輸入,同時 將0移位至並通過GCSR 150。只要暫停位元13〇被判定,〇 86560 -17- 200411565 移位至GCSR 150 ’每系統時鐘週期一次。各零隨後停止向 本項範例中自左向右連波的A N D開極i 5 6之後續傳遞丰^ 時鐘。因此’ H道時鐘158的—級在某時從左至右問閉、。 於是,在判定暫停位元13G之前,有效丨_管道項目繼續 I-官逞1181亚進入E-單元。為避免在[管道U8i遺失有效資 訊’至少當可用EH緩衝器(仵列U6E)空間等於!_管二 1181中的有效項目數時,必須判^單元暫停位。因 此’至少S可用E-單元緩衝器(仔列U6E)空間等於【_管道 ⑽中的有效項目數時,必須判定單元暫停位元⑽。: 極端情況下,後者直接等於!_管道1181的長度。目此,在傳 先。又D十中每田E-單元仵列i i 6E填充至使可用(空白)件列項 目數等於I_f道級之數目時,E·單元124即判定暫停位元 13〇。E-單元佇列116E中的空白或無效項目係假定用預輸入 有效位元(為節省電力)進行時鐘問控’與傳統細粒、級-級 別時鐘閘控一樣。Two again, under the conditions that may cause ICACHE, ICACHE 108 may suspend a variable number of processor cycles. 4 The suspension implies an adaptation failure to the previous instruction transfer level (ie, the lower level of the instruction memory level). )Impact. 
In addition, the synchronous clock continuously drives each unit through lcd ι2, ιΐ4, regardless of the pause in the pipeline. Switching capacitor modulation and pattern bit change cause power consumption changes' but the range of change is small. Because of &, from the beginning of the program execution to the .σ beam, the monitoring period consumes approximately the same energy every day (here it is expressed in standard energy units). Therefore, considerable energy can be saved by using power saving techniques, especially the power saving method according to the present invention. During code generation, coarse idle control can be synchronized by a compiler that inserts special instructions included in the instruction set structure, or these instructions can be issued by the operating system, such as when implementing a special interrupt or in a certain context When switching. At the coarsest control level, a special sleep-type instruction or command may be issued. The special sleep command may generate a disable signal to stop the clock of the selected part of the chip for a period of time. This same sleep command can also be used to disable instruction fetching. Similarly, when the disable signal becomes negative, or after the sleep cycle, an implicit wake-up is initiated, or an explicit asynchronous interrupt can be used to achieve wake-up. As is well known to those skilled in the art, various power saving modes (such as nap, light sleep, or sleep) can be provided by a clock distribution tree that is selectively disabled at each level of the LCB level. At the next level of finer granularity, whenever the compiler can 86560 -13-enough to statically predict the calculation phase, start and close the heart of a particular unit. Insert special instructions, open and close 1 inch, such as the clock of a floating point unit. It may include a self-detection mechanism. 
When a unit is allowed to deactivate its own clock leaf x and idle, it is allowed to measure the part of the processing: In the hardware, you can design logic to detect some or all idle areas. Trigger deactivation of a # ^ ξι! ^ ^ —, 里. Calling students also self-initiated in a similar manner based on new tasks received based on deactivation or sleep orders. Min Zao-Zhong: Example 2: 1: Control 'Dynamic Definition Signals' Periodic Gating of Local Timings ~ Yes, Zuo Tian Zuo: Ultra-scalar machine, the processor executes cycles after instruction decoding, 疋 σ ik Clock, ^ ^ in the early days with η go bed Yayama sword shame & 逞. This method reads that the processing of the first-order mechanism works well within 11, so it can be clear and enough time in advance "g month eight" to make a closed control decision (that is, when the calculus or dispatch). If Yu Yu Issue out of order in a chaotic order, then you can generate these gate control letters even at the time of the transaction. Even in any pipeline data path, you can dynamically detect and selectively prevent redundant timing 'for example' along the logical pipeline. Propagate a data valid flag, and set the data valid flag only when it is generated from the periodic data of f. Then, the logical level data valid flag can be used as a clock actuation money to set the level The output latch is crying. Ll_ ^ Therefore, in these subsequent pipeline stages, there is no need to time the invalid data. These pipeline stages can be referred to as fine-grained, effective bit-based pipeline-level clock gating. Sproch et al., US Patent No. 6,247,1342 B1, "Pipe-Level Gating Method and System in Power-Saving Operation Pipeline Circuits" 86560 200411565 states: a processor whose logic will be implemented at the _th level of the logic One w period Does not count changes in the pipeline received any new operational storage yuan recognized as independent. 
Detecting such an invariance condition stops the clock of the first stage and then successively deactivates the clocks of the subsequent stages. Ohnishi, M., Yamada, A., Noda, H., et al., "A Method of Redundant Clocking Detection and Power Reduction at RT Level Design," in the Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), pages 131 to 136, describe other, more elaborate idle detection mechanisms that prevent various kinds of redundant latch clocking.

As noted above, whether coarse or fine grain, these state-of-the-art idle control mechanisms are self-triggered, spatial, or feed-forward; that is, the gating condition or signal is generated locally from the idle or invalid status of the unit, and it avoids unnecessary cycle-by-cycle clocking and data propagation while such invalid or idle status bits are present. A unit can be an entire region or functional unit, or it can be a pipeline-stage latch group.

In contrast, in a first embodiment, a scalar pipelined processor with demand-driven clock throttling has an instruction unit (I-unit) and an execution unit (E-unit) that operate under throttle control. A producer-consumer relationship is first established between the two units. The producer I-unit forwards ready instructions to the E-unit for processing at a rate no faster than the E-unit can accept. Each unit maintains an activity status register holding at least one bit of information. In this embodiment, the I-unit/E-unit pair is clocked by a common synchronous clock. However, the clock of each unit is locally modified and controlled according to local unit activity information communicated between the units.
The local clock control of each unit can be a function of the activity status information of both the local unit and the remote unit (that is, of both units).

FIG. 2 shows a high-level example of a preferred embodiment scalar processor 120 with demand- or activity-driven power control according to the present invention, in which elements identical to those of FIG. 1 are labeled identically. In this embodiment, the units 122, 124 each include activity monitoring and clock control logic 126, 128 that monitors the unit's activity level. In a simplified embodiment, a single activity status bit 130 is a pause bit indicating a pause/no-pause condition. When the E-unit 124 senses a pause condition (current or imminent), it asserts the pause bit 130. The pause bit 130 is used to throttle down the unit clock CLK_I 132 of the I-unit 122 and thereby reduce (or cut off) the instruction rate to the E-unit 124. Depending on the control granularity, the E-unit activity status or pause bit 130 can also throttle the E-unit 124's own clock. When the pause condition in the E-unit 124 ends, CLK_I 132 is raised back to its normal clock rate. Similarly, when the I-unit 122 detects a pause condition (for example, an ICACHE miss), the I-pipe 118I is empty after a few cycles and nothing is forwarded to the E-unit 124, so the pipeline-empty bit 136 is used to throttle down CLK_E 138 to save power. This up/down throttling by the activity monitoring and clock control logic 126, 128 can be implemented in a variety of ways, and several examples are provided below. For example, each unit clock can be withdrawn one pipeline stage at a time. Likewise, when the pause condition ends, the unit clock is restored one pipeline stage at a time.
This control logic allows all unit clocks to be withdrawn or restored in time, without losing valid information, without additional interface logic or buffers, and without wasting energy on pipeline hold and recirculation. Alternatively, the unit clock frequency can be slowed or stepped down gradually to save power and, if or as required, the clock can then be slowed to a stop.

FIG. 3 shows a first example of an activity monitoring and clock control logic suspend circuit, in which a gated control shift register (GCSR) 150 gates the system clock 115 to the individual stages 152 of the I-pipe 118I. The GCSR 150 is a linear shift register made up of 1-bit flip-flops 154, one corresponding to each pipeline stage 152. At each stage, an AND gate 156 ANDs the system clock 115 with one bit of the GCSR 150 and selectively passes the system clock 115, as the AND gate output, to the corresponding I-pipe stage 152; together these outputs constitute I-CLK 158. In this example, no E-unit pause is normally detected, so the pause bit 130 remains low and is inverted by inverter 160. The GCSR 150 therefore holds the whole I-pipe 118I at full throttle, and the AND gates 156 pass the unmodified system clock 115 as I-CLK 158 to every I-pipe stage 152, each stage clock being equivalent to the system clock 115. As long as the E-unit pause bit 130 remains at 0, a 1 is shifted into the GCSR 150, keeping the I-pipe 118I at full throttle. When the E-unit activity monitoring and clock control logic (124 and 128, respectively, in FIG. 2) senses an imminent pause condition, it asserts the pause bit 130. The pause bit 130 is inverted by the inverter 160, presenting a 0 to the GCSR 150, which is shifted into the GCSR 150 on the next system clock cycle.
In the same cycle, the first I-pipe stage (i.e., ICACHE 116 from FIG. 2) is deactivated, and data input to the I-pipe 118I is suspended while the 0 shifts into and through the GCSR 150. As long as the pause bit 130 remains asserted, a 0 is shifted into the GCSR 150 on every system clock cycle. Each 0 then stops the clock at successive AND gates 156, from left to right in this example. The stages of I-CLK 158 are thus turned off in succession from left to right, and valid data that entered before the pause bit 130 was asserted continues through the I-pipe 118I and into the E-unit. To avoid losing valid information in the I-pipe 118I, the E-unit pause bit 130 must be asserted at least by the time the available E-unit buffer (queue 116E) space equals the number of valid entries in the I-pipe. In the extreme case, the latter equals the length of the I-pipe 118I. In that case, the E-unit queue 116E fills until the number of available (empty) entries equals the number of I-pipe stages, at which point the E-unit 124 asserts the pause bit 130. Empty or invalid entries in the E-unit queue 116E are assumed to be clock-gated using the incoming valid bits (to save power), as in conventional fine-grain, stage-level clock gating.
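As an illustration only, the suspend behaviour described for FIG. 3 can be sketched as a small cycle-level model in Python. The function names, the four-stage depth, and the simplification that the inverted pause bit is shifted in on the same step that samples it are assumptions made for this sketch, not details taken from the circuit itself.

```python
from typing import List


def gcsr_step(gcsr: List[int], pause_bit: int) -> List[int]:
    """Shift the inverted pause bit into the left end of the GCSR.

    Every stored bit moves one stage to the right; the oldest bit drops off.
    """
    return [1 - pause_bit] + gcsr[:-1]


def stage_clocks(gcsr: List[int], system_clock: int = 1) -> List[int]:
    """Model the per-stage AND gates: I-CLK[i] = system clock AND GCSR[i]."""
    return [system_clock & bit for bit in gcsr]


def must_assert_pause(free_queue_entries: int, valid_pipe_entries: int) -> bool:
    """The pause bit must be asserted no later than the point at which the
    free E-unit queue space has shrunk to the number of valid I-pipe entries."""
    return free_queue_entries <= valid_pipe_entries


def simulate(pause_trace: List[int], n_stages: int = 4) -> List[List[int]]:
    """Return the per-stage clock enables after each system clock cycle."""
    gcsr = [1] * n_stages  # all 1s: full throttle
    history = []
    for pause_bit in pause_trace:
        gcsr = gcsr_step(gcsr, pause_bit)
        history.append(stage_clocks(gcsr))
    return history
```

Calling `simulate([0, 1, 1, 1, 1, 0, 0])` steps the model through one pause of four cycles: each 0 walks through the register from left to right, so the stage clocks stop one at a time rather than all at once.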
Similarly, when the pause bit 130 returns to 0, that is, when E-unit activity is detected to have fallen back below a predetermined level, the reverse ramp-up operation begins. The low value of the pause bit 130 resumes shifting 1s into the GCSR 150 while valid input data is shifted into the I-pipe 118I.
Successive stages of the I-pipe 118I are therefore enabled stage by stage by I-CLK 158 on successive system clock cycles, so that the I-pipe 118I resumes normal operation and passes data to the E-unit 124 at full throttle. Gating I-CLK 158 off and on stage by stage prevents large current swings, thereby minimizing the effect of L di/dt noise on the supply voltage.

FIG. 4 shows a second example, a slow-down circuit 170, which can be used in place of, or included with, the suspend circuit of FIG. 3 for demand-driven I-CLK throttling according to a preferred embodiment of the present invention. The slow-down circuit 170 is substantially similar to the suspend circuit of FIG. 3, and shared elements are labeled identically. A slow select 172 is provided to an inverted set input of a flip-flop/1-bit toggle counter 174 and to an AND gate 176. When the slow select bit 172 is high, that is, asserted, the AND gate 176 selectively passes the system clock 115 to clock the 1-bit counter 174. The slow-select GCSR 178 is substantially similar to the GCSR 150 of FIG. 3, except that only the final-stage output 180 is passed to all of the AND gates 182. The AND gates 182 may be three-input AND gates that also provide the suspend select function of the AND gates 156 of FIG. 3. Thus, each AND gate 182 can combine the final-stage output 180 with the system clock 115 and a corresponding stage output of the GCSR (150 of FIG. 3, not shown in this example). The individual output clocks of the AND gates 182 correspond to the I-pipe stages.

In this embodiment, the I-CLK 158 clock frequency can be throttled down (or up) in response to the slow select 172. By asserting (or deasserting) the slow select 172, the E-unit signals to the I-unit a slow-down (or speed-up) request from within the E-unit. In addition, all of the suspend features of I-CLK 158 can be retained for one or more cycles, as described above with reference to FIG. 3.
That is, when the queue is almost full, a suspend is triggered, so that the GCSR 150 provides a stop output to the corresponding individual AND gates 182. As described above, I-CLK 158 restarts when the length of the E-unit activity queue falls below a predetermined threshold.

Under normal operating conditions, the slow select 172 is low, which holds the 1-bit control counter 174 set so that it outputs a continuous high shift value to the GCSR 178, and the AND gate 176 prevents the counter from toggling. The GCSR 178 therefore normally contains all 1s, and the system clock 115 passes unmodified to the I-pipe stages 152. When the slow select 172 is asserted to signal an E-unit slow-down request (for example, because of a pause), the 1-bit control counter 174 is released as the slow select 172 rises at its inverted set input and the AND gate 176 passes the system clock 115. The 1-bit control counter 174 begins to toggle, passing an alternating sequence of 0s and 1s to the GCSR 178. Once the alternating pattern has propagated through the GCSR 178, the I-CLK control at the AND gates 182 is enabled and disabled on alternate clock cycles. In effect, this halves the main system clock frequency delivered as I-CLK 158.

FIG. 5 shows a variation on the embodiments of FIGS. 3 and 4, in which the individual GCSR stage outputs are passed to corresponding individual AND gates. Operation is substantially similar to FIG. 4, but the resulting I-pipe clock throttling may differ. During the throttled (slow-down) phase, in steady-state operation, alternate stages of the pipeline are clocked on any given system clock cycle, that is, each at half the system clock frequency. A valid bit (V) is shown in each I-pipe stage for illustration only. Such valid bits are normally present in the I-pipe, E-pipe, and queue structures of the examples above.
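The frequency halving described for FIG. 4, and the per-stage variant introduced for FIG. 5, can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function name, the stage count, and the choice to update the toggle counter and the shift register in the same step are illustrative only, and valid-bit gating is left out.

```python
from typing import List


def slow_down(slow_select_trace: List[int], n_stages: int = 4,
              per_stage: bool = False) -> List[List[int]]:
    """Return per-stage clock enables for each system clock cycle.

    per_stage=False: every stage is gated by the final GCSR stage (FIG. 4 style).
    per_stage=True:  each stage is gated by its own GCSR bit (FIG. 5 style).
    """
    counter = 1            # 1-bit toggle counter, held set while slow select is low
    gcsr = [1] * n_stages  # all 1s: the clock passes unmodified
    history = []
    for slow_select in slow_select_trace:
        # The counter toggles only while slow select is asserted.
        counter = (counter ^ 1) if slow_select else 1
        gcsr = [counter] + gcsr[:-1]
        if per_stage:
            history.append(list(gcsr))
        else:
            history.append([gcsr[-1]] * n_stages)
    return history
```

Once the alternating pattern has filled the register, the shared-gate variant enables the whole pipe on every other cycle, while the per-stage variant clocks alternate stages on any one cycle. In both cases each stage sees half the system clock frequency.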
Under the conventional feed-forward scheme, the upstream I-unit valid bits propagate downstream to the E-unit to enable fine-grain, stage-level clock gating that saves additional power. In this example, the valid bit of each pipeline stage provides an additional gating control to the AND structure that gates the local stage-level clock. Implementing this particular I-pipe clocking arrangement requires underlying circuitry (not shown) that prevents valid data in an I-pipe stage from being overwritten during throttled (slow-down) mode, for example redundant intermediate placeholder latches between stages, or the master and slave portions of each latch stage, which may hold information redundantly.

FIGS. 6A to 6B show, in more detail, another example 190 of the scalar I-pipe 118I corresponding to the cross-section of FIG. 3
and a corresponding timing diagram. Each I-pipe stage 152 is bounded by a register stage 192 at its input and output. In this example, each GCSR latch 154 is a two-stage latch, and the GCSR is essentially a single-bit serial-in/parallel-out shift register. The two-stage latch 154 includes a first-stage latch 194 and a second-stage latch 196. The first-stage latch 194 is identical to the latches of the register stages 192. The second-stage latch 196 is enabled with the clock polarity opposite to that of the first-stage latch 194. Thus, for example, when the first-stage latches 194 are clocked on the rising clock edge, the second-stage latches 196 are clocked on the falling clock edge, ensuring that a valid input is presented to the I-clock drivers 156. In FIGS. 3 to 5 and FIGS. 6A to 6B, the I-clock drivers 156 are shown as, and function as, AND gates. If desired, however, each AND gate 156 may include a clock driver, which may be a two-phase clock driver. An input block 198 provides the clock control logic appropriate to the particular type of suspend/slow-down control logic selected. Thus, in the example of FIG. 3, the input block 198 may include the inverter 160 and the first stage 194 of the first-stage latch 154 of the GCSR 150. Similarly, for the examples of FIGS. 4 and 5, the input block 198 may include the latch 174 and the AND gate 176.

FIG. 7 shows an example of the present invention applied to a pipelined superscalar processor 200. The superscalar processor 200 includes an I-unit 202 and an E-unit 220.
The I-unit 202 includes an instruction cache (ICACHE) 204; a combined IFU/BRU 206, comprising an instruction fetch unit (IFU) and a branch unit (BRU); a dispatch unit (DPU) 208; a completion unit (CMU) 210; and a branch address unit 214, comprising a branch history table (BHT) and a branch target address cache (BTAC). In addition, the I-unit 202 includes monitoring and clock control logic 216. The units 204, 206, 208, 210, and 214 operate substantially as the corresponding well-known units do but, as described below, are clocked by the monitoring and clock control logic 216 according to the present invention.

Normally, the IFU in the IFU/BRU 206 fetches instructions from the ICACHE 204 every cycle. The fetch bandwidth (fetch_bw), which in prior-art processors is fixed at the maximum number of instructions fetched per cycle, can now be adjusted on the fly by the monitoring and control logic 216. Depending on the free space available, the IFU places the fetched instructions in a fetch queue (FETCH_Q) in the IFU/BRU 206. An instruction fetch address register (IFAR) in the IFU directs instruction fetch and supplies the next fetch address at the beginning of each cycle. The IFU sets the next fetch address for each cycle to one of the following: (a) the next sequential address, that is, the previous cycle's IFAR value incremented by enough to account for the number of instructions fetched into FETCH_Q in the previous cycle; (b) the target of a branch instruction resolved or predicted taken in the previous cycle; or (c) the correct fetch address for a branch instruction, once it has been determined that the branch was previously mispredicted. The branch and instruction fetch address prediction hardware 206 includes the branch history table (BHT) and the branch target address cache (BTAC) and guides the instruction fetch process.
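The three-way choice of next fetch address listed above can be written as a short selection function. The sketch below is illustrative only: the function and parameter names are hypothetical, the fixed 4-byte instruction width is an assumption, and so is the priority order (misprediction recovery first, then a taken branch, then sequential fetch), which the text does not spell out.

```python
from typing import Optional

INSTRUCTION_BYTES = 4  # assumed fixed instruction width; not stated in the text


def next_fetch_address(ifar: int, fetched_last_cycle: int,
                       taken_branch_target: Optional[int] = None,
                       mispredict_recovery: Optional[int] = None) -> int:
    """Pick the next IFAR value from the three cases (a), (b), and (c)."""
    if mispredict_recovery is not None:
        # Case (c): corrected address after a branch was found mispredicted.
        return mispredict_recovery
    if taken_branch_target is not None:
        # Case (b): target of a branch resolved or predicted taken.
        return taken_branch_target
    # Case (a): sequential fetch, advanced by what was actually fetched.
    return ifar + fetched_last_cycle * INSTRUCTION_BYTES
```

Because case (a) advances by the number of instructions actually placed in FETCH_Q, a throttled fetch bandwidth simply moves the sequential address forward by a smaller step.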
Each active fetch (or dispatch) cycle normally fetches (or dispatches) a fixed number of instructions, as determined by the corresponding fixed bandwidth parameter (fetch_bw or disp_bw). However, when the E-unit 220 indicates that a slow-down or suspend is necessary, the preferred embodiment processor (200 in this example) dynamically adjusts the value of fetch_bw and/or disp_bw, in addition to the clock throttling described above.

The E-unit slow-down/suspend (or, conversely, speed-up/resume) signal is synthesized as a combinational function of status signals generated and monitored within the E-unit. These status signals may include: (a) indications of the fullness or emptiness of the issue queues FXQ 229, LSQ 232, FPQ 240, and VXQ 246; (b) DCACHE 238 hit or miss events; (c) congestion, or absence of congestion, on shared buses internal to the E-unit (for example, a single bus may be shared, and arbitrated, to carry completion information to the completion unit 210); or (d) execution pipe flush or replay conditions arising from branch mispredictions or other forms of mis-speculation. In this example, processor branch instructions can be executed in the FXU pipe. Alternatively, branch instructions can be executed in a separate, parallel BRU pipe.

The slow-down/suspend signal by which the E-unit throttles the I-unit pipeline flow rate can be acted on through either or both of I-CLK throttling and narrowing of the relevant I-unit bus bandwidth. For example, fetch bandwidth can be throttled without throttling the clock, by disabling half of the lines that receive fetched data from the ICACHE (effectively halving fetch_bw on a given access). Thus, to save power in throttled-bandwidth mode, half the usual number of entries are clocked in the instruction buffer (within the IFU 206). In general, depending on the severity indicated by the downstream E-unit slow-down/suspend indication, the fetch bandwidth can be throttled to any fraction of its normal-mode value, all the way down to zero.
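One hypothetical way to turn E-unit status into a fetch or dispatch bandwidth is sketched below. Only queue fullness, item (a) in the list above, is modeled. The severity scale from 0 to 1, the 75% high-water mark, the linear mapping, and both function names are assumptions chosen for illustration; the text says only that bandwidth can be reduced to any fraction of its normal value, down to zero.

```python
from typing import Dict


def severity_from_queues(occupancy: Dict[str, int], capacity: Dict[str, int],
                         high_water: float = 0.75) -> float:
    """Map the fullest issue queue to a severity between 0.0 and 1.0.

    Below the (assumed) high-water mark there is no throttling; above it the
    severity rises linearly and reaches 1.0 when a queue is completely full.
    high_water must be less than 1.0.
    """
    worst = max(occupancy[name] / capacity[name] for name in occupancy)
    if worst <= high_water:
        return 0.0
    return min(1.0, (worst - high_water) / (1.0 - high_water))


def throttled_bandwidth(max_bw: int, severity: float) -> int:
    """Scale fetch_bw or disp_bw down by the severity, all the way to zero."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError("severity must lie in [0, 1]")
    return int(max_bw * (1.0 - severity))
```

With a 16-entry FXQ holding 14 entries, this sketch yields a severity of 0.5, so a normal fetch bandwidth of 8 instructions per cycle would drop to 4, which matches the halved-fetch case mentioned in the text.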
Similarly, to save power, the dispatch bus bandwidth (disp_bw) can be throttled so that fewer instructions are dispatched to the power-hungry E-unit execution pipes, as needed or indicated.

The E-unit 220 includes a fixed point execution unit (FXU) 222, a load store unit (LSU) 224, a floating point execution unit (FPU) 226, and a vector multimedia extension unit (VMXU) 228. The FXU 222 includes a fixed point queue 229 and a fixed point unit execution pipe 230. The LSU 224 includes a load store queue 232 and a load store unit pipe 234. The FXU 222 and the LSU 224 both communicate with general purpose registers 236. The LSU 224 also communicates with a data cache 238. The FPU 226 includes a floating point queue 240 and a floating point unit pipe 242, as well as floating point registers and rename buffers 244. The LSU 224 also communicates with the floating point rename buffers 244. The VMXU 228 includes a vector extension queue 246 and a vector multimedia extension unit pipe 248.

Each of the units 229, 230, 232, 234, 236, 238, 240, 242, 246, and 248 operates substantially as the corresponding well-known unit does but, as described below, is clocked according to the present invention. As in any typical state-of-the-art superscalar processor, activity in the FXU 222 and activity in the FPU 226 are often mutually exclusive during a given phase of workload execution. While the FPU 226 is in use, the preferred embodiment processor 200 can disable or slow down the local clock of the FXU 222, and vice versa. In addition, the preferred embodiment processor 200 allows the LSU 224 and the FPU 226 to suspend or slow down each other's clocks. These intra-unit, fine-grain, demand-driven clock throttling modes supplement the inter-unit coarse-grain modes already described.
Unlike the preferred scalar processor example above, the two units 224, 226 in the E-unit 220 have no direct data-flow-path producer-consumer relationship; that is, no information flows directly between the LSU 224 and the FPU 226. Communication between the two units 224, 226 takes place indirectly, through the data cache/memory and the floating point register file 244. In general, an FPU pipeline 242 has several stages (for example, 6 to 8 stages in modern gigahertz-range processors), whereas a typical LSU execution pipe 234 has 2 to 4 stages. For this reason, and because current processors have large numbers of register rename buffers, the LSU pipe 234 tends to run substantially ahead of the FPU pipe 242 during DCACHE 238 hit phases. On the other hand, during phases of clustered DCACHE 238 misses, the effective LSU path latency can increase dramatically. If a series of closely spaced misses stalls the DCACHE 238, the LSU issue queue 232 fills, which can stall upstream producers. The present invention exploits this behavior, using the activity of upstream resources to drive fine-grain temporal clock gating or local clock throttling of the FPU 226.

FIG. 8 shows a more detailed example of the LSU 224 and the FPU 226 of the E-unit 220. In this embodiment, LSU event/activity status monitoring logic 250 monitors the utilization of the various LSU queues and derives an activity status for the LSU 224. In this example, the LSU queues include a load-store issue queue (LSQ) 232, a pending load queue (PLQ) 252, and a pending store queue (PSQ) 254, together with the DCACHE 238. The DCACHE 238 is monitored, and cache events are also recorded. It should be understood that these four units 232, 238, 252, 254 are chosen for illustration only; fewer or more queues and events may be monitored. For this example, the LSU event/activity status monitoring logic 250 asserts an output pause bit 256, which in this example is passed to the FPU 226.
For finer control, a set of pause bits can be used. For control purposes, the pause bit 256 can also be passed to the I-unit 202, for clock control of the IFU or dispatch unit 206.

For example, initially the LSU activity status monitoring logic output pause bit 256 and the output 258 of the FPU activity status monitoring logic 260 are both deasserted, so that the LSU 224 and the FPU 226 both run in normal, full-throttle operation. If the FPU activity status pause bit 258 is asserted (for example, because of high utilization of the FPQ) while the LSU activity status pause bit 256 remains deasserted, the LSU local clock is slowed to allow the FPU 226 to catch up with the LSU 224, which runs ahead of the FPU 226 during cache hit phases. Conversely, when the LSU activity status pause bit 256 is asserted while the FPU activity status pause bit 258 remains deasserted, the FPU local clock is slowed. If the LSU and FPU pause bits are both asserted, or both deasserted, at the same time, the LSU and FPU local clocks are both slowed down or both sped up by the same frequency, depending on other status conditions in the E-unit 220 and the I-unit 202.

Advantageously, the present invention can selectively slow down, speed up, or gate off a unit, or a component within a unit, in response to activity or inactivity in other processor or system units; that is, the invention has variable clock control granularity. The local clock control of each unit is derived from activity and status information flowing both forward and backward relative to the direction of data flow. Unlike the all-or-nothing clock gating of the prior art, the adaptive clocking of the preferred embodiment can use both feed-forward and feedback control to provide a more flexible, general clock throttling mechanism with selective bandwidth throttling.
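The pause-bit cases described for FIG. 8 can be summarized as a small decision function. The sketch below is a hedged illustration: the function name, the action labels, and the `slow_both` flag, which stands in for the "other status conditions" that decide the case where both bits agree, are assumptions of this example rather than elements of the described logic.

```python
from typing import Dict


def local_clock_actions(lsu_pause: bool, fpu_pause: bool,
                        slow_both: bool = False) -> Dict[str, str]:
    """Return the local clock action for the LSU and the FPU.

    lsu_pause models pause bit 256 and fpu_pause models pause bit 258.
    """
    if lsu_pause != fpu_pause:
        # Exactly one unit is paused: slow the other unit so the paused one
        # can catch up, and leave the paused unit's own clock alone.
        slowed = "LSU" if fpu_pause else "FPU"
        return {unit: ("slow_down" if unit == slowed else "unchanged")
                for unit in ("LSU", "FPU")}
    # Both bits agree: both clocks move together, in a direction decided
    # by other E-unit and I-unit status (modeled here by slow_both).
    action = "slow_down" if slow_both else "speed_up"
    return {"LSU": action, "FPU": action}
```

Keeping the two-bit decision in one place makes the symmetry explicit: each unit's clock is driven by the other unit's pause bit, not by its own.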
Pending information that might otherwise be lost when a prior-art pipeline unit is gated off is held in place in unit queues of suitable size. The performance degradation that such losses would cause is thereby reduced almost to zero.

In contrast to conventional clock gating methods, gradually throttling the clock rate down or up ensures superior (more moderate) current swing (di/dt) characteristics. The gradual change in clock rate therefore minimizes inductive noise. As a result, a preferred embodiment system consumes substantially less power without appreciable performance loss. Average power is reduced without significant performance loss (for example, in instructions per cycle, or IPC) and without significant additional hardware. Where maximum power consumption and temperature limits must be strictly observed, the present invention can successfully control power consumption while confining performance loss to a predetermined small time window, maintaining normal operating conditions and returning quickly to normal operation.

The dynamic activity levels of individual system components are monitored and used to adjust the clock rates of other components within the single synchronous clock framework that is distributed across the whole chip or system. Moreover, unlike asynchronous (or self-timed) units with local timing, or multi-clock systems or processors with multiple synchronous clock domains under global asynchronous control, the present invention does not require request-and-handshake protocols between separate components to stay synchronized. In addition, the present invention dynamically adjusts the clock rates of the various components so as to minimize the inductive noise normally associated with conventional coarse-grain clock gating methods.
Although the present invention has been described in terms of several (exemplary) preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

[Brief description of the drawings]

The foregoing and other objects, aspects and advantages will be better understood from the detailed description above of illustrative embodiments of the invention, with reference to the drawings, in which:

FIG. 1 shows an example of a typical state-of-the-art pipelined scalar processor and a corresponding instruction timing diagram;
FIG. 2 shows a high-level example of a preferred embodiment scalar processor with demand-driven or activity-driven power control according to the present invention;
FIG. 3 shows an example of an activity monitoring and clock control logic stall circuit, in which a gate control shift register (GCSR) gates the system clock to individual stages of the pipe;
FIG. 4 shows a second example, a slow-down circuit, which may be used in place of, or included in, the stall circuit of FIG. 3 in a preferred embodiment of the invention with on-demand I-CLK throttling;
FIG. 5 shows a variation on the embodiments of FIGS. 3 and 4, in which individual GCSR stage outputs are passed to corresponding individual AND gates;
FIGS. 6A and 6B show, in more detail, another example of a scalar pipe, in a cross-section corresponding to FIG. 3;
FIG. 7 shows an example of the present invention applied to a pipelined superscalar processor;
FIG. 8 shows a more detailed example of the LSU and FPU of the E-unit of FIG. 7.
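The arrangement named for FIGS. 3 and 5, a gate control shift register whose stage outputs feed per-stage AND gates, can be sketched as a small behavioural model. This is a hedged illustration only: the class name, the shift direction, and the polarity (a 1 bit blocks the clock) are assumptions made for the sketch and are not confirmed by the text above.

```python
class GateControlShiftRegister:
    """Toy model of a GCSR that gates a clock to pipeline stages.

    Assumptions (hypothetical): one bit per stage, a bit value of 1
    blocks that stage's clock, and bits shift one stage per cycle.
    """

    def __init__(self, n_stages):
        self.bits = [0] * n_stages

    def shift(self, stall_in):
        """Shift in one stall bit and drop the oldest bit."""
        self.bits = [1 if stall_in else 0] + self.bits[:-1]

    def gated_clocks(self, clock):
        """AND the clock level (0 or 1) with each inverted stage bit."""
        return [clock & (1 - b) for b in self.bits]


gcsr = GateControlShiftRegister(3)
gcsr.shift(True)             # a stall bit enters at the first stage
print(gcsr.gated_clocks(1))  # first stage is held, the others are clocked
gcsr.shift(False)            # the stall bit moves on to the second stage
print(gcsr.gated_clocks(1))
```

In this model a single stall bit holds exactly one stage per cycle as it ripples through the register, so the remaining stages keep their clock.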
[Description of reference numerals]

Reference | Description |
---|---|
100 | Pipelined scalar processor |
102 | Instruction unit (I-UNIT) |
104 | Execution unit (E-UNIT) |
106 | Register file (REGFILE) |
108 | Instruction cache (ICACHE) |
110 | Data cache (DCACHE) |
112 | Local clock buffer (LCB) |
114 | Local clock buffer (LCB) |
115 | Synchronous clock |
116E | Input queue |
116I | Input queue |
118E | Pipe |
118I | Pipe |
122 | I-unit |
124 | E-unit |
126 | Activity monitoring and clock control logic |
128 | Activity monitoring and clock control logic |
138 | E-unit clock |
150 | Gate control shift register (GCSR) |
152 | Individual stages |
154 | Flip-flop |
156 | AND gate |
158 | I-clock |
160 | Inverter |
170 | Slow-down circuit |
172 | Slow select |
174 | Flip-flop / 1-bit toggle counter |
176 | AND gate |
178 | Slow-down select GCSR |
180 | Final stage output |
182 | AND gate |
192 | Register stage |
194 | First stage latch |
196 | Second stage latch |
198 | Input block |
200 | Pipelined superscalar processor |
202 | I-unit |
204 | Instruction cache (ICACHE) |
206 | Combined IFU/BRU |
208 | Dispatch unit (DPU) |
210 | Completion unit (CMU) |
214 | Branch address unit |
216 | Monitoring and clock control logic |
220 | E-unit |
222 | Fixed point execution unit (FXU) |
224 | Load store unit (LSU) |
226 | Floating point execution unit (FPU) |
228 | Vector multimedia extension unit (VMXU) |
229 | Fixed point queue |
230 | Fixed point unit execution pipe |
232 | Load store queue |
234 | Load store unit pipe |
236 | General purpose registers |
238 | Data cache |
240 | Fixed point queue |
242 | Fixed point unit pipe |
244 | Fixed point registers and rename buffers |
246 | Vector extension queue |
248 | Vector multimedia extension unit pipe |
250 | LSU event/activity status monitoring logic |
252 | Pending load queue (PLQ) |
254 | Pending store queue (PSQ) |
256 | LSU activity status monitoring logic output stall bit |
258 | Output |
260 | FPU activity status monitoring stall |
A | Array |
B | Array |
C | Array |
disp_bw | Fixed bandwidth parameter |
fetch_bw | Fixed bandwidth parameter |
FPU | Floating point execution unit |
FXU | Fixed point execution unit |
LSU | Load store unit |
Claims (1)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/187,698 US7076681B2 (en) | 2002-07-02 | 2002-07-02 | Processor with demand-driven clock throttling power reduction |
Publications (2)
Publication Number | Publication Date |
---|---|
TW200411565A (en) | 2004-07-01 |
TWI232408B (en) | 2005-05-11 |
Family
ID=31975946
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW092118079A (TWI232408B) | 2002-07-02 | 2003-07-02 | Processor with demand-driven clock throttling power reduction |
Country Status (2)
Country | Link |
---|---|
IL (1) | IL165542A0 (en) |
TW (1) | TWI232408B (en) |
2003
- 2003-07-02: TW, application TW092118079A, patent TWI232408B, not active (IP Right Cessation)
- 2003-08-26: IL, application IL16554203A, patent IL165542A0, status unknown
Also Published As
Publication number | Publication date |
---|---|
TWI232408B (en) | 2005-05-11 |
IL165542A0 (en) | 2006-01-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |