TWI232408B

TWI232408B - Processor with demand-driven clock throttling power reduction

Info

Publication number: TWI232408B
Application number: TW092118079A
Authority: TW
Inventors: Pradip Bose; Daniel M Citron; Peter W Cook; Philip G Emma; Hans M Jacobson
Original assignee: Ibm
Priority date: 2002-07-02
Filing date: 2003-07-02
Publication date: 2005-05-11
Also published as: IL165542A0; TW200411565A

Abstract

A synchronous integrated circuit such as a scalar processor or superscalar processor. Circuit components or units are clocked by and synchronized to a common system clock. At least two of the clocked units include multiple register stages, e.g., pipeline stages. A local clock generator in each clocked unit combines the common system clock and stall status from one or more other units to adjust register clock frequency up or down.

Description

1232408 玫、發明說明：【發明所屬之技術領域】本發明一般係關於由複 ^ _ 數们破計蚪組件或單元組成之微處理益或糸統的電力消耗的減低與控制。【先前技術】半導體技術與晶片生產之曰進步V致晶載時鐘頻率、單一日日片上之電晶體數目及晶教曰y 身的尺寸穩定增長，並導致日日片ί、笔電壓的對應增長。一 4 Φ ^ ^ ^ 叙而吕，特定被計時單元的電力湞耗隨其切換頻率而線曰長。因此，雖然晶片供電電壓下降，晶片的電力消耗 m a 仍W上升。對於晶片和系統級別，因晶片電力的增長，A欠 7 σρ與包裝成本自然也會增加。對於低端系統（例如，手持、 ‘式及仃動糸·統），電池壽命至關重要，在不降低性能至不曰妾又水準的刮提下減低淨能置很重要。因此，微處理哭雷 σο電力泊耗的增加即成為今後改善性能的主要障礙。純量處理器-次祿取與發出/執行一個指令。各該指令均在純量資料運算元上運算。各該等運算元為單一或原料值或數。一純量處理器内的管4 、吕、·泉作業引起所謂的並行性問題，即，在一特定時鐘週期處夕個才曰令，同時保持單一問題範例。平週期擷取、發出與執行多個發出與執行路徑通常係排列。超純量處理器的範例包括杰、英代爾公司的Pentium 超純量處理器可在特定機器指令。此外’各指令之榻取、成管線以致動進一步的並行性 IBM公司的 P〇wer/P〇werPC處理 86560 12324081232408 Description of the invention: [Technical field to which the invention belongs] The present invention generally relates to the reduction and control of micro processing benefits or system power consumption composed of complex components or units. [Previous technology] The progress of semiconductor technology and wafer production caused the clock frequency of crystal carrier, the number of transistors on a single chip and the size of the crystal chip to increase steadily, and led to a corresponding increase in the voltage of the chip and the pen. . A 4 Φ ^ ^ ^ As described above, the power consumption of a specific timed unit increases with its switching frequency. Therefore, although the power supply voltage of the wafer decreases, the power consumption m a of the wafer increases. For wafer and system level, due to the increase of wafer power, A owing 7 σρ and packaging cost will naturally increase. For low-end systems (for example, handheld, mobile, and mobile systems), battery life is critical, and reducing net energy settings without sacrificing performance to standard levels is important. Therefore, the increase in the power consumption of microprocessing cry Thunder σο becomes the main obstacle to improve performance in the future. Scalar Processor-Cilo fetches and issues / executes an instruction. Each of these instructions operates on a scalar data operand. Each of these operands is a single or raw value or number. The operations of the tubes 4 and 4 in a scalar processor cause the so-called parallelism problem, that is, ordering at a specific clock cycle while maintaining a single problem paradigm. Flat cycle fetch, issue, and execute Multiple issue and execute paths are usually aligned. Examples of ultrascalar processors include Jay and Indell's Pentium ultrascalar processors that can be executed on specific machine instructions. In addition, the instructions are fetched and pipelined to enable further parallelism. IBM ’s Power / PowerPC process 86560 1232408

Pro (P6)處理器系列、昇陽公司的處理器與惠普 A 司（Hewlett Packard Company ; HP)的 PA_RISC處理器、以及月）康柏A司（現已與Hp合併）的Α1_處理器系列。向LH般排列成管線’可在單—結構步驟或指令的整個數目陣列上實施-操作。例如，m可將陣歹Μ的各項目加至陣列B的對應項目，並將結果儲存於陣列 C的對應項目。向量指令通常係被支援為基本純量指令組的申/、有可在較大應用程式中向量化的編碼區段才可在向ϊ引擎上執行。向量引擎可為單一、管線執行單元，或可、、且、，哉成一陣列或單_指令多重資料（如咖 multiple data ; SIMD)機，同資料同時執行同一指令腦均係向量處理器。具有多個同樣的執行單元，對不。例如，一般而言，Cray超級電同步被計時處理器或系統具有一單一、整體主時鐘，驅動組成該系統之所有單元或組件。㈣，料鐘的比例衍生可能較主時鐘頻率循環特定子單元更快或更慢。通常藉由設計靜態職與預設該等時鐘決定。例如，英代爾的 Pentlum 4處理器的整數管道時鐘兩倍快於晶片主時鐘，顯然應用了熟悉技術人士熟知的加倍幫浦或波形管線技術。該等時鐘加倍技術提高了處理器的執行速度與性能。但是 ’匯流排與晶片外記憶體的速度未與處理H計算邏輯核心的速度保持同步。@此，大多數最新技術處判且有晶片外匯流排及快取記憶體’其運作頻率為主處理器時鐘頻率的整數約數倍。通常’料時鐘運作頻率係、在設計系統時 86560 1232408 固定。此即當前代處理器複合體可能具有多個時鐘速率的 '、有加倍幫浦及波形管線技術也用於更高端的機 σσ以減輪處理器與外部匯流排或記憶體之間的性能失配。Pro (P6) processor series, Sun Microsystems processors and Hewlett Packard Company (HP) PA_RISC processors, and Month) Compaq A (now merged with HP) A1_ processor series . Arranged in an LH-like pipeline 'can be performed on an entire number of arrays of single-structural steps or instructions. For example, m can add each item of array 歹 M to the corresponding item of array B, and store the result in the corresponding item of array C. Vector instructions are generally supported as basic scalar instruction sets. Only code segments that can be vectorized in larger applications can be executed on the puppet engine. The vector engine can be a single, pipeline execution unit, or can be, and, and can be formed into an array or single instruction multiple data (such as multiple data; SIMD) machines, and the same instruction can be executed simultaneously with the same data. The brain is a vector processor. Have multiple identical execution units, right? For example, in general, a Cray super-synchronized timed processor or system has a single, integrated master clock that drives all the units or components that make up the system. Alas, the proportional generation of the clock may be faster or slower than the main clock frequency cycling certain subunits. This is usually determined by designing static clocks and presetting such clocks. For example, Intel's Pentlum 4 processor's integer pipeline clock is twice as fast as the chip's master clock, obviously using doubling pump or waveform pipeline technology familiar to those skilled in the art. These clock doubling technologies increase processor execution speed and performance. However, the speeds of the 'bus and off-chip memory are not synchronized with the speed of the processing logic core. @ 这， Most of the latest technology judges and has chips. Forex Streaming and Cache Memory ’its operating frequency is an integer multiple of the main processor clock frequency. Normally, the clock frequency of the material clock is fixed. When designing the system, 86560 1232408 is fixed. This means that the current generation of processor complexes may have multiple clock rates, and double-pump and waveform pipeline technologies are also used in higher-end machines σσ to reduce the performance loss between the processor and external buses or memory. Match.

Rabaey’ Jan Μ•與 Pedram，Massoud 編輯出版（Kluwer 學術出版社，1996年版）的「低電力設計方法」說明了使用同步才4里閘}工的電力減低方法，纟中，在—再生點可停止時鐘 17在局部時鐘緩衝器（local clock buffer ; LCB)内饋迗一特义晶片區域、組件或鎖存器。在粗略控制級別，時鐘㈣功能邊界閘控。在精確控制級別，時鐘係於單獨鎖存杰内閘控。例如，Gerosa等人在1994年12月12日出版的删期刊「固態電路」上發表的論文「2.2w，8QMHz超純量RISC微處理器」（第144〇至1454頁)中說明了根據各週期調：及執行的指令的不同執行單元的閘控時鐘。當處理器執行某功能類別的指令（如純整數或純浮點指令）之序列時，粗粒單元級別時鐘閑控較有利。當輸入工作量$處理器僅看到整數碼時，即停用該浮點單元的時鐘再生器。同樣地，在純浮點谨管運异期間，可停用該整數單元的時鐘。此舉可節省相當數詈，、曰曰片電力。通常使用軟體藉由序列指令或使用硬體伯泪丨卩卩谓，則閒置週期局部實現粗閒置控制。精閒置控制通常係藉由在沪 , 田在^令解碼期間避免不必要的傳播無效或無關緊要的資料而局部實現。閘控資訊之一因果 =其起源之初始點流向下游級或單元稱為前饋流。此種 k路fe可能包括迴路，具有 ,、、、員的倒流，但是，該因果資訊流仍然視為前饋程序。因、 ’粗間置控制及精閒置控制 86560 1232408 均係自我觸發前饋。另一方面，使用下游管線暫停㈣f理前饋流構成一回杈控制糸統。於此’控制資訊流係從下游的「果」流向上游的「因」。粗粒與細粒暫停控㈣主要用於防止覆蓋管線處理器中的有效暫停資料，但是，該等機制也可用於節省電力消耗。例如]aeGbs—人在助年4月的麵 ASYNC-2002會議上的論文「同步、、」乂立鎖官線」中提出一用以減少同步管線電力的細粒暫停傳播機制，該機制使用「有效」位元補充更傳統之時鐘閘控的細粒前饋機制，如上述 Ge觸等人所述，亦參見G〇wan等人的「αι_2ι264微處理器設計中之電力考量」（1998年ACM/咖設計自動化會議期刊第726至731頁〇998年6月））。但是，如等人之已發表的細粒暫停間；^「q拆、拖& 一 & (口 & )機制不似本發明用於控制資訊流的速率（透料鐘或匯流排頻寬調節）。 /且閒置控制至少產生必須解決的兩個問題。第-，大瞬Rabaey 'Jan Μ • and Pedram, Massoud's Editing and Publishing (Kluwer Academic Press, 1996 edition) "Low Power Design Method" illustrates the method of reducing power using synchronous brakes. The stop clock 17 feeds a special chip area, component or latch in a local clock buffer (LCB). At the coarse control level, the clock boundary function is gated. At the precise control level, the clock is gated on a separate lock. For example, Gerosa et al., In a paper "2.2w, 8QMHz ultra-scalar RISC microprocessor" (pages 1440 to 1454) published in the deleted journal "Solid State Circuit" published on December 12, 1994 (pages 1440 to 1454), explained Periodic adjustment: The gated clock of different execution units of the executed instruction. Coarse-grained unit-level clock idle control is advantageous when the processor executes a sequence of instructions of a certain functional category, such as pure integer or pure floating-point instructions. When the input workload $ processor sees only whole digits, the clock regenerator for that floating-point unit is disabled. Similarly, the clock for this integer unit can be deactivated during a pure floating-point operation. This can save a considerable amount of electricity. Usually using software by sequence instructions or using hardware hardware, the idle period partially implements coarse idle control. Fine idle control is usually implemented locally by avoiding unnecessary transmission of invalid or irrelevant data during decoding in Shanghai. One cause and effect of gated information = the initial point of its origin to flow to the downstream stage or unit is called feed-forward flow. This k-way fe may include a loop, with a backward flow of ,,,, and members, but the causal information flow is still considered a feedforward procedure. Because, ′ coarse interposition control and fine idle control 86560 1232408 are self-triggered feedforward. On the other hand, the use of downstream pipelines to pause the feedforward flow constitutes a loop control system. Here, the 'control information flow' is the "cause" from the downstream "effect" to the upstream. Coarse-grained and fine-grained pause control is mainly used to prevent the effective suspension of data in pipeline processors, but these mechanisms can also be used to save power consumption. For example] aeGbs—People proposed a fine-grained pause propagation mechanism to reduce the power of synchronous pipelines in the paper "Synchronization," and "Holding the Official Line" at the ASYNC-2002 conference in April. “Effective” bits complement the more traditional fine-grained feedforward mechanism of clock gating, as described by Ge Touch et al. Above, and also see “Power Considerations in αι_2ι264 Microprocessor Design” by Gowan et al. (1998 ACM / Coffee Design Automation Conference Journal pp. 726-731 June 998)). However, such as the fine-grained pauses that have been published by others; ^ "q split, drag & one & (mouth &) mechanism is not like the present invention used to control the rate of information flow (through clock or bus frequency (Wide adjustment). / And idle control generates at least two problems that must be solved.

變電流下降及增益可墓私a I ¥日日载七、電電壓不可接受的電感 (Ldi/dt)雜訊位準。第二，閘控開關程序需要額外週期保持正確的功能運作。針對工作量的細粒相位變化，過於頻繁地在閘控與致動模式之間切換可導致不可接受的性能下降。而且《最新精閒置㈣依賴局部產生的管線級別時鐘問控的閘控信號或條件，你丨‘拍德次> 例如根據貝料無效或無關運算元條件。該等最新方法不根據預計或預料產生閘控信號。因此 ’時序要求通常很重要’因為閘控信號必須在判定前提供 ’並且用於判定無錯時鐘閘控運作的合適期間。G〇wan，M. 86560 10 1232408 K·，Biro, L. L·與 Jacks〇n，D B 在 1998 acm/ieee 設計自動化會4刊物上的論文「Alpha 21264微處理器設計之電力考量」（弟726至731頁）（1998年6月）中說明了該等限制如何極大地使設計時序分析複雜化，甚至導致時鐘頻率性能劣化。不論基本控制機制係前饋（因果流）還是基於回授（果因流），取新％麵閘控技術（不論粗或精）均僅為空間控制。這是因為，應用資訊係用於消除受影響區域的冗餘時鐘，而不考慮5亥等區域或機器之其他區域的時間活動或歷史。下游(消費者）單元與級（如執行管道或問題符列）中的活動狀態與事件並未:授以調節上游（生產者）在非鄰近區域（如指令掏取或周又單元）的時知或貧訊流速率。同樣地，上游生產者區 j的活動狀恶與事件也未前饋以調節下游消費者的時鐘或貝：•速率。而且’若時鐘信號已致動或未致動，一般間閉時鐘信號即可或沒有影響。需要改進可按細粒空間及時間粒度運作的所連接吕線早元的時鐘電路的大Μ/% 能（額外）劣化及對基層 0大电流/電壓擺動。【發明内容】本毛明之目的係減低處理器的電力、、άΚ n + 性能損失。幻玉力4耗同時沒有明顯的本^明係一種同步積體電路，如吨β ~ # + 理器。電路組件或單元係由里處理裔或超純量處該共同、°錢時鐘料並同步於級，例如其:個被記時的單元包括多個暫存器 s線級。各被記時單元中的-本地時鐘產生器結 86560 1232408 合該共同系統時鐘與一或多個其他單元的暫停狀態，以向上或向下調節暫存器時鐘頻率。【實施方式】現參考附圖，更特定言之，圖1 一純量處理器i。。之範例的高位準方塊圖及二序圖。該主要功能資料路徑係分成兩個主要通常稱為指令單s (instruction unit;卜刪叩_二行單元 (^eCUtiGn _ ;㈣組）1 G4。本發明未說明I-單幻〇2與E-早几104中的許多細分子單元與功能邏輯（如分支預計邏輯），因此，為使處理器100之整個時鐘控制的說明簡明，對之予以省略。該範例之處王里器100係管線純量設計之代表，此種設計不包括防止單元、子單元或其包括之輔助儲存資源的冗餘計時的時鐘閘控。由單元1。2或1()4之-或二者存取的晶載儲存包括暫存器標案（register fUe ; REGFILE)1〇6、指令快取記憶體 (instruction cache ’· ICACHE)108及資料快取記憶體 cache ; DCACHE)ll〇。REGFILE 1〇6一般為共用資源可從單元102、104存取，因此被視為單獨實體。icache ι〇8 係I-單元管道的第一終止級，因此被視為單元1〇2的部分。DC ACHE 110通常可單獨從E_單元1〇4存取，因此被視為 E-單元104的部分。兩個單獨的局部時鐘緩衝器（LcB)i 12、 114各放大並分配一共同同步時鐘115至單元1〇2、1〇4中的對應單元。各單元1 02、1 04包括一輸入仔列1 1 61、1 1 6它與一管道1181、118E。視需要在！-單元1〇2與^單元1〇4中， 86560 1232408 可具有用於該共同系統時鐘* 之、，、田粒7刀配、放大與控制之 LCB 112、114的進一步層級。對於該範例而言，電腦程式飞的拓令係包括於I-單元管道的第一級、ICACHE 108中。—私土士 Μ々吕，在可能引起ICache 未中的各種條件下，Ic ache i μ γ t108可暫停可變數目的處理器週J。该暫停隱含適應未中f+ 對先刚指令傳送級(即該指令記隐體層級的較低級別）的影響。 —該同步時鐘連續„各單元通過咖U2、114，而不論 :::的暫停。切換電容調變及圖案位元變化引起電力消耗受化，但是變化範圍很小。 ,,^ u此仗轾式執行的開始到 :束，各時鐘週期大致消耗了同樣的能量(此處以標準能量早位表示）。因此，藉由使用電力gp讀術，尤其是根據本毛明之電力節省方法可節省相當的能量。在編碼產生期間’藉由插入包括於指令組結構中的特別心令的、編譯器，可同步化粗閒置控制，或者，可由作業系動〜各出^亥等扣令，例如當實施一特別中斷或在某背景 =時。在最粗略控制級別’可能發出一特別睡眠型指令 ,、亥特別睡眠命令可產生一停用信號，將晶片選取 :分的時鐘停止一段時間。該同一睡眠命令也可用於停用指令操取程序。同樣地，當該停用信號變負後，或在睡眠週期過後，-隱含喚醒開始，或可用—明示非同步中斷實 :喚醒。如技術人士所熟知，可藉由在LCB層級之各級別選擇性地停用的時鐘分配樹來提供各種省電模式（如小睡、淺睡或睡眠)。在下一級別更細的粒度，每當該編譯器能 86560 -13 - 1232408 夠靜態預計計算相位 ^ 始閘閉特定單元的時…編潭為即可插入特別指令，開、'’里’如浮點單元的時鐘。括自我偵測機制，告一單开穴許其停用自己的時鐘叶日士=早7^現自己閒置時，允測處理器的局部閒置在硬體中’可設計邏輯债此戋全1 n f 、功。然後，偵測結果可觸發停用某一 4王冲閒置區域的接收到的斩杯祆㈣里。喚醒也是根據停用或睡眠單元接收到的新任務以類似方式自我啟動的。對於精閒置控制， At 鐘。例如，當作—超二 =信號逐個週期地閘控局部時 t ^ ^ 〇里機态，该處理器在指令解碼期間 /、疋了在Ik後執行週期時 — 丁3里间才工的功旎早兀管道。該方式確並提Λ 機制的處理器内運作良好，因此’可明入I ’間（即在解媽或調度時）做出閘控決定。若指項方式保存於—中央問奸列，則即使對 ^ 幻也可在發出.時間產生該等閘控信就。 ♦在任何管線資料路徑，可動態_並選擇性地防止冗餘汁寺例如’沿該邏輯f線傳播—資料有效旗標，僅當產生於：週期的資料有效時才設定該資料有效旗標。然後，各邏輯級的資料有效旗標可用作一時鐘致動信號，用於設 =°亥級輸出鎖存器。因此’在該等隨後管線級無須為無效貧料計時’該等管線級可稱為基於細粒、有效位元的管線級級別時鐘閘控。 2001年6月12曰授予如〇6等人的美國專利6,247.，1342 5虎筇省电力運作管線電路中的管道級閘控方法與系統」 86560 1232408 說Γ了—種處理器，其邏輯將在由該邏輯之第—級實施的 :、週期計算中不會在管線中變化的任何新收到的運算元辨識為無關的。將此種不變條件信號制為無關可用於停用該第-級的時鐘，然後連續停用隨後各級的時鐘。τ ^hnishl，M.，Yamada，A” Noda，Η.與 Kambe，Τ.在 1997年低電力電子產品及設計國際研討會⑽· Symp· 〇n jowei· Electronics and Design ; ISLpED)刊物上的論文「以及別。又。十之几餘計時偵測與電力減低之方法」（第1 3 1至 136頁）中說明了其他更詳細的防止各種冗餘鎖存器計時的 "ί貞測機制。如上所述，不論粗或細控制，該等最新技術閒置控制機 :均^自我觸發、空間或前馈的，即閘控條件或信號係根 ^對早凡閒置或無效狀態的偵測而局部產生，並在有該等單7L可為一整個區域或功能單元，或可為一管線 ’’’、放或閒置狀態位元時避免不必要的逐個週期時鐘和資料傳播。一一級鎖存器組相士反地’在H具體實施例中’—具有以要求馬區動的日守鐘調節的純量管線處理器與一與一執行單元 (eXeCUtl〇ri unit ; E-unit)可調節運作的一指令單元 Gnstrucuon unit ;卜仙⑴在該兩個單元之間建立了生產者與 /肖費者的關係。生產者ί單元轉送現有資料致動指令至該執仃單凡，以在其中以不快於該執行單元可接受的速率處理上各早兀保持具有至少1位元資訊的一活動狀態暫存器。在泫項具體實施例中，該;μΕ單元對係由一共同同步時鐘計時 86560 -15- 1232408 。但是’各單元的時鐘係根據在單元之間傳遞的 … 活動資訊進行局部修改及控制。各單元的局。= =局部單元及遠端單元(即兩個單元)之活動狀態= =根據本發明之以要求或活動驅動電力控制純量 -^20之較佳具體實施例的高位準範例，其元件盘圖μ 相同標識的元件相同。在該項具體實施射，處理單元122 124各包括活動監視及時鐘控制邏輯丨26、us，監視單元活動級別。在一簡化具體實施例中，一單—活動狀態：： no係-暫停位元，指示暫停/非暫停條件。扣單元⑶感應到-暫停條件時（當前的或迫近的）時，其即判定該暫停位 ⑼0。該暫停位元130係用於調低卜單元時鐘CLP 132的時鐘速度，以調低！-單元122，並因此減低（或切斷）至匕單兀124的指令速率。根據控制粒度，E-單元活動狀態或暫停位元130可在E_單元124内調節其自己的時鐘，例如η心當 E-單元i24暫停結束後，clu⑴被調升至其正常時鐘速率同樣地，當I-單元1 22偵測到暫停條件時（如一 icache 108未中）’數個週期後μ管道118i為空，因而沒有給&單元 24的貝汛’則一官這空位元丨3 6係用於調低u x節省電力。活動監視及時鐘控制邏輯1的該上/下調節能力可使用多種不同方法實施，下文將提供數個範例。例如各單元犄鐘可一次退出一管線級。同樣地，當閘拴條件結束時，該單元時鐘又一次返回一管線級。該控制 T輯允卉各單元日寸知及時退出或返回，而不失去有效資訊 86560 -16- 1232408 盥也不增加介面邏輯、緩衝，也沒有浪費能量的管線佔用與再循％。或者，可放慢或逐步降低單元時鐘頻率以節省而且，若/根據需要，時鐘可隨後放慢至停止。圖3顯示活動監視與時鐘控制邏輯懸置電路的一第一範 "、中閘控控制移位暫存器（gating control shift register, GCSR) 150使系、統時鐘U 5傳遞至l管道1181的個別級152。GCSR 150係被計日夺i位元正反器154的—元線性移位暫存器，該正反器對應於各管道級152。系統時鐘 115係AND於各級，在AND閘極156之一中具有之各位元，選擇性地將系統時鐘115當作一 and閘極輸出傳遞至一對應I-管道級152,共同構成1_(：1^〖158。在該範例中，通常沒有偵測E-單元暫停，暫停位元13〇保持為低，並由反相器160反轉。藉由GCSR 15〇的所有丨，卜管道丨丨幻處於王凋節因此，AND閘極1 5 6將未改變的系統時鐘J1 5當作 I-CLK 158傳遞至I-管道級152的各級，各等同於系統時鐘 Π5 /、要E_單元暫V位元130保持為〇，一即移位至gcsr 150，藉此將I-管道1181保持於全調節。在此範例中，當E-單元活動監視與時鐘控制邏輯（分別為圖2的124與128)感應到一迫近的暫停條件時，其即判定單元暫停位元130。該暫停位元13〇由反相器16〇反轉，向 GCSR 150提示一0，該〇隨後在下一系統時鐘週期被移位至 GCSR 150。在同一週期，該第一^管道級（即自圖2的 ICACHE 1 16)被停用，暫停對卜管道丨181的資料輸入，同時將0移位至並通過GCSR 150。只要暫停位元13〇被判定，〇 86560 -17- 1232408 移位至GCSR 15G，每系統時鐘週期_次。各零隨後停止向本項範例中自左向右漣波的AND閘極丨56之後續傳遞系統時鐘。因此，丨-管道時鐘158的一級在某時從左至右閘閉。於是，在判定暫停位元13Q之前，有效^道項目繼續通過 I-管道11㈣進人E-單元。為避免在^管道⑴！遺失有效資 Λ至）虽可用Ε_單元緩衝器（佇列11 6Ε)空間等於1_管道 1181中的有效項目數時，必須判定Ε_單元暫停位元13〇。= 此至父备可用Ε-單元緩衝器（件列116Ε)空間等於I·管道 1181中的有效項目數時’必須判定Ε_單元暫停位元⑽。在極端情況下，後者直接等於^管道1181的長度。因此，在傳統設料’每當Ε-單元仵列116Ε填充至使可用（空旬件列項目數等於I-管道級之數目時，Ε_單元124即判定暫停位元 1 3 〇。Ε -單元仔列11 6 Ε中的空白或無效項目係假定用預輸入有效位元（為節省電力）進行時鐘閘控，與傳統細粒、級-級別時鐘閘控一樣。同樣地，當暫停位元13〇返回至〇時，即當偵測到^單元 ^動已經回到預定級別以下時，即開始反向上調運作。暫停位元130的低值將恢復移位uGCSR 15〇，同時移位有效 '貝料至1-官道1181。因此，卜管道1181的超連續後續系統時鐘週期LCLK 158逐級致動，使管道1181恢復正常運作將貝料傳遞至處於全調節的匕單元124。逐級閘控 158關/開可防止大的電流擺動，藉此使供電電壓的Ldi/dt 雜訊影響最小。圖4顯示—減速電路170的第二範例，t玄電路可用於替代 86560 -18 - 1232408 或包括在根據按要求LCLK調節之本發明的一項較佳具體實施例之圖3的懸置電路中。減速電路Π0實質類似於圖3 的懸置電路，其共用或共同元件標記相同。—遲緩選擇口〕係提供至正反器/1位元觸發計數器174的一反轉組輸入及 AND閘極176。當遲緩選擇位元172高時’即被判定時，侧閘極m選擇性地將系統時鐘115傳遞至時鐘i位元計數器 174。遲緩選擇GCSRm實質類似於圖3的gcsri5〇,僅最終級輸出180傳遞至所有娜閘極182除外。娜閑極⑻ 可為三輸入AND閘極，提供圖3 AND閘極156的懸置選擇功能。因此，AND閉極1δ2可組合最終級輸出18〇與系統時鐘 115 及 GCSR (圖 3 的 130，兮益/6；丨土 wi： -、 1的川s亥乾例未顯不）的一對應級輸出。 AND閘極182的個別輸出時鐘對應於^管道級。在該項具體實施例中，回應遲緩選擇172，J-單元的時鐘頻率可調低（高）。藉由判定（解判定）該遲緩選擇172，£ ^ 元向！-單元提醒£單元内的遲緩（增加）要求。此外，上述縣置i-clK158的全部特徵可保持—或多個週期，如上述參: 圖3所述’當E_單元佇列幾乎滿時，即觸發一次懸置，因而 KGCSR 15〇向對應的個別娜閘極182提供停止輸出。如 ^所述’當E-單元活動仔列的長度減低至預定臨界值以下時’ I-CLK 158重新啟動。在正常運作條件下，遲緩選擇172為低（使該1位元數器向GCSR 178輸出一連續古浐仿枯、制5十 -船包括所右〗連、.貝同和位值）。因此，GCSRl78 、匕括所有1，而系統時鐘115將未改變的級152。當判定遲緩呂道擇川，出E-早疋減速要求信 86560 -19- 1232408 (由於暫條），AND閘極1 76防止1位元控制計數器！ 74觸發田遲、k選擇174升至其反轉組輸入且AND閘極176傳遞系、、先呀知115時，該丨位元控制計數器174被釋放。該丨位元控制计數器174開始觸發向GCSR 178傳遞〇與1的交替序列。一旦該交替圖案傳播通過GCSR 178，AND閘極182的^ CLK控制即按交替時鐘週期致動與停用。結果，此舉將當作I-CLK 158提供的主系統時鐘頻率二等分。 -圖頒不圖3與4的具體貫施例的變化，其中個別⑼伙級輸出係胃傳遞至對應的個別AND問極182。運作實質類似於圖 4，但是最終I-管道時鐘調節可能不同。在調節（減速）階段 ’在較狀態運作m道㈣代級係按_特定系、統時鐘 :期計m統時鐘頻率的—半。各^道級内僅為示範 ‘明一有效位元(V)。該等有效位元通常存在於圖⑴之範例的I-管道、E-管道、ί卡列與E卞列結構中。根據傳統前饋方案’上游I-單元有效位元係向下游傳播至匕單元，以致 s線、及的有效位兀為同步該局部級 '級別時鐘的AND結構提供一額外閘控控制。該具體實施例的該特定[管道計時配 _基礎電路(未顯示)防止在調節(減速)模式期間覆蓋卜官這級的有效資料，例如級之間的多餘，間佔位鎖存器，或可能重複資訊儲存的各鎖存器級的主與從部分。圖6UB更詳細地且顯示對應於圖3斷面圖的㈣厂管道二81的另一犯例190以及_對應時序圖。各^管道級⑸係由曰存器級192包括在輸人與輸出上。在該項範例中，各⑽r 86560 -20- 1232408 鎖存态1 5 4係兩級鎖存哭 , 出新在^ π 基本上為串列-輸入/並行-輸出曰存為的一單一 - J. V. 位兀。该兩級鎖存器154包括一第一級鎖存器1 94及一第二纺縣六σσ，η 、，及鎖口口、、肩存為196。該第一級鎖存器194與暫存器級192的鎖存器相同。 ^ 茨弟一級鎖存态196係使用負時鐘極性自該第一級鎖存器1 94的一廿口。1料白勺負日守釦極性致動。因此，例如 ’當該等第-級鎖存器194在上升時鐘邊緣被計時時，該第二㈣存器]96係在下降時鐘邊緣被計時，以確保為^時鐘驅動器156提供有效輸入。如參考圖3至5及圖6八至3所述， I-時鐘驅動器156係AND閘極並用作AND閘極。但是，若需要，各AND閘極156可包括可為雙相時鐘驅動器的_時鐘驅動器。輸入區塊198為選取懸置/遲緩控制邏輯的特定類型提供合適的時鐘控制邏輯。因此，在圖3的範例中，輸入區塊198可包括反相器16〇&GCSR 15〇之該第一級的鎖存器 154的一第一級194。同樣地，對於圖4與5的範例，輸入區塊130可包括鎖存器174與AND閘極176。圖7顯示本發明應用於管線超純量處理器2⑽之一範例。該超純量處理器200包括一I-單元202與一E-單元220。該I-單元202包括一指令快取記憶體（IC ACHE)204、包括一指令擷取單元（instruction fetch unit; IFU)及一分支單元（branch unit ; BRU)的一組合IFU/BRU (206)、一調度單元（dispatch unit ; DPU) 208、一完成單元（completion unit ; CMU)210 及包括一分支歷史表（branch history table ; BHT)與一分支目標位址快取記憶體（branch target address cache ; BTAC) 的一分支位址單元2 1 4。此外，I-單元202還包括監視及時鐘 86560 1232408 方文輯216。5亥等單元2 04、206、208、210與214的運作與廣為人知的該等對應單元實質相同，但是如下文所述，系根據本發明藉由監視及時鐘控制邏輯2 1 6計時。一 L ¥ IFU/BRU 2 06中的ifu每週期自ICACHE 2〇4擷取指 7擷取頻見（fetch bandwidth ; fetch_bw)在先前技術處理為中係固定為每週期擷取指令的最大數，現可由監視及控制远輯2 1 6即時调節。根據可用的自由空間，將所擷取的指令放置於IFU/BRU 206的擷取佇列（fetch queue ; FETCH—Q)中。IFU中的指令擷取位址暫存器⑽价uCti〇n fetch address register ; IFAR)指導指令擷取並於各週期的開始提供下一擷取位址。IFU將各週期的每個下一擷取位址設疋到以下各項之一：（a)為先前週期IFAR值的下一次序位址，增加到足以滿足在前一週期被擷取至Q的指令數目；（b)決定或預計在前一週期採用的一分支指令目標；或 (c)在確定係先前錯誤預計後，一分支指令的正確擷取位址。分支及指令擷取位址預計硬體2〇6包括分支歷史表（BH丁）及分支目標位址快取記憶體（BTAC)，並引導指令擷取程序。按對應固定頻寬參數（fetch一bw或disp—bw)的決定，各活動擷取（或调度）週期通常擷取（或調度）固定數目的指令。作疋’當E -單元2 2 0指示一減速/懸置係必要時，除上述時鐘調節外，該較佳具體實施例處理器（該範例中的2〇〇)即動態調節各fetch__bw與/或disp_bw的值。 E-單元的減速/懸置（或其相反的加速/繼續）信號係同步為在E-單元内產生及監視的狀態信號組合功能。該等狀熊 86560 -22 - 1232408 化號可包括.⑷問題佇列fxq 229、lsq M2、Mq “ο與 VXQ 246的滿度或空度指示；（b)- DCACHE 238擊中或未中事件；（c)E-單元内部共用匯流排流量擁塞或沒有擁塞（例如口口一早一匯流排可共用（及仲裁）以攜帶完成資訊至完成單疋24〇)，或（d)由於分支誤預計或誤考慮的其他形式而產生的執行官迢淹沒或重發條件。該範例的FXU管道中可執仃處理器分支指令。或者，可使用-單獨並行BRU管道執行分支指令。可藉由一或兩個I-CLK調節與/或壓縮相關的^單元匯流排頻見判定E-單元調節仁單元管線流速率的減速/懸置信號而不藉由停用接收ICACHE之所擷取資料的半數線來調即時鐘（如在可有效二等分的特定存取fetch—bw上卜因此，為在調節頻寬模式中節省電力，$目通f數目之—半係在才曰7、爰衝為中（在IFU 2〇6内）。一般而言，根據下游單元減速/懸置顯示的嚴重性，擷取頻寬可調節至正常模式的任何分數，包括一直到零。同樣地，為節省電力可調節調度匯流排頻寬（bus bandwidth ; disp一bw)，以按需要或指示向耗能的E-單元執行管道調度較少的指令。 E-單元220包括一固定點執行單元（fixed P〇int executi〇n unit; FXU) 222、一載入儲存單 s(1〇adst〇reunit ; lsu)224 浮”、，占執行單元（fl〇ating p〇int executi〇n ；ρρυ) 226 及一向量多媒體延伸單元（vect〇r multimedia extensiQn unit ;VMXU) 228。FXU 222包括一固定點佇列229及一固定點單元執行單元管道230。LSU 224包括一載入儲存佇列232 86560 -23 - 1232408 及一載入儲存單元管道23 4。FXU 222與LSU 224均與通用暫存器236通信。LSU 224也與資料快取記憶體238通信。 FPU 226包括一固定點佇列240及固定點單元管道242，以及固定點暫存器與重新命名緩衝器244。LSU 224也與固定點重新命名緩衝器244通信。VMXU 228包括一向量延伸佇列 246及一向量多媒體延伸單元管道248。各單元 229、230、232、234、236、238、240、242、246 與248均與廣為人知的該等對應單元的運作實質相同，但是如下所述，係根據本發明計時。如任何典型最新技術之超純量處理器一樣，在特定工作量執行階段，FXU 222與FPU 226中的活動很可能常常可相互排斥。當FPU 226在用時，該項較佳具體實施例處理器200可停用或減速FXU 222的局部時鐘，反之亦然。此外，該項較佳具體實施例處理器2〇〇允許LSU 224與FPU 226相互懸置/遲緩時鐘速度。該等單元内、細粒要求驅動型時鐘調節模式係對已說明之單元間粗粒模式的補充。與上述較佳純量處理器範例不同，E-單元220中的該等兩個單元224、226沒有直接資料流路徑生產者一消費者關係，即在LSU 224與FPU 226之間不存在一直接的資訊流。該等兩個單元224、226之間的通信係透過資料快取記憶體/記憶體及浮點暫存器檔案244間接地實施。一般而言，一;ppu管線242具有若干級（如在現代千兆赫範圍處理器中為6至8級），而一般LSU執行管道234為2至4級。因此且因為當前的處理器具有大量暫存器重新命名緩衝器，在DCACHE 23 8擊中 86560 -24- 1232408 I1白&，LSU官迢234趨向於實質上較？1)1；管道242提前運行。另方面，在叢集DcACHE 238未中階段，可大幅增加有效的LSU路徑延遲。若一系列快速未中暫停dcache 238，則 LSU問題佇列232填滿，因而可暫停上游生產者。本發明使用上游資源的活動驅動細粒時序時鐘閘控或Fpu 226局部時鐘調節利用此特點。圖8顯示E-單元200之LSU 224與Fpu 226的一更詳細範例。在该項貫施例中，LSU事件/活動狀態監視邏輯25〇監視各種LSU佇列的應用，並為LSU 226驅動一活動狀態。在該範例中’ LSU佇列包括一載入-儲存問題佇列（1〇ad_st〇re issue queue’ LSQ)232、一擱置載入 j宁列（pending i〇ad queue; PLQ) 252及一搁置儲存仔列（pending store queue PSQ) 254，以及 DCACHE 236。監視DCACHE 236，並再記錄快取記憶體問題事件。應明白，該等四個單元232、236、252、254係僅為示範而選用，也可監視更少或更多的佇列及事件。對於該範例，LSU事件/活動狀態監視邏輯250判定在該範例中傳送至FPU 226的一輸出暫停位元256。若需要更細的控制，可使用一組暫停位元。為控制可將暫停位元25 6傳送至ϊ-單元202及IFU或DISPATCH單元206的時鐘控制。舉例而言，開始時，LSU活動狀態監視邏輯輸出暫停位元25 6及FPU活動狀態監視暫停260的輸出258均被解判定 ’使LSU 224與FPU 226均處於通常的全調節運作。若FPU 活動狀態監視暫停位元2 5 8被判定而L S U活動狀態暫停位元2 5 6保持未判定，例如由於F P Q内的南使用率。則l S U局 86560 1232408 鐘被調慢，以允許Fpu 226趕上LSU 224，後者由於快取記憶體擊中階段而快於;fPU 226。相反，當LSU活動狀態暫停位元256被判定，而FPU活動狀態暫停位元258保持未判疋，則FPU局部時鐘被調慢。若LSU與Fpu的暫停位元同吩被判定/或解判定，則根據£_單元22〇或1_單元2〇2的其他 ‘件LSU與的局部時鐘均調慢或調快相同頻率。有利的是，回應其他處理器或系統單元中的活動/不活動，本發明可選擇性地減速、加速或閘閉—單元或—單元内的-組件，即本發明具有可變時鐘控制粒度。各單元局部時鐘控制係源自按資料流方向向前及向後流動的活動及^ 訊。與先前技術時鐘閘控之全有或全無不同，該較佳呈體貫施例的適應時鐘可使用前饋及回授控制以提供具有旦性頻寬調節的更靈活的一般時鐘調節機制。因此’在閘閉先前技術管線單元時可能遺失的擱置資訊即可在單元仔列中保持在場且具有合適大小。該等單元係 =關各料元内的活動的資訊動態控制。回應表示減慢 I其他早^的活動級別’特定組件中的時鐘速率被逐周丨又’甚至調到零。當所監視的活動級別表示已經回到正當拄 —^ 士〜于儿而要復至该特疋組件中的時鐘速率即再次逐漸恢设至其原來的正贪纟f ㈣的組件的局部時鐘速率 ^的週期期間，淨系統電力消耗即時鐘速率均以及時及預計方式調整各、·且件體控制不會導致任何只“率史化的硬新計算捨辛的卜“的U貝失’如因額外暫停或重 ”而產生的性能損失。先前技術處理号中 86560 •26- 1232408 的此種損失引起的性能劣化幾乎被減低至跫。與傳統時鐘間控方法相比，調慢或調快頻率的時鐘速率漸進調節方式確保優良的（即更平和的_擺_/_ I·生口此丁知速率的逐漸降低或升高使電感雜訊降至最低。因此，-較佳具體實施例系統消耗的電力大幅降低，同時沒有明顯的性能損失。平均電力降低而沒有性）的性能損失（如每週期指令 '、，σ冓次1pc) ’也不需要重大額外硬體。當需要嚴格遵守最大電力消耗及溫度限制時，本發明可成功控制電力消耗，同時將性能損失限制於預定科時間視ί保持正常運作條件及快速返回正常運作。 ,個別“組件的動態活動級別係使用傳播於該整個晶片或系、’光的I $步時鐘框架内其他組件的時鐘速率來監視。而且，與具有局部計時的非同步（或自我計時）單元、或且有全局非同步控制之多同步時鐘域的多時鐘同步系統的；步糸統或處理器不同，本發明不要求和動搖單獨被計時组件之間的協定保持同步。此外，本發明動態調節各種組件的時知速率，亚將通常與傳統粗粒時鐘閘控方法相關連的電感雜訊減至最小。雖…、本备明已就若干（示範）較佳具體實施例加以說明，但是熟知本技術人士應明白’本發明可在隨附申請專利範圍的精神及範疇内進行修改。【圖式簡單說明】上述本兔明，兄明性具體實施例的詳細說明並參考圖式可更加瞭解上述及其它目的、觀點及優點，其中·· 86560 -27- !2324〇8 圖丨顯不一典型最新技術管線純量處理器的範例及一對應指令時序圖；圖2顯示根據本發明之以要求或活動驅動電力量處理广較佳具體實施例的高位準範例；、、. 固..、、員示活動監視與時鐘控制邏輯懸置電路的一第一範一 “中閘控控制移位暫存器（GCSR)使系統時鐘傳遞至 ^官道的個別級；圖4顯不一減速電路的第二範例，該電路可用於替代或包括在根據按要求調節之本發明的—項較佳具體實施例之圖3的懸置電路中； • β ”、’員示圖3與4的具體實施例的變化，其中個別sr級輪出係傳遞至對應的個別AND閘極；圖6 A至B更詳細地顯示以對應於圖3斷面圖的純量^管道的另一範例；圖7顯不本發明應用於管線超純量處理器之一範例；圖8顯示圖7之E-單元之Lsu與FPU的一更詳細範例。【圖式代表符號說明】 100 管線純量處理器 102 指令單元（I-UNIT) 104 執行單元（E-UNIT) 106 暫存器檔案（REGFILE) 108 指令快取記憶體（IC ACHE) 110 資料快取記憶體（DC ACHE) 112 局部時鐘緩衝器（LCB) 86560 -28· 1232408 114 115 116E 1161 118E 1181 122 124 126 128 138 150 152 154 156 158 160 170 172 174 176 178 180 182 局部時鐘緩衝器（LCB) 同步時鐘輸入彳宁列輸入仵列管道管道 I-單元 E-單元活動監視及時鐘控制邏輯活動監視及時鐘控制邏輯 E-單元時鐘閘控控制移位暫存器（GCSR) 個別級正反器 AND閘極 I-時鐘反相器減速電路遲緩選擇正反器/1位元觸發計數器 AND閘極減速選擇GCSR 最終級輸出 AND閘極 86560 -29- 1921232408 194 196 198 200 202 204 206 208 210 214 216 220 222 224 226 228 229 230 232 234 236 238 240 暫存器級第一級鎖存器第二級鎖存器輸入區塊管線超純量處理器 I-單元指令快取記憶體（ICACHE) 組合 IFU/BRU 調度單元（DPU) 完成單元（CMU) 分支位址單元監視及時鐘控制邏輯 E-單元固定點執行單元（FXU) 載入儲存單元（LSU) 浮點執行單元（FPU) 向量多媒體延伸單元（VMXU) 固定點佇列固定點單元執行單元管道載入儲存佇列載入儲存單元管道通用暫存器資料快取記憶體固定點佇列 86560 -30- 1232408 242 固定點單元管道 244 固定點暫存器及重新命名緩衝器 246 向量延伸彳宁列 248 向量多媒體延伸單元管道 250 LSU事件/活動狀態監視邏輯 252 搁置載入佇列（PLQ) 254 搁置儲存佇列（PSQ) 256 LSU活動狀態監視邏輯輸出暫停位元 258 輸出 260 FPU活動狀態監視暫停 A 陣列 B 陣列 C 陣列 disp_bw 固定頻寬參數 fetch_bw 固定頻寬參數 FPU 浮點執行單元 FXU 固定點執行單元 LSU 載入儲存單元 86560The variable current drop and gain can be set at a daily level. The inductance (Ldi / dt) noise level of unacceptable electrical voltage. Second, the gated switching program requires additional cycles to maintain proper functional operation. For fine-grain phase changes in workload, switching too frequently between gated and actuated modes can lead to unacceptable performance degradation. Moreover, the “latest idle” relies on locally generated pipeline-level clocks for interrogation of gated signals or conditions. You ‘‘ Paid Times ’, for example, based on the condition of invalid or irrelevant computing elements. These latest methods do not generate gated signals as expected or expected. So 'timing requirements are usually important' because the gate signal must be provided before the decision and it is used to determine the proper period of the error-free clock gating operation. Gowan, M. 86560 10 1232408 K ·, Biro, L. L. and Jacks On, DB described these in the 1998 "Acm / ieee Design Automation Society" 4 paper "Power Considerations for Alpha 21264 Microprocessor Design" (pages 726 to 731) (June 1998). Limitations can greatly complicate design timing analysis and even cause clock frequency performance degradation. Regardless of whether the basic control mechanism is feedforward (causal flow) or based on feedback (causal flow), the new control technology (regardless of coarse or fine) is only used for space control. This is because application information is used to eliminate redundant clocks in the affected area, regardless of time activity or history in areas such as Mayfair or other areas of the machine. Activity states and events in downstream (consumer) units and levels (such as execution pipelines or problem queues) are not: to regulate the time when upstream (producers) are in non-adjacent areas (such as instruction extraction or weekly units) Knowledge or poor flow rate. Similarly, the activities and events in the upstream producer zone j are not feed forward to adjust the downstream consumer's clock or rate: • rate. And if the clock signal has been activated or not activated, it may or may not be affected by the intermittent clock signal. There is a need to improve the clock circuit of the connected Lu line early element that can operate with fine-grained space and time granularity. The large M /% of the circuit can be (extra) degraded and there is no large current / voltage swing to the substrate. [Summary of the Invention] The purpose of this Maoming is to reduce the processor's power and performance loss. The magic jade force 4 consumes at the same time and there is no obvious problem. This is a synchronous integrated circuit, such as ton β ~ # + processor. The circuit components or units are processed by the common or ultra-scalar sources, and are clocked and synchronized to the stage. For example, a time-keeping unit includes a plurality of register s-line stages. The local clock generator junction in each timed unit 86560 1232408 combines the pause state of the common system clock with one or more other units to adjust the register clock frequency up or down. [Embodiment] Referring now to the drawings, more specifically, FIG. 1 shows a scalar processor i. . Example high level block diagram and sequence diagram. The main functional data path is divided into two, which are generally referred to as instruction unit s (instruction unit; bu 叩 _ two-line unit (^ eCUtiGn _; ㈣ group) 1 G4. The present invention does not explain I- 单幻 02 and E -Many fine-molecular units and functional logics in the early days 104 (such as branch prediction logic), therefore, to simplify the description of the entire clock control of the processor 100, it is omitted. This example is where the Wangli device 100 is a pipeline Representative of scalar design, this design does not include clock gating to prevent redundant timing of the unit, subunit or its included auxiliary storage resources. Access by unit 1.2 or 1 () 4-or both The on-chip storage includes a register fUe (REGFILE) 106, an instruction cache (· ICACHE) 108, and a data cache (DCACHE) 111. REGFILE 106 is generally a shared resource that can be accessed from units 102, 104, so it is considered a separate entity. icache 08 is the first termination stage of the I-unit pipeline and is therefore considered part of unit 102. The DC ACHE 110 is usually accessible separately from the E-unit 104 and is therefore considered as part of the E-unit 104. Two separate local clock buffers (LcB) i 12, 114 each amplify and allocate a common synchronous clock 115 to the corresponding unit in units 102, 104. Each unit 102, 104 includes an input array 1 1 61, 1 1 6 and a pipe 1181, 118E. As needed! -In units 102 and 104, 86560 1232408 may have a further level of LCB 112, 114 for the common system clock *, field size 7 knives, amplification and control. For this example, the computer program flying instruction is included in the first stage of the I-unit pipeline, ICACHE 108. —Private priest Μ々 Lu, Ic ache i μ γ t108 can suspend a variable number of processor cycles under various conditions that may cause ICache miss. This pause implies the effect of a missed f + on the first instruction transfer level (ie, the lower level of the instruction's hidden body level). —The synchronized clock is continuous. Each unit passes through U2, 114, regardless of the pause of ::. Switching capacitor modulation and pattern bit change cause power consumption to be affected, but the range of change is small. From the beginning of the execution of the formula to: beam, each clock cycle consumes approximately the same energy (represented here by the standard energy early). Therefore, by using the power gp reading method, especially according to the power saving method of this Maoming, considerable savings can be saved Energy. During coding generation, 'by inserting special instructions included in the instruction set structure, the compiler can synchronize the coarse idle control, or can be operated by the operating system ~ each deduction, such as when implementing A special interruption or at a certain background time. At the coarsest control level, a special sleep-type command may be issued, and the special sleep command may generate a disable signal to select the chip: stop the clock for a period of time. The same sleep The command can also be used to disable the instruction fetch program. Similarly, when the disable signal goes negative, or after the sleep cycle,-implicit wakeup is started, or available-explicitly asynchronous Assertion: Wake. As the skilled person knows, the clock distribution tree can be selectively disabled at each level of the LCB level to provide various power saving modes (such as nap, light sleep or sleep). More detailed at the next level The granularity of the compiler, whenever the compiler can 86560 -13-1232408 statically predict the calculation phase ^ start to close a specific unit ... edit the pool to insert special instructions, open, li, such as the clock of the floating-point unit. Including the self-detection mechanism, it is necessary to open a hole to allow him to deactivate his clock. Ye Ri Shi = 7 ^ When he is idle now, allow the processor to be partially idle in the hardware. nf, work. Then, the detection result can trigger the deactivation of a received chop cup in a idle area of a 4 king. The wake-up is also self-initiated in a similar way based on the new task received by the deactivation or sleep unit. For fine idle control, At clock. For example, as-Super 2 = signal gates the local time t ^ ^ cyclic state one cycle at a time, when the processor decodes the instruction and / or executes the cycle after Ik- The skill of Ding 3 Li This method confirms that the processor in the Λ mechanism works well, so the gate control decision is made between the “accessible I” (that is, during the dissolution of the mother or the dispatcher). , Then even for ^ 幻 can be issued. Time to generate such gate control letters. ♦ In any pipeline data path, it is possible to dynamically_ and selectively prevent redundant juice temples such as' propagating along the logical f-line—data valid flag. The data valid flag is set only when generated from: periodic data is valid . Then, the data valid flag of each logic stage can be used as a clock actuating signal for setting the output latches. Therefore, 'these pipeline stages need not be timed for invalid leans', these pipeline stages can be referred to as fine-grained, valid bit-based pipeline stage-level clock gating. On June 12, 2001, U.S. Patent No. 6,247 was granted to Ru et al. , 1342 5 Pipeline-level gate control method and system in the power operation pipeline circuit of Hutong Province "86560 1232408 said that a kind of processor whose logic will be implemented by the first level of this logic: Any newly received operands that change in the pipeline are identified as irrelevant. This invariant signal is made irrelevant and can be used to stop the clock of the first stage, and then the clocks of subsequent stages are continuously deactivated. τ ^ hnishl, M. , Yamada, A ”Noda, Η. With Kambe, T. In the 1997 International Symposium on Low Power Electronic Products and Design (y · Symp · 〇n jowei · Electronics and Design; ISLpED), the paper "And Beyond. And. Several Tens of Timing Detection and Power Reduction Methods" ( Pages 1 3 1 to 136) describe other more detailed "measurement mechanisms" to prevent timing of various redundant latches. As mentioned above, regardless of coarse or fine control, these latest technology idle controllers: self-triggering, space or feedforward, that is, the gated conditions or signals are based on the detection of early idle or invalid states and local Generate and avoid unnecessary cycle-by-cycle clock and data transmission when there is such a single 7L which can be an entire area or a functional unit, or can be a pipeline ', put, or idle state bits. A first-level latch group is inversely 'in the specific embodiment of H'-a scalar pipeline processor adjusted with a day clock that requires horse movement and an execution unit (eXeCUtl0ri unit; E -unit) An instruction unit Gnstrucuon unit with adjustable operation; Bu Xianyu established a relationship between producers and / or consumers between the two units. The producer unit forwards the existing data actuation instruction to the execution unit Fanfan to maintain an active state register with at least 1 bit of information in it at a rate not faster than the execution unit can accept. In the specific embodiment of this item, the μE unit pair is clocked by a common synchronous clock 86560 -15-1232408. However, the clock of each unit is partially modified and controlled according to the… activity information passed between the units. Bureau of each unit. = = Active state of local unit and remote unit (ie, two units) = = High-level example of the preferred embodiment of driving power control scalar-^ 20 according to the present invention, with its component map μ Identical components are the same. In this specific implementation, the processing units 122 to 124 each include activity monitoring and clock control logic 26, us, and monitoring unit activity levels. In a simplified specific embodiment, a single-active state :: no line-a pause bit, indicating a pause / non-pause condition. When the deduction unit (3) senses the -pause condition (current or imminent), it determines that the pause position is ⑼0. The pause bit 130 is used to lower the clock speed of the BU unit clock CLP 132 to lower it! -Unit 122, and therefore reduces (or cuts) the command rate to Dagger 124. According to the control granularity, the E-unit active state or the pause bit 130 can adjust its own clock in the E_unit 124. For example, when the E-unit i24 is suspended, clu⑴ is raised to its normal clock rate. When I-Unit 1 22 detects a suspension condition (such as an icache 108 miss), 'the μ pipe 118i is empty after a few cycles, so Bei Xun, which is not given to & Unit 24, will be empty. 3 The 6 series is used to reduce ux to save power. This up / down adjustment capability of activity monitoring and clock control logic 1 can be implemented using a number of different methods, and several examples are provided below. For example, each unit clock can exit one pipeline stage at a time. Similarly, when the brake condition ends, the unit clock returns to a pipeline stage again. This control series T allows Hui units to quit or return in time without losing effective information. 86560 -16-1232408 The toilet does not increase interface logic, buffers, nor waste pipeline occupation and recirculation%. Alternatively, the unit clock frequency can be slowed down or gradually reduced to save and, if / as required, the clock can be subsequently slowed down to a stop. FIG. 3 shows a first example of the activity monitoring and clock control logic suspension circuit. The gating control shift register (GCSR) 150 passes the system and system clock U 5 to the pipeline 1181. Individual level 152. The GCSR 150 is a one-element linear shift register of the i-bit flip-flop 154, which corresponds to each pipeline stage 152. The system clock 115 is AND at each level, and each element in one of the AND gates 156 selectively transmits the system clock 115 as an and gate output to a corresponding I-pipe stage 152, which together constitute 1_ ( : 1 ^ 〖158. In this example, the E-unit pause is usually not detected, and the pause bit 13 remains low and is inverted by the inverter 160. With all of the GCSR 15〇, the pipeline 丨丨 Phantom is in the section of King Wither. Therefore, the AND gate 1 5 6 passes the unchanged system clock J1 5 as I-CLK 158 to each level of the I-pipe stage 152, each of which is equivalent to the system clock Π5 /, E_ The unit V bit 130 remains at 0, and it is shifted to gcsr 150 to maintain the I-pipe 1181 in full regulation. In this example, when the E-unit activity monitoring and clock control logic (Figure 2 respectively) 124 and 128), when it senses an imminent pause condition, it determines the unit pause bit 130. The pause bit 13 is inverted by the inverter 160, and it indicates a zero to the GCSR 150, which is then the next The system clock cycle is shifted to GCSR 150. In the same cycle, the first pipeline stage (ie, ICACHE 1 16 from Figure 2) is stopped , Pause the data input of the channel 181, and shift 0 to and pass the GCSR 150. As long as the pause bit 13 is determined, 0865560 -17-1232408 is shifted to GCSR 15G, every system clock cycle _ times. Each zero then stops passing the system clock to the AND gate of the left-to-right ripple in this example. Therefore, the first stage of the pipeline clock 158 closes from left to right at some time. Prior to the suspension of bit 13Q, the effective channel project continues to enter the E-unit through the I-pipe 11. To avoid losing the effective pipeline in the ^ pipeline! Although the available space of E_unit buffer (queue 11 6E) is equal to When 1_the number of valid items in the pipeline 1181, the E_unit pause bit 13 must be determined. = When the space of the parent-standby available E-unit buffer (case 116E) is equal to the number of valid items in I · pipe 1181 ', it must be determined that E_unit pause bit ⑽. In extreme cases, the latter is directly equal to the length of the pipe 1181. Therefore, in the traditional design, whenever the E-unit queue 116E is filled to make it available (the number of empty items in the column is equal to the number of I-pipe levels, the E_ unit 124 determines the suspended bit 1 3 0. Ε- The blank or invalid items in cell row 11 6 Ε are assumed to be clock gating with pre-entered valid bits (to save power), just like traditional fine-grained, level-level clock gating. Similarly, when the bit is paused When the return from 13 to 0, that is, when it is detected that the unit movement has fallen below the predetermined level, the reverse upward operation will begin. The low value of the pause bit 130 will resume the shift uGCSR 15 and the shift is valid at the same time 'Beijing to 1-Guandao 1181. Therefore, the super-continuous follow-up system clock cycle LCLK 158 of pipeline 1181 is actuated step by step, so that pipeline 1181 resumes normal operation and transfers the shell material to the dagger unit 124 in full adjustment. Gating 158 off / on prevents large current swings, thereby minimizing the effect of Ldi / dt noise on the supply voltage. Figure 4 shows the second example of a speed reduction circuit 170. The t-phase circuit can be used instead of 86560 -18-1232408 Or included in the hair In the suspension circuit of FIG. 3, which is a preferred embodiment of the invention, the speed-reduction circuit Π0 is substantially similar to the suspension circuit of FIG. 3, and the common or common components are labeled the same.-The delay selection port] is provided to the flip-flop The / 1 bit triggers an inversion set input of the counter 174 and the AND gate 176. When the slow selection bit 172 is high, that is, determined, the side gate m selectively passes the system clock 115 to the clock i bit Counter 174. The slow selection GCSRm is substantially similar to gcsri50 in Figure 3, except that only the final stage output 180 is passed to all the Na gates 182. The Nana pole can be a three-input AND gate, providing a suspension of the AND gate 156 of Figure 3. Therefore, the AND closed pole 1δ2 can combine the final stage output 180 with the system clock 115 and GCSR (130 in Figure 3, Xiyi / 6; 丨 wi:-, 1 of the Sichuan example is not obvious. ) A corresponding stage output. The individual output clocks of the AND gate 182 correspond to the ^ pipe stage. In this specific embodiment, the response delay is selected 172, and the clock frequency of the J-unit can be adjusted low (high). (Solution) The slow choice is 172, £ ^ to the direction!-Unit reminder £ Delay (increase) requirements in the unit. In addition, all the characteristics of the above-mentioned county i-clK158 can be maintained-or multiple cycles, as described above: Figure 3 'When the E_ unit queue is almost full, trigger once Suspended, so KGCSR 15 provides a stop output to the corresponding individual Na gate 182. As described in ^ 'When the length of the E-cell activity array decreases below a predetermined threshold' I-CLK 158 restarts. In normal Under operating conditions, the slow selection of 172 is low (so that the 1-bit counter outputs a continuous ancient quasi-simple, 50-ship system including the right to GCSR 178). Bayer and bit value). Therefore, GCSR178, all 1s, and the system clock 115 will be unchanged stage 152. When judged to be slow, Lu Dao chose Sichuan, and issued E-early deceleration request letter 86560 -19- 1232408 (due to the temporary bar), AND gate 1 76 prevents 1-bit control counter! When 74 is triggered, Tian Chi, k selects 174 to rise to its inversion group input, and AND gate 176 transfers the system, the prophet 115, the bit control counter 174 is released. The bit control counter 174 starts to trigger the alternating sequence of 0 and 1 to be passed to the GCSR 178. Once the alternating pattern propagates through the GCSR 178, the ^ CLK control of the AND gate 182 is activated and deactivated at alternate clock cycles. As a result, this will be treated as a halving of the main system clock frequency provided by I-CLK 158. -Figure presents a variation of the specific embodiment shown in Figures 3 and 4, where individual partner-level outputs are passed from the stomach to the corresponding individual AND poles 182. The operation is essentially similar to Figure 4, but the final I-pipe clock adjustment may be different. In the adjustment (deceleration) phase, the m-channel generation system is operating in a more state according to the _ specific system and system clock: the period is calculated to be -half of the system clock frequency. Each ^ Dao level is only a demonstration ‘Ming Yi valid bit (V). These valid bits are usually found in the I-pipe, E-pipe, 卡 Kale and E 卞 structures of the example in Figure ⑴. According to the traditional feed-forward scheme, the upstream I-unit effective bit system is transmitted downstream to the dagger unit, so that the s-line and the effective bit provide an additional gate control for the AND structure that synchronizes the local level clock. The specific [pipeline timing_basic circuit (not shown) of the specific embodiment prevents the effective data of the magistrate level from being overwritten during the adjustment (deceleration) mode, such as excess between stages, occasional placeholder latches, or It is possible to duplicate the master and slave parts of each latch stage of the information store. FIG. 6UB shows in more detail and another time sequence diagram 190 and _corresponding sequence diagrams of the Yanchang Pipeline II 81 corresponding to the cross-sectional view of FIG. 3. Each pipeline stage is composed of a memory stage 192 on the input and output. In this example, each 86r 86560 -20-1232408 latch state 1 5 4 series two-level latch cry, the new ^ π is basically a single-serial-input / parallel-output-save-single-J . V. Bit Wu. The two-stage latch 154 includes a first-stage latch 194 and a second spinning county six σσ, η, and the lock port 196 and 196. This first stage latch 194 is the same as the latch of the register stage 192. ^ The first stage latch state 196 uses a negative clock polarity from a port of the first stage latch 194. 1 material actuated by the negative day guard buckle. Thus, for example, when the first-stage latches 194 are clocked on the rising clock edge, the second register] 96 is clocked on the falling clock edge to ensure that a valid input is provided to the clock driver 156. As described with reference to FIGS. 3 to 5 and FIGS. 6 to 8, the I-clock driver 156 is an AND gate and functions as an AND gate. However, if desired, each AND gate 156 may include a clock driver that may be a two-phase clock driver. The input block 198 provides suitable clock control logic for selecting a particular type of suspension / latency control logic. Therefore, in the example of FIG. 3, the input block 198 may include a first stage 194 of the first stage latch 154 of the inverter 16 & GCSR 150. Similarly, for the examples of FIGS. 4 and 5, the input block 130 may include a latch 174 and an AND gate 176. FIG. 7 shows an example of the present invention applied to a pipeline superscalar processor 2 ′. The superscalar processor 200 includes an I-unit 202 and an E-unit 220. The I-unit 202 includes an instruction cache memory (IC ACHE) 204, a combined IFU / BRU (206) including an instruction fetch unit (IFU) and a branch unit (BRU). A dispatch unit (DPU) 208, a completion unit (CMU) 210, and a branch history table (BHT) and a branch target address cache ; BTAC) branch address unit 2 1 4; In addition, I-Unit 202 also includes surveillance and clock 86560 1232408 Fang Wen Ji 216. The operations of units such as 5 04, 206, 208, 210, and 214 are essentially the same as those well-known counterparts, but as described below, It is timed by monitoring and clocking logic 2 1 6 according to the present invention. One L ¥ IFU / BRU 2 06 ifufu fetches from ICACHE 2 0 per cycle 7 fetch bandwidth (fetch bandwidth; fetch_bw) in the prior art processing is fixed to the maximum number of fetch instructions per cycle, Now it can be adjusted instantly by monitoring and controlling remote series 2 1 6. According to the available free space, the fetched instructions are placed in the fetch queue (FETCH-Q) of the IFU / BRU 206. The instruction fetch address register (IFAR) in IFU is used to guide instruction fetch and provide the next fetch address at the beginning of each cycle. IFU sets each next fetch address of each cycle to one of the following: (a) is the next sequential address of the IFAR value of the previous cycle, which is increased enough to satisfy the fetched to Q in the previous cycle The number of instructions; (b) determine or predict the target of a branch instruction to be used in the previous cycle; or (c) the correct fetch address of a branch instruction after determining that it was a previous misprediction. The branch and instruction fetch address prediction hardware 206 includes the branch history table (BH) and the branch target address cache memory (BTAC), and guides the instruction fetch process. According to the decision of the corresponding fixed bandwidth parameter (fetch-bw or disp-bw), each activity fetching (or scheduling) cycle usually fetches (or schedules) a fixed number of instructions. Action: When the E-unit 2 2 0 indicates that a deceleration / suspension is necessary, in addition to the clock adjustment described above, the processor of the preferred embodiment (200 in this example) dynamically adjusts each fetch__bw and / Or the value of disp_bw. The deceleration / suspension (or opposite acceleration / continuation) signal of the E-unit is synchronized. It is a combined function of status signals generated and monitored in the E-unit. These bears 86560 -22-1232408 can be included. ⑷Problem list fxq 229, lsq M2, Mq "ο and VXQ 246 full or empty indication; (b)-DCACHE 238 hit or miss event; (c) E-unit internal shared bus traffic congestion or There is no congestion (for example, the bus can be shared (and arbitrated) early in the morning to carry completion information to the completion order (24)), or (d) the executive officer is overwhelmed by other forms of branch misprediction or misconception or Retransmission conditions. The processor's branch instructions can be executed in the FXU pipeline of this example. Alternatively, branch instructions can be executed using a separate parallel BRU pipeline. One or two I-CLKs can be used to adjust the ^ units related to compression. The bus frequency is determined by determining the deceleration / suspension signal of the E-unit to adjust the flow rate of the pipeline of the unit without adjusting the clock by disabling the half line of the data received by ICACHE (such as in the specific Access fetch—bw. Therefore, in order to save power in the bandwidth adjustment mode, half of the number of $ meeting f—half of the time is only 7 and the charge is in the middle (in IFU 2006). Generally speaking, Capture bandwidth based on the severity of the slowdown / overhang display of the downstream unit Any score that can be adjusted to normal mode, including all the way to zero. Similarly, to save power, the bus bandwidth (disp-bw) can be adjusted to execute the pipeline to the energy-consuming E-unit as needed or indicated E-unit 220 includes fewer instructions. The E-unit 220 includes a fixed point execution unit (FXU) 222, a load storage order s (1〇adstore unit; lsu) 224, ", Occupy execution unit (fl0ating point execution; ρρυ) 226 and a vector multimedia extension unit (vect〇r multimedia extensiQn unit; VMXU) 228. FXU 222 includes a fixed-point queue 229 and a fixed-point unit execution Unit pipeline 230. The LSU 224 includes a load storage queue 232 86560 -23-1232408 and a load storage unit pipeline 23 4. Both the FXU 222 and the LSU 224 communicate with the general purpose register 236. The LSU 224 also communicates with the data cache The memory 238 communicates. The FPU 226 includes a fixed point queue 240 and a fixed point unit pipe 242, and a fixed point register and a rename buffer 244. The LSU 224 also communicates with the fixed point rename buffer 244. The VMXU 228 includes A vector The extension queue 246 and a vector multimedia extension unit pipeline 248. Each unit 229, 230, 232, 234, 236, 238, 240, 242, 246, and 248 operate substantially the same as these well-known counterparts, but as follows This is the timing according to the present invention. As with any typical latest technology superscalar processor, the activities in the FXU 222 and FPU 226 are likely to be mutually exclusive during certain workload execution phases. When the FPU 226 is in use, the processor 200 of this preferred embodiment may deactivate or slow down the local clock of the FXU 222, and vice versa. In addition, the preferred embodiment processor 200 allows the LSU 224 and FPU 226 to suspend / slow the clock speed to each other. The intra-cell, fine-grained, demand-driven clock adjustment modes are complementary to the inter-cell coarse-grained modes already described. Unlike the above-mentioned preferred scalar processor example, the two units 224, 226 in the E-unit 220 have no direct data flow path producer-consumer relationship, that is, there is no direct relationship between the LSU 224 and the FPU 226 Information flow. The communication between these two units 224, 226 is implemented indirectly through data cache / memory and floating point register files 244. Generally speaking, a CPU pipeline 242 has several levels (eg, 6 to 8 levels in modern gigahertz range processors), while the general LSU execution pipeline 234 is 2 to 4 levels. So and because the current processor has a large number of scratchpad renaming buffers, in DCACHE 23 8 hit 86560 -24-1232408 I1 white & LSU official 234 tends to be substantially compared? 1) 1; Pipeline 242 runs ahead. On the other hand, in the cluster DcACHE 238 miss stage, the effective LSU path delay can be greatly increased. If a series of quick misses suspends dcache 238, the LSU problem queue 232 fills up, so upstream producers can be suspended. The present invention utilizes this feature by using the activity of upstream resources to drive fine-grained timing clock gating or Fpu 226 local clock adjustment. FIG. 8 shows a more detailed example of the LSU 224 and the Fpu 226 of the E-unit 200. In this embodiment, the LSU event / active state monitoring logic 25 monitors various LSU queued applications and drives an active state for the LSU 226. In this example, the LSU queue includes a load-storage issue queue (1〇ad_st〇re issue queue) LSQ 232, a pending loading queue (PLQ) 252, and a pending queue. Pending store queue PSQ 254, and DCACHE 236. Monitor DCACHE 236, and then log cache memory events. It should be understood that these four units 232, 236, 252, and 254 are only selected for demonstration purposes, and fewer or more queues and events can be monitored. For this example, the LSU event / activity monitoring logic 250 determines an output pause bit 256 that is passed to the FPU 226 in this example. For finer control, a set of pause bits can be used. To control the clock, the pause bit 256 can be transmitted to the ϊ-unit 202 and the IFU or DISPATCH unit 206. For example, at the beginning, the LSU active state monitoring logic output pause bit 25 6 and the FPU active state monitoring pause 260 output 258 are both determined to be 'so that both the LSU 224 and the FPU 226 are in a normal full adjustment operation. If the FPU active state monitoring pause bit 2 5 8 is judged and the L S U active state pause bit 2 5 6 remains undetermined, for example due to the South usage rate in F P Q. The clock of the USU 86560 1232408 is slowed down to allow the Fpu 226 to catch up with the LSU 224, which is faster than the fPU 226 due to the cache memory hit phase. In contrast, when the LSU active state pause bit 256 is judged and the FPU active state pause bit 258 remains undetermined, the FPU local clock is slowed down. If the pause bit of the LSU and the Fpu is determined / determined, then the local clock of the LSU and the other clocks of the unit 22 or unit 2 202 are slowed down or adjusted to the same frequency. Advantageously, in response to activity / inactivity in other processors or system units, the present invention can selectively decelerate, accelerate, or shut down-the unit or-components within the unit, i.e., the invention has a variable clock control granularity. The local clock control of each unit is derived from activities and signals flowing forward and backward in the direction of the data stream. Unlike all or nothing of the prior art clock gating, the adaptive clock of the preferred embodiment can use feedforward and feedback control to provide a more flexible general clock adjustment mechanism with a densified bandwidth adjustment. Therefore, the shelving information that may be lost when the prior art pipeline unit is closed can be kept in the unit queue and of a suitable size. These units are the dynamic control of information about activities in each material element. The response indicates that the other early activity levels are slowed down. The clock rate in certain components is cycled weekly and again and even adjusted to zero. When the monitored activity level indicates that it has returned to the proper time— ^ 士 ~ and the clock rate in the special component is to be restored, it is gradually restored to the local clock rate of the component that was originally active. During the period of ^, the net system power consumption, that is, the clock rate is adjusted in a timely and predictable manner, and the piece control will not cause any "only historical hard calculations of new calculations" such as "U Be lost" such as Performance loss due to additional pauses or reloads ". The performance degradation caused by this loss in the prior art processing number 86560 • 26-1232408 is almost reduced to 跫. Compared to the traditional clock control method, it is slower or faster. The frequency of the clock rate is gradually adjusted to ensure an excellent (ie, a more peaceful _ pendulum _ / _ I. The gradual decrease or increase of the known rate minimizes the inductance noise. Therefore,-the preferred embodiment The power consumption of the system is greatly reduced without significant performance loss. The average power is reduced without sexual performance) (such as the command per cycle ',, σ 冓 times 1pc)' does not require significant additional hardware When strict compliance with maximum power consumption and temperature limits is required, the present invention can successfully control power consumption while limiting performance loss to a predetermined time period to maintain normal operating conditions and quickly return to normal operation. Individual "component dynamic activity levels Use the clock rate of other components in the I $ step clock framework that propagate throughout the wafer or system to monitor. Moreover, unlike a multi-clock synchronization system with a non-synchronous (or self-timed) unit for local timing or a multi-synchronous clock domain with global asynchronous control; the system or processor is different, the present invention does not require and shake separate The agreements between the timed components are kept in sync. In addition, the present invention dynamically adjusts the time-to-know rate of various components, minimizing inductive noise typically associated with traditional coarse-grained clock gating methods. Although ..., this note has described several (exemplary) preferred embodiments, but those skilled in the art should understand that the present invention can be modified within the spirit and scope of the scope of the accompanying patent application. [Brief description of the drawings] The above detailed description of the specific embodiments of the present rabbit and brother, and reference to the drawings can better understand the above and other purposes, viewpoints and advantages, of which 86560 -27-! 2324〇8 An example of a typical latest technology pipeline scalar processor and a corresponding instruction timing diagram; FIG. 2 shows a high-level example of a preferred embodiment of a wide-bodied and specific embodiment of driving power by request or activity according to the present invention; solid. . A first example of a suspension circuit for monitoring activity and clock control logic of staff instructions. The “Central Gate Controlled Shift Register (GCSR)” allows the system clock to be passed to individual stages of the official road; Figure 4 shows a slowdown. A second example of a circuit that can be used in place of or included in the suspension circuit of FIG. 3, a preferred embodiment of the present invention, which is adjusted as required; Variation of a specific embodiment, in which individual sr stage wheels are passed to corresponding individual AND gates; Figures 6 A to B show in more detail another example of a scalar pipe corresponding to the sectional view of Figure 3; Figures 7 shows an example in which the present invention is applied to a pipeline superscalar processor. FIG. 8 shows a more detailed example of the Lsu and FPU of the E-unit of FIG. 7. [Schematic representation of symbols] 100 pipeline scalar processor 102 instruction unit (I-UNIT) 104 execution unit (E-UNIT) 106 register file (REGFILE) 108 instruction cache memory (IC ACHE) 110 data cache Access Memory (DC ACHE) 112 Local Clock Buffer (LCB) 86560 -28 · 1232408 114 115 116E 1161 118E 1181 122 124 126 128 138 150 152 154 156 158 160 170 172 174 176 178 180 182 Local Clock Buffer (LCB) ) Synchronous clock input, serial input, serial pipeline, pipeline I-unit E-unit activity monitoring and clock control logic activity monitoring and clock control logic E-unit clock gating control shift register (GCSR) individual stage flip-flop AND gate I-clock inverter deceleration circuit slow selection of flip-flop / 1 bit trigger counter AND gate deceleration selection GCSR final stage output AND gate 86560 -29- 1921232408 194 196 198 200 202 204 206 208 210 214 216 220 222 224 226 228 229 230 232 234 236 238 240 Register stage 1st stage latch 2nd stage latch input block pipeline ultra-scalar processor I-unit instruction cache memory (ICACHE ) Combined IFU / BRU scheduling unit (DPU) completion unit (CMU) branch address unit monitoring and clock control logic E-unit fixed point execution unit (FXU) load storage unit (LSU) floating point execution unit (FPU) vector multimedia Extension unit (VMXU) Fixed-point queue Fixed-point unit execution unit Pipeline load storage queue Load storage unit pipeline Universal register data cache fixed-point queue 86560 -30- 1232408 242 Fixed-point unit pipeline 244 Fixed Point registers and rename buffers 246 Vector extension queue 248 Vector multimedia extension unit pipeline 250 LSU event / activity monitoring logic 252 Shelve loading queue (PLQ) 254 Shelve storage queue (PSQ) 256 LSU activity Monitor logic output pause bit 258 Output 260 FPU activity status monitor pause A array B array C array disp_bw fixed bandwidth parameter fetch_bw fixed bandwidth parameter FPU floating point execution unit FXU fixed point execution unit LSU load storage unit 86560

Claims

1232408 Patent application scope: 1 A synchronous integrated circuit, including a common system clock; and a plurality of timed units that are synchronized with the common system at least two /, 'first hope clock synchronization' Each includes: a plurality of register stages; and a local clock to produce sounds, performance, including the old point of mouth to receive the common system clock and pause the heart '... should be suspended to adjust the clock frequency. Wen Shuang temporarily stored state 2 · If the scope of patent application is 1 ^ Synchronous integrated circuit according to item 1 of the national generator. The generator includes a __. Haijukou Μπ 钟 Set several early one-bit counter register (GCSR). Private labor system 3. If the second item of the scope of the patent application includes a plurality of other local clocks, the local clock generates a plurality of local clock drivers, each receiving such multiple single-bit counts to find the complex number ... -Output, and combine this round-out with the system day to generate a-register-level clock. 4. If a patent-pending fan's synchronous integrated circuit, these multiple register stages Department of temporary storage of rice and double m ... The level of the line 'This _ is one of multiple early-bit counters for each pipeline stage including Suan Xun Xun. Department = Yuetun! Wai. Synchronous integrated circuit of 4 items 'Wherein the integrated circuit; processing is as follows' In the scalar processor, at least two clocked units include an I-unit and _ρ σσ ^ Ping Zao 提供, the I-unit provides the E-unit For the time being, 1τ is sad, and 留 μ stay — Λ ^ early 711 provides a pause state for the I-unit. 6. If a patent application [[fl 筮 ς tS member's synchronous integrated circuit], the scalar processor enters 86560 Ϊ232408 step includes ... a data cache memory, which communicates with the E-unit; a register Communication with the I-unit and the £ unit; the I-unit, which includes an I-cache memory, an I-queue that receives lean data from the cache memory, and an ^ Pipeline for receiving data: and E-line receiving L--early yuan for receiving poor materials, including T-line, including an E-pipeline for receiving data from the E-line 7, 7. If applying for a patent The synchronous integrated circuit of the sixth item, wherein in each of the ^ unit and the E- unit, each output of the GCSR and one of the plurality of clock guard drivers correspond to the system clock combination, Each of these multiple '4 drivers is gated-corresponding to the pipeline stage. 9. The synchronous integrated circuit of item 7 of the patent scope is declared, in which a pause-like bear bit is provided to the first stage of the GCSR. A synchronous integrated circuit that states the seventh item of the patent scope, in which a pause status bit is provided: provided to a single-bit counter, the single-bit counter remains sigh (unless the pause status bit is judged, when the pause status bit When the yuan is judged, # ^ 一 ^ one digit 疋 counter counts The one-bit count output is provided to a first stage of the GCSR. ^ Synchronous integrated circuit of item 6 of the patent scope, wherein the bismuth of the GCSR: It round-out is associated with the plurality of clock drivers. Each of these devices is combined.-The pause status bit is provided to-the single-bit counter ^: The bit count is kept set unless the pause status bit is determined when the pause status bit is determined. At the time, the one-bit counter 86560! 232408 implements a first stage of the one-bit counter and one output of the one-bit counter is provided to the GCSR 1. For example, the scope of patent application No. 5-Chao Zhongdan ^ Road, where the pure The amount of processor is ^ ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-, -__ From the 5H pause state bit, the I- step-by-step adjusts a [cache memory capture bandwidth. 2 · As claimed in the scope of the patent claim item 丨丨 the moon bean circuit 'where the E, unit package is dry:-fixed point unit, which receives instructions from the red unit and a general register / rename buffer unit Communication; a loading storage unit, which receives coolness-eight presses the instructions received from the I-early and opens with the general-purpose register / rename buffer buffer to ^ 10 rush early and poor Data cache memory communication;-foreign point single it, which receives instructions from the unit and communicates with the-floating point register / rename buffer unit, the load memory unit enters the foreign point register / Rename the buffer unit communication, the load storage unit provides a suspended state for the floating point unit, the floating point unit provides the suspended state for the load storage unit; and a vector multimedia extension unit that receives the Instructions and communicate with a completion unit of the I-unit. 13. The synchronous integrated circuit according to item 11 of the scope of patent application, wherein the unit further includes: an I-cache memory; an instruction fetch unit / branch unit (IFU / BRU), which receives from the BU Fetch memory instructions; and a dispatch unit that receives the (IFU / BRU) instructions and provides the received instructions to the E_unit 86560 1232408. 14 15. 16. 17. 18. Among them, each of the GCSRs and the respective gates of a pair of synchronous integrated circuits such as the scope of the patent application No. 12 are respectively output with the system clock and the pipeline stages. Step integrated circuit, one of which pauses the first stage. If the homomorphic bit in item 14 of the scope of patent application is provided to the GCSR, the synchronizing circuit as in item 14 of the scope of patent application ^ Compensation circuit, where in each of the GCSRs, a pause state bit is provided m-* Rainbow ,,,, 6 is a bit counter in the early days. The bit counter in the early bits remains set. Unless the pause status bit is judged, and the pause status bit is judged as β 7, the one The bit counter implements four counts. The one-bit counter counts a first stage of the GCSR provided by each of the GCSR. For example, please apply for the synchronous integrated circuit of item 14 of the patent, in which the output after the GCSR is taken and the system of each of the plurality of clock drivers ^ ^ ^ ° A pause status bit is provided to a single bit Element Counting I: The unit counter is set unless the pause status bit is determined. When the pause status bit is judged, the unit-bit count is counted continuously. The early rotation of the beta-hailer-bit counter provides a first stage of the GCSR. A scalar processor includes: a common system clock; a kappa unit clocked by the system clock; and each of the unit and the E-unit clocked by the heart clock & communicating with the BU unit includes : One 86560 1232408 multiple register levels; and, the generator was only known the next day, it received the common system clock and temporary operating state, and responded to the pause state to adjust the multiple register ^, clock frequency The L unit provides a pause status for the E_single, and the & unit provides a pause status for the I-single. 19. 20. 21. 22. 8656. The scalar processor according to item 18 of the scope of patent application, wherein the local clock generator includes: M-gated control shift register (GCSR), which includes a plurality of units. A bit counter; and a plurality of local clock drivers, each receiving an output of one of the plurality of single-bit leaf counters, and combining the output with the system clock to generate a register-level clock. For example, the scalar processor in the scope of application for item 19, wherein the plurality of register stages are stages of a register pipeline, and the GCSR includes a single bit counter for each pipeline stage. For example, the scalar processor in the 20th scope of the patent application, further includes: a data cache memory, which communicates with the £ unit; a scratchpad file, which communicates with the unit and the E unit; And the delta I-unit includes: a cache memory, an I-queue that receives data from the i-cache, and an L pipeline that receives data from the I-queue; and the E-unit Including: an E-queue receiving data from the μ pipeline, and an E-pipe receiving data from the E-queue. For example, the scalar processor in the 21st scope of the patent application, wherein each of the ^ units! 232408 23. 24. 25. 26 · and: an output of the resident-unit and the GCSR and the system clock are combined in a side-equal complex number Each of the clock drivers—corresponding to the clock driver—each of the plurality of clock drivers is a gated scalar processor, such as the 22nd patent application, which is a first stage provided to the GCSR. If the scalar processor of item 22 of the patent application is applied, one of the pause status bits is provided to a -single-bit counter. The single-bit counter: remains set, unless the suspended status bit S is judged, when the The pause status bit, when judged, 'the single-bit counter performs counting, and an output of the single-bit 70 counter is provided to a first stage of the GCSR. Female claim patent | & scalar processor around the 21st item, wherein the G (: SR one t output is combined with the system clock of each of the plurality of clock drivers, a pause status bit A single-bit counter is provided. The single-bit TL counter remains set unless the pause status bit is judged. When the pause status bit is judged, the single-bit counter implements counting. An output is a first stage provided to the gcSR. An ultra-scalar processor includes a common system clock; a unit clocked by the system clock; the L unit includes: an I-cache memory ; An instruction fetch unit / branch unit (IFU / BRU), which receives instructions from the I-cache; and a dispatch unit, which receives instructions from the (Ι] Ρι; / ΒΙιυ) and forwards 86560 1232408 received these instructions for execution. The time-clocked by the system clock—P σσ a & Zao Fan; the E-unit includes: a fixed point unit, which receives σ — Mumu I-zao instruction Communicate with _ register / rename buffer unit; load storage early (LSU), which receives the instruction of the unit and identifies the general register / rename buffer unit and -data cache memory Communication; U-floating point unit (PU), which receives instructions from the b element and checks a floating point register / renaming buffer. 〇〇 The σσ communicates early, and the loading storage unit further temporarily stores the cry with the floating point. σ / rename buffer unit communication, the load storage unit provides a pause status for the floating point unit, and; the floating point unit provides a pause status for the load storage unit; and extended to the multimedia early, it receives the ^ The unit's instruction parallels one of the I-units to complete unit communication; and 〆, the local clock generator in each of the LSU and the FPU receives the common system clock and the suspended state, and responds to the suspended state. To adjust the clock frequency of the unit register. 27. The ultrascalar processor of claim 26, wherein the pause status indicates the activity level of a unit. 2 8. The ultra-scalar processor as claimed in item 27 of the patent scope, wherein the unit responds to the E-unit pause state to adjust the acquisition bandwidth. 29 · —A method for controlling the local frequency of a local clock of an integrated circuit chip component 'The method includes: 86560 1232408 monitoring the activity level of the chip component; and adjusting in response to one of the activity levels of a second component exceeding one indication to adjust-the first- The component's local clock frequency.丰的 4 30. If the patent application scope of the 29th item of the salamander when the brother's second component of the activity level rises to the size of the foot ruler, 隹, ^ 31. If the date of the application of the scope of the patent 29th item 1τ ^ ^ ^, Go, where the second component is the point. When it is above +, the clock frequency is second-class. 32. For example, if you apply for item 31 of the scope of patent application, ^ Chu / Chinese ¥ When the activity level rises above the critical level of one brother, the clock is paused. 33. If the number of patent applications ranges from ㈣ to the critical level, the rate will decrease. '^ The local clock returns to a normal operating frequency 86560