TW201322120A - An accelerated code optimizer for a multiengine microprocessor - Google Patents

An accelerated code optimizer for a multiengine microprocessor

Info

Publication number
TW201322120A
Authority
TW
Taiwan
Prior art keywords
sequence
instruction
microinstruction
dependency
microprocessor
Application number
TW100142887A
Other languages
Chinese (zh)
Other versions
TWI512613B (en)
Inventor
Mohammad Abdallah
Original Assignee
Soft Machines Inc
Application filed by Soft Machines Inc
Priority to TW100142887A
Publication of TW201322120A
Application granted
Publication of TWI512613B


Abstract

A method for accelerating code optimization in a microprocessor. The method includes fetching an incoming macroinstruction sequence using an instruction fetch component and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. An optimization process is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The optimized microinstruction sequence is output to a microprocessor pipeline for execution. A copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

Description

Accelerated Code Optimizer for a Multi-Engine Microprocessor

[Cross-Reference to Related Applications]

This application is related to U.S. Patent Application No. 2010/0161948, entitled "APPARATUS AND METHOD FOR PROCESSING COMPLEX INSTRUCTION FORMATS IN A MULTITHREADED ARCHITECTURE SUPPORTING VARIOUS CONTEXT SWITCH MODES AND VIRTUALIZATION SCHEMES," filed by Mohammad A. Abdallah on January 5, 2010, the entire contents of which are incorporated herein by reference.

This application is related to U.S. Patent Application No. 2009/0113170, entitled "APPARATUS AND METHOD FOR PROCESSING AN INSTRUCTION MATRIX SPECIFYING PARALLEL AND DEPENDENT OPERATIONS," filed by Mohammad A. Abdallah on December 9, 2008, the entire contents of which are incorporated herein by reference.

This application is related to U.S. Patent Application No. 61/384,198, entitled "SINGLE CYCLE MULTI-BRANCH PREDICTION INCLUDING SHADOW CACHE FOR EARLY FAR BRANCH PREDICTION," filed by Mohammad A. Abdallah on September 17, 2010, the entire contents of which are incorporated herein by reference.

This application is related to U.S. Patent Application No. 61/467,944, entitled "EXECUTING INSTRUCTION SEQUENCE CODE BLOCKS BY USING VIRTUAL CORES INSTANTIATED BY PARTITIONABLE ENGINES," filed by Mohammad A. Abdallah on March 25, 2011, the entire contents of which are incorporated herein by reference.

The present invention relates generally to digital computer systems, and more particularly to systems and methods for selecting instructions comprising an instruction sequence.

Processors are required to handle multiple tasks that are either dependent or totally independent. The internal state of such processors usually consists of registers that might hold different values at each particular instant of program execution. At each instant of program execution, the internal state image is called the architectural state of the processor.

When code execution is switched to run another function (e.g., another thread, process, or program), the state of the machine/processor has to be saved so that the new function can utilize the internal registers to build its new state. Once the new function is terminated, its state can be discarded, and the state of the previous context is restored and execution resumes. Such a switch process is called a context switch and usually includes tens or hundreds of cycles, especially with modern architectures that employ large numbers of registers (e.g., 64, 128, 256) and/or out-of-order execution.

In thread-aware hardware architectures, it is normal for the hardware to support multiple context states for a limited number of hardware-supported threads. In this case, the hardware duplicates all architecture state elements for each supported thread. This eliminates the need for a context switch when executing a new thread. However, this still has multiple drawbacks, namely the area, power, and complexity of duplicating all architecture state elements (i.e., registers) for each additional thread supported in hardware. In addition, if the number of software threads exceeds the number of explicitly supported hardware threads, the context switch must still be performed.

This has become common, as parallelism is needed on a fine-granularity basis requiring a large number of threads. Hardware thread-aware architectures with duplicated context-state hardware storage do not help non-threaded software code, and only reduce the number of context switches for software that is threaded. However, those threads are usually constructed for coarse-grain parallelism, and result in heavy software overhead for initiating and synchronizing, leaving fine-grain parallelism, such as function calls and loop parallel execution, without efficient thread initiation/auto generation. Such described overheads also make automatic parallelization of such codes difficult, using state-of-the-art compiler or user parallelization techniques, for non-explicitly/easily parallelized/threaded software codes.

In one embodiment, the present invention is implemented as a method for accelerating code optimization in a microprocessor. The method includes fetching an incoming macroinstruction sequence using an instruction fetch component and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. An optimization process is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The plurality of dependent code groups are then output to a plurality of engines of the microprocessor for execution in parallel. A copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

The foregoing is a summary and thus necessarily contains simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.

Although the present invention has been described in connection with specific embodiments, the invention is not intended to be limited to the specific forms set forth herein. On the contrary, it is intended to cover such alternatives, modifications, and equivalents as can be reasonably included within the scope of the invention as defined by the appended claims.

In the following detailed description, numerous specific details such as specific method orders, structures, elements, and connections are set forth. It is to be understood, however, that these and other specific details need not be utilized to practice embodiments of the present invention. In other circumstances, well-known structures, elements, or connections have been omitted, or have not been described in particular detail, in order to avoid unnecessarily obscuring this description.

References within the specification to "one embodiment" or "an embodiment" are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearance of the phrase "in one embodiment" in various places within the specification is not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Moreover, various features are described which may be exhibited by some embodiments and not by others. Similarly, various requirements are described which may be requirements for some embodiments but not for other embodiments.

Some portions of the detailed descriptions that follow are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals of a computer-readable storage medium, and are capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as "processing," "accessing," "writing," "storing," "replicating," or the like refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories and other computer-readable media into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.

In one embodiment, the present invention is implemented as a method for accelerating code optimization in a microprocessor. The method includes fetching an incoming macroinstruction sequence using an instruction fetch component, and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. An optimization process is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The optimized microinstruction sequence is output to a microprocessor pipeline for execution. A copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

Figure 1 shows an overview diagram of an allocation/issue stage of a microprocessor 100 in accordance with one embodiment of the present invention. As illustrated in Figure 1, the microprocessor 100 includes a fetch component 101, a native decode component 102, an instruction scheduling and optimizer component 110, and the remaining pipeline 105 of the microprocessor.

In the Figure 1 embodiment, macroinstructions are fetched by the fetch component 101 and decoded into native microinstructions by the native decode component 102, which then provides the microinstructions to a microinstruction cache 121 and to the instruction scheduling and optimizer component 110. In one embodiment, the fetched macroinstructions comprise a sequence of instructions that is assembled by predicting certain branches.

The macroinstruction sequence is decoded by the native decode component 102 into a resulting microinstruction sequence. The microinstruction sequence is then transmitted through a multiplexer 103 to the instruction scheduling and optimizer component 110. The instruction scheduling and optimizer component functions by performing an optimization process, for example by reordering certain instructions of the microinstruction sequence for more efficient execution. This produces an optimized microinstruction sequence, which is then transferred via a multiplexer 104 to the remaining pipeline 105 (e.g., the allocation, dispatch, execution, and retirement stages, etc.). The optimized microinstruction sequence results in faster and more efficient execution of the instructions.

In one embodiment, the macroinstructions can be instructions from a high-level instruction set architecture, while the microinstructions are low-level machine instructions. In another embodiment, the macroinstructions can be guest instructions from a plurality of different instruction set architectures (e.g., CISC-like, x86, RISC-like, MIPS, SPARC, ARM, virtual-like, JAVA, and the like), while the microinstructions are low-level machine instructions or instructions of a different native instruction set architecture. Similarly, in one embodiment, the macroinstructions can be native instructions of an architecture, and the microinstructions can be native microinstructions of that same architecture that have been reordered and optimized, for example, X86 macroinstructions and X86 microcoded microinstructions.

In one embodiment, to accelerate the execution efficiency of frequently encountered code (e.g., hot code), copies of frequently encountered microinstruction sequences are cached in the microinstruction cache 121, and frequently encountered optimized microinstruction sequences are cached in a sequence cache 122. As code is fetched, decoded, optimized, and executed, certain optimized microinstruction sequences can be evicted or fetched, in accordance with the size of the sequence cache, via the depicted eviction and fill path 130. This eviction and fill path allows optimized microinstruction sequences to be transferred back and forth to the memory hierarchy of the microprocessor (e.g., L1 cache, L2 cache, a special cacheable memory range, or the like).

It should be noted that in one embodiment, the microinstruction cache 121 can be omitted. In such an embodiment, the acceleration of hot code is provided by the storage of optimized microinstruction sequences within the sequence cache 122. For example, the space saved by omitting the microinstruction cache 121 can be used to implement a larger sequence cache 122.

Figure 2 shows an overview diagram depicting an optimization process in accordance with one embodiment of the present invention. The left-hand side of Figure 2 shows an incoming microinstruction sequence, as received from, for example, the native decode component 102 or the microinstruction cache 121. Upon first receiving these instructions, they are not optimized.

One objective of the optimization process is to locate and identify instructions that depend upon one another, and move them into their respective dependency groups so that they can execute more efficiently. In one embodiment, groups of dependent instructions can be dispatched together so that they can execute more efficiently, since their respective sources and destinations are grouped together for locality. It should be noted that the optimization process can be used with both out-of-order processors and in-order processors. For example, within an in-order processor, instructions are dispatched in order. However, the instructions can be moved around so that dependent instructions are placed in respective groups, such that the groups can then execute independently, as described above.

For example, the incoming instructions include loads, operations, and stores. For example, instruction 1 comprises an operation that adds source registers (e.g., register 9 and register 5) and stores the result in register 5. Hence, register 5 is the destination, and register 9 and register 5 are the sources. In this manner, the sequence of 16 instructions includes destination registers and source registers, as shown.

The Figure 2 embodiment implements the reordering of instructions to create dependency groups, where instructions that belong to a group are dependent on one another. To accomplish this, an algorithm is executed that performs hazard checks with respect to the loads and stores of the 16 incoming instructions. For example, stores cannot move past earlier loads without dependency checks. Stores cannot pass earlier stores. Loads cannot pass earlier stores without dependency checks. Loads can pass loads. Instructions can pass prior path-predicted branches (e.g., dynamically constructed branches) by using a renaming technique. In the case of non-dynamically predicted branches, movement of instructions needs to consider the scopes of the branches. Each of the above rules can be implemented by adding virtual dependencies (e.g., by artificially adding virtual sources or destinations to instructions to enforce the rules).
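As a rough software restatement of these memory-hazard rules (a minimal sketch; the helper name and instruction-kind strings are assumptions, and the patent describes these checks being performed in hardware):

```python
# Minimal sketch of the load/store hazard rules listed above; the helper
# and its 'kind' values are illustrative, not the patent's hardware.

def may_hoist_over(mover: str, earlier: str) -> bool:
    """Return True if an instruction of kind `mover` ('load', 'store', or
    'alu') may be reordered above an earlier instruction of kind `earlier`
    without an explicit dependency check."""
    if mover == "store" and earlier in ("load", "store"):
        return False  # stores cannot pass earlier loads or stores
    if mover == "load" and earlier == "store":
        return False  # loads cannot pass earlier stores
    return True       # loads can pass loads; other cases are register-checked

# A store may not be hoisted above an earlier load, but a load may pass a load.
assert may_hoist_over("store", "load") is False
assert may_hoist_over("load", "load") is True
```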

Still referring to Figure 2, as described above, one objective of the optimization process is to locate dependent instructions and move them into common dependency groups. This process must be done in accordance with the hazard-checking algorithm. The optimization algorithm looks for instruction dependencies. The instruction dependencies further comprise true dependencies, output dependencies, and anti-dependencies.

The algorithm begins by first finding true dependencies. To identify true dependencies, each destination of the 16-instruction sequence is compared against the sources of the other instructions that occur later in the sequence. A subsequent instruction that is truly dependent on an earlier instruction is marked "_1" to signify its true dependency. This is shown in Figure 2 by the instruction numbers, proceeding from left to right across the 16-instruction sequence. For example, considering instruction number 4, the destination register R3 is compared against the sources of the subsequent instructions, and each matching subsequent source is marked "_1" to indicate that instruction's true dependency. In this case, instruction 6, instruction 7, instruction 11, and instruction 15 are marked "_1".

The algorithm then looks for output dependencies. To identify output dependencies, each destination is compared against the destinations of the other subsequent instructions. And, for each of the 16 instructions, each matching subsequent destination is marked "1_" (e.g., sometimes shown in red).

The algorithm then looks for anti-dependencies. To identify anti-dependencies, for each of the 16 instructions, each source is compared against the sources of the earlier instructions to identify matches. If a match occurs, the instruction under consideration marks itself "1_" (e.g., sometimes shown in red).

In this manner, the algorithm populates a dependency matrix having rows and columns for the sequence of 16 instructions. The dependency matrix comprises the marks that indicate the different dependency types for each of the 16 instructions. In one embodiment, the dependency matrix is populated in one cycle by using CAM matching hardware and appropriate broadcast logic. For example, destinations are broadcast downward through the remaining instructions to be compared against the sources of subsequent instructions (e.g., true dependencies) and the destinations of subsequent instructions (e.g., output dependencies), while destinations can be broadcast upward through the previous instructions to be compared against the sources of prior instructions (e.g., anti-dependencies).
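A small software model of this matrix population might look as follows (illustrative only: the patent's version is single-cycle CAM hardware, the (destination, sources) encoding is an assumption, and the anti-dependency check here uses the conventional write-after-read definition):

```python
# Illustrative model of the dependency-matrix population described above.
# Each instruction is a (destination, [sources]) pair of register names.
# matrix[j][i] marks how instruction j relates to earlier instruction i:
# "_1" for a true dependency, "1_" for a blocking (output/anti) dependency.

def build_dependency_matrix(instrs):
    n = len(instrs)
    matrix = [["" for _ in range(n)] for _ in range(n)]
    for i in range(n):
        dest_i, srcs_i = instrs[i]
        for j in range(i + 1, n):
            dest_j, srcs_j = instrs[j]
            if dest_i in srcs_j:
                matrix[j][i] = "_1"  # true: j reads what i wrote
            elif dest_i == dest_j:
                matrix[j][i] = "1_"  # output: j rewrites i's destination
            elif dest_j in srcs_i:
                matrix[j][i] = "1_"  # anti (write-after-read): j rewrites i's source
    return matrix

# The second instruction writes R3 and the third reads R3, so matrix[2][1] == "_1".
instrs = [("R5", ["R9", "R5"]), ("R3", ["R1", "R2"]), ("R7", ["R3", "R4"])]
print(build_dependency_matrix(instrs))
```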

The optimization algorithm uses the dependency matrix to choose which instructions will be moved together into common dependency groups. Instructions that are truly dependent upon one another are preferably moved to the same group. Register renaming is used to eliminate anti-dependencies, to allow those anti-dependent instructions to be moved. The moving is done in accordance with the rules and hazard checks described above. For example, stores cannot pass earlier loads without dependency checks. Stores cannot pass earlier stores. Loads cannot pass earlier stores without dependency checks. Loads can pass loads. Instructions can pass prior path-predicted branches (e.g., dynamically constructed branches) by using a renaming technique. In the case of non-dynamically predicted branches, movement of instructions needs to consider the scopes of the branches.

In one embodiment, a priority encoder can be implemented to determine which instructions get moved to be grouped with other instructions. The priority encoder functions in accordance with the information provided by the dependency matrix.

Figures 3 and 4 show a multistep optimization process in accordance with one embodiment of the present invention. In one embodiment, the optimization process is iterative: after instructions are moved in a first pass by moving along their dependency columns, the dependency matrix is populated again and examined again for new opportunities to move instructions. In one embodiment, this dependency-matrix population process is repeated three times. This is shown in Figure 4, which shows instructions that have been moved and then examined again, looking for opportunities to move other instructions. The sequence of numbers on the right-hand side of each of the 16 instructions shows the group in which the instruction was at the beginning of the process and the group in which the instruction is at the end of the process, with the intermediate groups in between. For example, Figure 4 shows how instruction 6 started in group 4 but was moved into group 1.

In this manner, Figures 2 through 4 illustrate the operation of an optimization algorithm in accordance with one embodiment of the present invention. It should be noted that although Figures 2 through 4 describe an allocation/issue stage, this functionality can also be implemented in a local scheduler/dispatch stage.

Figure 5 shows a flowchart of the steps of an exemplary hardware optimization process 500 in accordance with one embodiment of the present invention. As depicted in Figure 5, the flowchart shows the operating steps of an optimization process as implemented in an allocation/issue stage of a microprocessor in accordance with one embodiment of the present invention.

Process 500 begins in step 501, where an incoming macroinstruction sequence is fetched using an instruction fetch component (e.g., the fetch component 101 of Figure 1). As described above, the fetched instructions comprise a sequence that is assembled by predicting certain instruction branches.

In step 502, the fetched macroinstructions are transferred to a decoding component for decoding into microinstructions. The macroinstruction sequence is decoded into a microinstruction sequence in accordance with the branch predictions. In one embodiment, the microinstruction sequence is then stored in a microinstruction cache.

In step 503, an optimization process is then conducted on the microinstruction sequence by reordering the microinstructions comprising the sequence into dependency groups. The reordering is implemented by an instruction reordering component (e.g., the instruction scheduling and optimizer component 110). This process is described in Figures 2 through 4.

In step 504, the optimized microinstruction sequence is output to the microprocessor pipeline for execution. As described above, the optimized microinstruction sequence is forwarded to the rest of the machine for execution (e.g., the remaining pipeline 105).

Subsequently, in step 505, a copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on that sequence. In this manner, the sequence cache enables access to the optimized microinstruction sequences upon subsequent hits on those sequences, thereby accelerating hot code.
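An end-to-end outline of process 500 might read as follows (purely a sketch: the stage callables and the keying of the sequence cache by start address are assumptions standing in for hardware stages):

```python
# Hypothetical outline of process 500; each callable stands in for a
# hardware stage of Figure 1.

def process_500(fetch, decode, optimize, pipeline, sequence_cache, pc):
    macro_seq = fetch(pc)            # step 501: fetch the macroinstruction sequence
    micro_seq = decode(macro_seq)    # step 502: decode into microinstructions
    optimized = optimize(micro_seq)  # step 503: reorder into dependency groups
    pipeline.execute(optimized)      # step 504: output to the pipeline
    sequence_cache[pc] = optimized   # step 505: keep a copy for later hits
    return optimized

# On a later hit at the same pc, the stored sequence would be reused directly:
#   if pc in sequence_cache: pipeline.execute(sequence_cache[pc])
```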

Figure 6 shows a flowchart of the steps of another exemplary hardware optimization process 600 in accordance with one embodiment of the present invention. As depicted in Figure 6, the flowchart shows the operating steps of an optimization process as implemented in an allocation/issue stage of a microprocessor in accordance with an alternative embodiment of the present invention.

Process 600 begins in step 601, where an incoming macroinstruction sequence is fetched using an instruction fetch component (e.g., the fetch component 101 of Figure 1). As described above, the fetched instructions comprise a sequence that is assembled by predicting certain instruction branches.

In step 602, the fetched macroinstructions are transferred to a decoding component for decoding into microinstructions. The macroinstruction sequence is decoded into a microinstruction sequence in accordance with the branch predictions. In one embodiment, the microinstruction sequence is then stored in a microinstruction cache.

In step 603, the decoded microinstructions are stored in sequences in the microinstruction sequence cache. The sequences in the microinstruction cache are formed to begin in accordance with basic block boundaries. These sequences are not optimized at this point.

In step 604, an optimization process is then conducted on the microinstruction sequence by reordering the microinstructions comprising the sequence into dependency groups. The reordering is implemented by an instruction reordering component (e.g., the instruction scheduling and optimizer component 110). This process is described in Figures 2 through 4.

In step 605, the optimized microinstruction sequence is output to the microprocessor pipeline for execution. As described above, the optimized microinstruction sequence is forwarded to the rest of the machine for execution (e.g., the remaining pipeline 105).

Subsequently, in step 606, a copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on that sequence. In this manner, the sequence cache enables access to the optimized microinstruction sequences upon subsequent hits on those sequences, thereby accelerating hot code.

Figure 7 shows a diagram showing the operation of the CAM matching hardware and the priority encoding hardware of the allocation/issue stage in accordance with one embodiment of the present invention. As depicted in Figure 7, instruction destinations are broadcast into the CAM array from the left. Three exemplary instruction destinations are shown. The lighter-shaded CAMs (e.g., green) are for true dependency matching and output dependency matching, so the destinations are broadcast downward. The darker-shaded CAMs (e.g., blue) are for anti-dependency matching, so the destinations are broadcast upward. These matches populate the dependency matrix, as described above. A priority encoder is shown on the right, and it functions by scanning the rows of CAMs to find the first match, either "_1" or "1_". As described in the discussion of Figures 2 through 4 above, the process can be implemented iteratively. For example, if a "_1" is blocked by a "1_", that destination can be renamed and moved.

Figure 8 shows a diagram depicting optimized scheduling of instructions ahead of a branch in accordance with one embodiment of the present invention. As depicted in Figure 8, a hardware-optimized example is shown alongside a traditional just-in-time compiler example. The left-hand side of Figure 8 shows the original unoptimized code, including a branch biased untaken, "Branch C to L1". The middle column of Figure 8 shows a traditional just-in-time compiler optimization, where registers are renamed and instructions are moved ahead of the branch. In this example, the just-in-time compiler inserts compensation code to account for those occasions where the branch-bias decision is wrong (e.g., where the branch is actually taken as opposed to not taken). In contrast, the right-hand column of Figure 8 shows the hardware-unrolled optimization. In this case, the registers are renamed and instructions are moved ahead of the branch. However, it should be noted that no compensation code is inserted. The hardware keeps track of whether the branch-bias decision is true or not. In the case of a mispredicted branch, the hardware automatically rolls back its state in order to execute the correct instruction sequence. The hardware optimizer solution is able to avoid the use of compensation code because, in those cases where the branch is mispredicted, the hardware jumps to the original code in memory and executes the correct sequence from there, while flushing the mispredicted instruction sequence.

Figure 9 shows a diagram depicting optimized scheduling of a load ahead of a store in accordance with one embodiment of the present invention. As depicted in Figure 9, a hardware-optimized example is shown alongside a traditional just-in-time compiler example. The left-hand side of Figure 9 shows the original unoptimized code including a store, "R3<-LD[R5]". The middle column of Figure 9 shows a traditional just-in-time compiler optimization, where registers are renamed and the load is moved ahead of the store. In this example, the just-in-time compiler inserts compensation code to account for those occasions where the address of the load instruction aliases the address of the store instruction (e.g., where the movement of the load ahead of the store is not appropriate). In contrast, the right-hand column of Figure 9 shows the hardware-unrolled optimization. In this case, the registers are renamed and the load is also moved ahead of the store. However, it should be noted that no compensation code is inserted. In the case where moving the load ahead of the store was wrong, the hardware automatically rolls back its state in order to execute the correct instruction sequence. The hardware optimizer solution is able to avoid the use of compensation code because, in the case of an address-alias check branch misprediction, the hardware jumps to the original code in memory and executes the correct sequence from there, while flushing the mispredicted instruction sequence. In this case, the sequence assumes no aliasing. It should be noted that in one embodiment, the functionality diagrammed in Figure 9 can be implemented by the instruction scheduling and optimizer component 110 of Figure 1. Similarly, it should be noted that in one embodiment, the functionality diagrammed in Figure 9 can be implemented by the software optimizer 1000 described in Figure 10 below.

Additionally, with respect to dynamically unrolled sequences, it should be noted that instructions can pass prior path-predicted branches (e.g., dynamically constructed branches) by using renaming. In the case of non-dynamically predicted branches, movement of instructions should consider the scopes of the branches. Loops can be unrolled to the extent desired, and the optimizations can be applied across the whole sequence. For example, this can be implemented by renaming the destination registers of instructions moving across branches. One of the benefits of this feature is the fact that no compensation code or extensive analysis of the scopes of the branches is needed. This feature thus greatly speeds up and simplifies the optimization process.

Additional information concerning branch prediction and the assembling of instruction sequences can be found in U.S. Patent Application No. 61/384,198, entitled "SINGLE CYCLE MULTI-BRANCH PREDICTION INCLUDING SHADOW CACHE FOR EARLY FAR BRANCH PREDICTION," filed by Mohammad A. Abdallah on September 17, 2010, the entire contents of which are incorporated herein by reference.

Figure 10 shows a diagram of an exemplary software optimization process in accordance with one embodiment of the present invention. In the Figure 10 embodiment, the instruction scheduling and optimizer component (e.g., component 110 of Figure 1) is replaced by a software-based optimizer 1000.

In the Figure 10 embodiment, the software optimizer 1000 performs the optimization process that is otherwise performed by the hardware-based instruction scheduling and optimizer component 110. The software optimizer maintains copies of the optimized sequences in the memory hierarchy (e.g., L1, L2, system memory). This allows the software optimizer to maintain a much larger collection of optimized sequences than can be stored in the sequence cache.

It should be noted that the software optimizer 1000 can operate on code that resides in the memory hierarchy, both as input to the optimization process and as output from the optimization process.

It should be noted that in one embodiment, the microinstruction cache can be omitted. In such an embodiment, only the optimized microinstruction sequences are cached.

Figure 11 shows a flow diagram of a SIMD software-based optimization process in accordance with one embodiment of the present invention. The top of Figure 11 shows how the software-based optimizer examines each instruction of the input instruction sequence. Figure 11 shows how a SIMD comparison can be used to match one to many (e.g., a SIMD byte-compare of the first source "Src1" against all second-source bytes "Src2"). In one embodiment, Src1 contains the destination register of any instruction, and Src2 contains one source from each of the other sequence instructions. A match is done for each destination against all subsequent instruction sources (e.g., a true dependency check). This is a pairing match that indicates a desired group for the instruction. A match is done between each destination and each subsequent instruction destination (e.g., an output dependency check). This is a blocking match that can be resolved with renaming. A match is done between each destination and each prior instruction source (e.g., an anti-dependency check). This is a blocking match that can be resolved with renaming. The results are used to populate the rows and columns of the dependency matrix.
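The one-to-many match can be emulated in software with vector broadcasting, for example (a sketch using NumPy; the byte packing and the register numbers are assumptions, not the patent's encoding):

```python
import numpy as np

# Emulate the SIMD byte compare described above: one destination register
# number (Src1) is compared against the packed source-register bytes of all
# later instructions (Src2) in a single vectorized operation.
dest = np.uint8(3)  # e.g., an instruction whose destination is R3
later_sources = np.array([1, 3, 3, 7, 2, 3], dtype=np.uint8)  # one byte per later source
true_dep_mask = later_sources == dest  # one-to-many match in one compare
print(true_dep_mask)  # [False  True  True False False  True]
```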

Figure 12 shows a flowchart of an exemplary SIMD software-based optimization process 1200 in accordance with one embodiment of the present invention. Process 1200 is described in the context of the flowchart of Figure 9.

In step 1201, an input sequence of instructions is accessed by using a software-based optimizer instantiated in memory.

In step 1202, a dependency matrix is populated, using SIMD instructions, with dependency information extracted from the input sequence of instructions by using a sequence of SIMD comparison instructions.

In step 1203, the rows of the matrix are scanned from right to left for the first match (e.g., a dependency mark).

In step 1204, each of the first matches is analyzed to determine the type of the match.

In step 1205, if the first marked match is a blocking dependency, renaming is done for this destination.

In step 1206, all the first matches for each row of the matrix are identified, and the columns corresponding to those matches are moved into the given dependency group.

In step 1207, the scanning process is repeated several times to reorder the instructions comprising the input sequence, to produce an optimized output sequence.

In step 1208, the optimized instruction sequence is output to the execution pipeline of the microprocessor for execution.

In step 1209, the optimized output sequence is stored in a sequence cache for subsequent consumption (e.g., to accelerate hot code).
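Steps 1203 through 1206 might be modeled in software roughly as follows (a sketch over a matrix built as in the earlier model; the renaming step is abstracted away and the grouping policy is an assumption):

```python
# Illustrative single scan pass over the dependency matrix: each row is
# scanned right to left (from the nearest earlier instruction) for its first
# mark; a blocking mark ("1_") would trigger renaming, while a true mark
# ("_1") pulls the instruction into the group of the instruction it depends on.

def scan_and_group(matrix):
    group_of = {}
    next_group = 0
    for j, row in enumerate(matrix):
        first = next(((i, row[i]) for i in range(j - 1, -1, -1) if row[i]), None)
        if first is not None and first[1] == "_1":
            group_of[j] = group_of[first[0]]  # join the producer's group
        else:
            # either no dependency, or a blocking "1_" whose destination
            # would be renamed here before the next pass (step 1207 repeats)
            group_of[j] = next_group
            next_group += 1
    return group_of

# With the three-instruction matrix from the earlier example, the third
# instruction joins the second instruction's group; the first stands alone.
```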

It should be noted that the software optimization can be done serially using SIMD instructions. For example, the optimization can be implemented by processing one instruction at a time, scanning the sources and destinations of the instructions (e.g., from earlier instructions in a sequence to subsequent instructions). In accordance with the optimization algorithm and SIMD instructions described above, the software uses SIMD instructions to compare, in parallel, the current instruction's sources and destination with the prior instructions' sources and destinations (e.g., detecting true dependencies, output dependencies, and anti-dependencies).

Figure 13 shows a software-based dependency broadcast process in accordance with one embodiment of the present invention. The Figure 13 embodiment shows a flow diagram of an exemplary software scheduling process that processes groups of instructions without the expense of the full parallel hardware implementation described above. However, the Figure 13 embodiment can still use SIMD to process smaller groups of instructions in parallel.

The software scheduling process of Figure 13 proceeds as follows. First, the process initializes three registers. The process takes the instruction numbers and loads them into a first register. The process then takes the destination register numbers and loads them into a second register. The process then takes the values in the first register and broadcasts them, in accordance with the position numbers in the second register, to positions in a third, result register. The process overwrites from left to right, such that in those cases where broadcasts go to the same position in the result register, the value from the later instruction overwrites that from the earlier one. Positions in the third register that have not been written to are bypassed. This information is used to populate a dependency matrix.
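As a sketch of this broadcast step (illustrative names; each result slot ends up holding the number of the most recent instruction writing that register, consistent with the instruction 11 and instruction 15 example discussed below):

```python
# Software model of the broadcast step: `instr_nums` plays the first
# register (instruction numbers), `dest_regs` the second (destination
# register numbers), and `result` the third. Traversing left to right,
# each broadcast writes an instruction number into the result position
# named by its destination, so later writers of a position win; positions
# never written remain None (they are bypassed).

def broadcast_group(instr_nums, dest_regs, result):
    for num, dest in zip(instr_nums, dest_regs):
        result[dest] = num
    return result

result = [None] * 8
broadcast_group([1, 2, 3], [5, 2, 5], result)
print(result)  # position 5 holds 3: instruction 3's broadcast overwrote instruction 1's
```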

The Figure 13 embodiment also shows the manner in which the input sequence of instructions can be processed as a plurality of groups. For example, a 16-instruction input sequence can be processed as a first group of 8 instructions and a second group of 8 instructions. With the first group, the instruction numbers are loaded into the first register, the instruction destination numbers are loaded into the second register, and the values in the first register are broadcast to positions in the third register (e.g., the result register) in accordance with the position numbers in the second register (e.g., a group broadcast). Positions in the third register that have not been written to are bypassed. The third register now becomes a basis for processing the second group. For example, the result register from group one now becomes the result register for processing group two.

With the second group, the instruction numbers are loaded into the first register, the instruction destination numbers are loaded into the second register, and the values in the first register are broadcast to positions in the third register (e.g., the result register) in accordance with the position numbers in the second register. Positions in the third register can overwrite results that were written during the processing of the first group. Positions in the third register that have not been written to are bypassed. In this manner, the second group updates the basis from the first group, thereby producing a new basis for the processing of a third group, and so on.

Instructions in the second group can inherit the dependency information generated in the processing of the first group. It should be noted that the entire second group does not need to be processed in order to update the dependencies in the result register. For example, the dependencies for instruction 12 can be generated in the processing of the first group, followed by processing the instructions in the second group up to instruction 11. This updates the result register to a state up to instruction 12. In one embodiment, a mask can be used to prevent the updates of the remaining instructions of the second group (e.g., instructions 12 through 16). To determine the dependencies of instruction 12, the result register is examined for R2 and R5. R5 will have been updated with instruction 1, while R2 will have been updated with instruction 11. It should be noted that in the case where all of group 2 is processed, R2 would be updated with instruction 15.

Additionally, it should be noted that all the instructions of the second group (e.g., instructions 9 through 16) can be processed independently of one another. In that case, the instructions of the second group depend only on the result register of the first group. Once the result register is updated from the processing of the first group, the instructions of the second group can be processed in parallel. In this manner, groups of instructions can be processed in parallel, one group after another. In one embodiment, each group is processed using SIMD instructions (e.g., SIMD broadcast instructions), thereby processing all the instructions of each such group in parallel.
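Continuing the broadcast sketch above, the group chaining and the instruction-12 lookup might look like this (the destination numbers here are hypothetical, chosen only to reproduce the R2/R5 example from the text):

```python
# Group 1 (instructions 1..8) builds the initial result register; group 2 is
# then applied on top of it, masked so that only instructions 9..11 update
# the state before instruction 12's sources are looked up.

result = [None] * 8                  # shared result register (basis)
dests_g1 = [5, 2, 7, 3, 6, 0, 1, 4]  # hypothetical destinations of instrs 1..8
dests_g2 = [2, 3, 2]                 # hypothetical destinations of instrs 9..11
broadcast_group(range(1, 9), dests_g1, result)   # process group 1
broadcast_group(range(9, 12), dests_g2, result)  # masked group 2: stop after instr 11
# Instruction 12 reads R2 and R5: R5's producer is instruction 1 and R2's is
# instruction 11 (it would become 15 if all of group 2 were processed).
print(result[5], result[2])  # 1 11
```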

Figure 14 shows an exemplary flow diagram showing how dependency grouping of instructions can be used to build variably bounded groups of dependent instructions in accordance with one embodiment of the present invention. In the description of Figures 2 through 4, the group sizes were constrained, in those cases to three instructions per group. Figure 14 shows how instructions can be reordered into variably sized groups, which can then be allocated to a plurality of computing engines. For example, Figure 14 shows four engines. Since the groups can be variably sized in accordance with their characteristics, engine 1 can be allocated a larger group than, for example, engine 2. This can occur, for example, in a case where engine 2 has an instruction that is not particularly dependent on the other instructions within its group.

Figure 15 shows a flow diagram depicting hierarchical scheduling of instructions in accordance with one embodiment of the present invention. As described above, dependency grouping of instructions can be used to build variably bounded groups. Figure 15 shows the feature whereby various levels of dependencies exist within a dependency group. For example, instruction 1 does not depend on any other instruction within this instruction sequence, which makes instruction 1 an L0 dependency level. However, instruction 4 depends on instruction 1, which makes instruction 4 an L1 dependency level. In this manner, each instruction of an instruction sequence is assigned a dependency level, as shown.

The dependency level of each instruction is used by a second-level hierarchical scheduler to dispatch instructions in such a manner as to ensure that resources are available for the dependent instructions to execute. For example, in one embodiment, L0 instructions are loaded into the instruction queues that are processed by the second-level schedulers 1 through 4. The L0 instructions are loaded such that they are at the front of each queue, the L1 instructions are loaded such that they follow in each queue, the L2 instructions follow them, and so on. This is shown in Figure 15 by the dependency levels, from L0 to Ln. The hierarchical scheduling of schedulers 1 through 4 advantageously takes advantage of locality-in-time and instruction-to-instruction dependency to make scheduling decisions in an optimal manner.
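The dependency level described here amounts to an instruction's depth in its dependency chain, which might be computed like this (a sketch; the (destination, sources) encoding is an assumption):

```python
# Assign the L0/L1/L2... levels described above: an instruction with no
# producer inside the sequence is L0; otherwise its level is one more than
# the deepest producer it reads from.

def dependency_levels(instrs):
    latest_writer = {}  # register name -> index of its most recent producer
    levels = []
    for idx, (dest, srcs) in enumerate(instrs):
        producers = [latest_writer[s] for s in srcs if s in latest_writer]
        levels.append(1 + max(levels[p] for p in producers) if producers else 0)
        latest_writer[dest] = idx
    return levels

# The first instruction depends on nothing (L0); the fourth reads what the
# first wrote, so it is L1.
print(dependency_levels([("R1", []), ("R2", []), ("R3", ["R9"]), ("R4", ["R1"])]))
# -> [0, 0, 0, 1]
```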

In this manner, embodiments of the present invention provide dependency-group slot allocation for the instructions of an instruction sequence. For example, to implement an out-of-order microarchitecture, the dispatch of the instructions of an instruction sequence is out of order. In one embodiment, in every cycle, the readiness of instructions is checked. An instruction is ready if all the instructions it depends upon have dispatched previously. A scheduler structure functions by checking those dependencies. In one embodiment, the scheduler is a unified scheduler, and all the dependency checking is performed within the unified scheduler structure. In another embodiment, the scheduler functionality is distributed across the dispatch queues of the execution units of a plurality of engines. Hence, in one embodiment the scheduler is unified, while in another embodiment the scheduler is distributed. With both of these solutions, in every cycle, every instruction's sources are checked against the destinations of the dispatching instructions.

Thus, Figure 15 shows the hierarchical scheduling performed by embodiments of the present invention. As described above, instructions are first grouped to form dependency chains (e.g., dependency groups). The formation of these dependency chains can be done statically or dynamically, by software or by hardware. Once these dependency chains have been formed, they can be distributed/dispatched to an engine. In this manner, grouping by dependency allows out-of-order scheduling of groups that were formed in order. Grouping by dependency also distributes entire dependency groups across a plurality of engines (e.g., cores or threads). Grouping by dependency also facilitates hierarchical scheduling, as described above, where the dependent instructions are grouped in a first step and scheduled in a second step.

It should be noted that the functionality shown in Figures 14 through 19 operates independently of the manner in which the instructions were grouped (e.g., whether the grouping functionality is performed by hardware, software, etc.). Additionally, the dependency groups shown in Figures 14 through 19 can comprise a matrix of dependency groups, where each group further comprises dependent instructions. Furthermore, it should be noted that the schedulers can themselves be engines. In such an embodiment, each of the schedulers 1 through 4 can be incorporated within its respective engine (e.g., as shown in Figure 22, where each segment includes a common partition scheduler).

Figure 16 shows a flow diagram illustrating the hierarchical scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention. As described above, dependency groups of instructions can be used to build variably bounded groups. In this embodiment, the dependency groups comprise three slots. Figure 16 shows how the various levels of dependency are spread across the three-slot dependency groups. As described above, instruction 1 does not depend on any other instruction within this instruction sequence, which makes instruction 1 an L0 dependency level. Instruction 4, however, depends on instruction 1, which makes instruction 4 an L1 dependency level. In this manner, each instruction of the instruction sequence is assigned a dependency level, as shown.

As described above, the dependency level of each instruction is used by the second-level hierarchical schedulers to dispatch instructions in a manner that ensures resources are available for the dependent instructions to execute. The L0 instructions are loaded into the instruction queues processed by the second-level schedulers 1 through 4. The L0 instructions are loaded so that they are at the front of each queue, the L1 instructions follow them in each queue, the L2 instructions follow those, and so on, as shown by the dependency levels L0 through Ln of Figure 16. It should be noted that group number four (i.e., the fourth group from the top) begins at L2 even though it is an independent group. This is because instruction 7 depends on instruction 4, which in turn depends on instruction 1, thereby giving instruction 7 an L2 dependency level.
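
That transitive level can be checked mechanically, as in the sketch below; the explicit dependency table is an assumption made for the example.

```python
# Why the fourth group of Figure 16 starts at L2: instruction 7 depends
# on instruction 4, which in turn depends on instruction 1.
deps = {1: [], 4: [1], 7: [4]}

def level(i):
    return 0 if not deps[i] else 1 + max(level(d) for d in deps[i])

assert level(7) == 2   # so the group opening with instruction 7 starts at L2
```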

In this manner, Figure 16 shows how every three dependent instructions are scheduled together on a given one of the schedulers 1 through 4. The groups of the second level are scheduled after the groups of the first level, and the groups are then rotated across the schedulers.

Figure 17 shows a flow diagram illustrating the hierarchical moving-window scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention. In this embodiment, the hierarchical scheduling of the three-slot dependency groups is implemented via a unified moving-window scheduler. The moving-window scheduler processes the instructions in the queues and dispatches them in a manner that ensures resources are available for the dependent instructions to execute. As described above, the L0 instructions are loaded into the instruction queues processed by the second-level schedulers 1 through 4. The L0 instructions are loaded so that they are at the front of each queue, the L1 instructions follow them in each queue, the L2 instructions follow those, and so on, as shown by the dependency levels L0 through Ln of Figure 17. The moving window illustrates how L0 instructions can be dispatched from any of the queues, even though one queue may hold more of them than another. In this manner, the moving-window scheduler dispatches instructions as the queues flow from left to right, as shown in Figure 17.
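
A sketch of one dispatch cycle under a moving window follows; the window width of four and the (index, level) queue entries are assumptions of the example.

```python
from collections import deque

def moving_window_dispatch(queues, window=4):
    """Scan the queues left to right and dispatch ready (L0) head entries;
    a single queue may supply more than one of the window's slots."""
    dispatched = []
    for q in queues:
        while q and q[0][1] == 0 and len(dispatched) < window:
            dispatched.append(q.popleft()[0])
    return dispatched

queues = [deque([(0, 0), (3, 1)]), deque([(1, 0), (5, 0)]),
          deque([(2, 0)]), deque([(7, 2)])]
print(moving_window_dispatch(queues))  # [0, 1, 5, 2] - queue two supplied two slots
```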

Figure 18 shows how variably sized dependency chains of instructions (e.g., variably bounded groups) are allocated to a plurality of computing engines in accordance with one embodiment of the present invention.

As shown in Figure 18, the processor includes an instruction scheduler component 10 and a plurality of engines 11 through 14. The instruction scheduler component generates code blocks and inheritance vectors to support the execution of dependent code blocks (e.g., variably bounded groups) on their respective engines. Each dependent code block can belong to the same logical core/thread or to different logical cores/threads. The instruction scheduler component processes the dependent code blocks and their respective inheritance vectors. The dependent code blocks and respective inheritance vectors are allocated to the particular engines 11 through 14, as shown. The global interconnect 30 supports the necessary communication among the engines 11 through 14. It should be noted that the dependency grouping of instructions to build the variably bounded groups of dependent instructions (discussed above for Figure 14) is performed by the instruction scheduler component 10 of the Figure 18 embodiment.
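
Purely for orientation, the sketch below pairs each code block with a vector and hands the blocks to the engines; the `CodeBlock` shape, the register-to-producer map, and the round-robin policy are all assumptions, since this description does not define the inheritance-vector encoding or the allocation policy.

```python
from dataclasses import dataclass, field

@dataclass
class CodeBlock:
    instrs: list                                      # the block's dependent instructions
    inheritance: dict = field(default_factory=dict)   # hypothetical: register -> producing block

def allocate(blocks):
    """Hand dependent code blocks to engines 11 through 14 round robin
    (an illustrative policy, not one mandated by the description)."""
    engines = {e: [] for e in (11, 12, 13, 14)}
    for i, block in enumerate(blocks):
        engines[11 + i % 4].append(block)
    return engines

engines = allocate([CodeBlock([1, 4, 7]), CodeBlock([2, 5, 8])])
```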

Figure 19 shows a flow diagram illustrating block allocation to the scheduling queues and the hierarchical moving-window scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention. As described above, the hierarchical scheduling of three-slot dependency groups can be implemented via a unified moving-window scheduler. Figure 19 shows how the dependency groups become blocks that are loaded into the scheduling queues. In the Figure 19 embodiment, two dependency groups can be loaded into each queue as half blocks. This is shown at the top of Figure 19, where group 1 forms one half block and group 4 forms the other half block of the block loaded into the first scheduling queue.

As described above, the moving-window scheduler processes the instructions in the queues and dispatches them in a manner that ensures resources are available for the dependent instructions to execute. The bottom of Figure 19 shows how the L0 instructions are loaded into the instruction queues processed by the second-level schedulers.

Figure 20 shows how dependent code blocks (e.g., dependency groups or dependency chains) are executed on the engines 11 through 14 in accordance with one embodiment of the present invention. As described above, the instruction scheduler component generates code blocks and inheritance vectors to support the execution of dependent code blocks (e.g., variably bounded groups, three-slot groups, etc.) on their respective engines. As described for Figure 19, Figure 20 further shows how two independent groups can be loaded onto each engine as a code block. Figure 20 shows how these code blocks are dispatched to the engines 11 through 14, where the dependent instructions execute on the stacked (e.g., serially connected) execution units of each engine. For example, at the upper left of Figure 20, in the first dependency group or code block, the instructions are dispatched to engine 11, where they are stacked on the execution units in the order of their dependencies, such that L0 is stacked on top of L1 and L1 is stacked on top of L2. In this way, the results of L0 flow to the execution unit of L1, which in turn can flow to the execution of L2.
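
The result flow through a stack of serially connected execution units can be sketched as a simple fold; the callables below are placeholders for real functional units.

```python
def run_stacked(slots, initial=None):
    """slots: callables ordered L0, L1, L2, ...; each unit consumes the
    previous unit's result, mirroring the serially connected stack."""
    value = initial
    for unit in slots:
        value = unit(value)
    return value

# e.g. the L0 unit produces 5, L1 adds 2 to it, and L2 doubles the sum
result = run_stacked([lambda _: 5, lambda x: x + 2, lambda x: x * 2])
assert result == 14
```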

In this manner, the dependency groups shown in Figure 20 can comprise a matrix of independent groups, where each group further comprises dependent instructions. The benefit of the groups being independent is the ability to dispatch and execute them in parallel, which minimizes the need for communication across the interconnect between the engines. Additionally, it should be noted that the execution units shown in the engines 11 through 14 can comprise CPUs or GPUs.

In accordance with embodiments of the present invention, it should be appreciated that instructions are abstracted into dependency groups, blocks, or instruction matrices in accordance with their dependencies. Grouping instructions in accordance with their dependencies facilitates a more simplified scheduling process over a larger instruction window (e.g., a larger input sequence of instructions). The grouping described above removes instruction-level variation and abstracts it away uniformly, thereby allowing simple, homogeneous, and uniform scheduling decisions. The grouping functionality described above increases the throughput of the scheduler without increasing the scheduler's complexity. For example, in a scheduler for four engines, the scheduler can dispatch four groups, where each group has three instructions. In so doing, the scheduler handles only four lanes of superscalar complexity while dispatching twelve instructions. Furthermore, each block can contain parallel independent groups, which further increases the number of dispatched instructions.
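
The throughput claim reduces to a one-line computation, shown below; the doubling in the last line assumes two parallel independent groups per block, as with the half blocks of Figure 19.

```python
lanes, slots_per_group = 4, 3
per_cycle = lanes * slots_per_group              # 12 instructions dispatched per cycle
# with two parallel independent groups per block, dispatch width grows
# again without widening the scheduler's four matching lanes:
per_cycle_with_parallel_groups = per_cycle * 2   # 24
```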

Figure 21 shows an overview diagram of a plurality of engines and their components, including a global front-end fetch & scheduler and register files, a global interconnect, and a fragmented memory subsystem of a multicore processor, in accordance with one embodiment of the present invention. As shown in Figure 21, four memory fragments 101 through 104 are shown. The memory fragmentation scheme is the same at each cache level (e.g., the L1 cache, the L2 cache, and the load store buffers). Data can be exchanged between each of the L1 caches, each of the L2 caches, and each of the load store buffers via the memory global interconnect 110a.

The memory global interconnect comprises a routing matrix that allows a plurality of cores (e.g., the address calculation and execution units 121 through 124) to access data that may be stored at any point in the fragmented cache hierarchy (e.g., the L1 cache, the load store buffers, and the L2 cache). Figure 21 also depicts the manner in which each of the fragments 101 through 104 can be accessed by the address calculation and execution units 121 through 124 via the memory global interconnect 110a.
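
The routing decision itself can be as simple as steering on a few address bits, as in the sketch below; the cache-line size and the modulo selection are assumptions, since the description fixes only that address generation underlies fragment definition.

```python
def fragment_for(address, num_fragments=4):
    """Select which memory fragment (101 through 104) services an access,
    steering on low-order cache-line bits (an illustrative choice)."""
    line = address >> 6                   # assume 64-byte cache lines
    return 101 + (line % num_fragments)

# accesses to consecutive cache lines spread across fragments 101..104
print([fragment_for(a) for a in (0x000, 0x040, 0x080, 0x0C0)])
```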

The execution global interconnect 110b similarly comprises a routing matrix that allows the plurality of cores (e.g., the address calculation and execution units 121 through 124) to access data stored in any of the segmented register files. The cores thus have access, via the memory global interconnect 110a or the execution global interconnect 110b, to data stored in any fragment and to data stored in any segment.

Figure 21 further shows a global front-end fetch & scheduler, which has a view of the entire machine and which manages the utilization of the register file segments and the fragmented memory subsystem. Address generation comprises the basis for fragment definition. The global front-end fetch & scheduler functions by allocating instruction sequences to each segment.

Figure 22 shows a plurality of segments, a plurality of segment common partition schedulers, and the interconnect and the ports into the segments in accordance with one embodiment of the present invention. As shown in Figure 22, each segment is shown with a common partition scheduler. Each common partition scheduler functions by scheduling instructions within its respective segment. These instructions were, in turn, received from the global front-end fetch & scheduler. In this embodiment, the common partition schedulers are configured to function in cooperation with the global front-end fetch & scheduler. The segments are also shown with four read/write ports that provide read/write access to the operand/result buffers, the threaded register files, and the common partition schedulers.

In one embodiment, a non-centralized access process is implemented for using the interconnect, and the local interconnects use reservation adders and threshold limiters to control access to each contested resource, in this case the ports into each segment. In this embodiment, to access a resource, a core needs to reserve the necessary bus and to reserve the necessary port.
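
A sketch of the reservation-adder-plus-threshold-limiter idea follows; the class shape and the accept/reject protocol are assumptions made for the example.

```python
class PortLimiter:
    """One contested segment port: a reservation adder guarded by a
    threshold limiter. Requests succeed only while capacity remains."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def reserve(self, n=1):
        if self.count + n > self.threshold:   # threshold limiter rejects
            return False
        self.count += n                       # reservation adder accumulates
        return True

    def release(self, n=1):
        self.count -= n

# a core must win both the bus and the port before its access proceeds
bus, port = PortLimiter(threshold=2), PortLimiter(threshold=4)
granted = bus.reserve() and port.reserve()
```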

Figure 23 shows a diagram of an exemplary microprocessor pipeline 2200 in accordance with one embodiment of the present invention. The microprocessor pipeline 2200 includes a fetch module 2201, which implements the functionality of the process for identifying and fetching the instructions comprising an instruction sequence for execution, as described above. In the Figure 23 embodiment, the fetch module is followed by a decode module 2202, an allocation module 2203, a dispatch module 2204, an execution module 2205, and a retirement module 2206. It should be noted that the microprocessor pipeline 2200 is just one example of a pipeline that implements the functionality of the embodiments of the present invention described above. One skilled in the art would recognize that other microprocessor pipelines can be implemented that include the functionality of the decode module described above.

The foregoing description, for purposes of explanation, refers to specific embodiments; it is not intended to be exhaustive or to limit the invention. Many modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention and its various embodiments with the various modifications that may be suited to the particular use contemplated.

10‧‧‧instruction scheduler component
11‧‧‧engine
12‧‧‧engine
13‧‧‧engine
14‧‧‧engine
30‧‧‧global interconnect
100‧‧‧microprocessor
101‧‧‧fetch component
102‧‧‧native decode component
103‧‧‧multiplexer
104‧‧‧multiplexer
105‧‧‧remainder of the pipeline
110‧‧‧instruction scheduling and optimizer component
110a‧‧‧memory global interconnect
110b‧‧‧execution global interconnect
121‧‧‧microinstruction cache
122‧‧‧sequence cache
123‧‧‧address calculation and execution unit
124‧‧‧address calculation and execution unit
130‧‧‧eviction and fill path
1000‧‧‧software optimizer
2200‧‧‧microprocessor pipeline
2201‧‧‧fetch module
2202‧‧‧decode module
2203‧‧‧allocation module
2204‧‧‧dispatch module
2205‧‧‧execution module
2206‧‧‧retirement module

The present invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.

Figure 1 shows an overview diagram of the allocation/issue stage of a microprocessor in accordance with one embodiment of the present invention;
Figure 2 shows an overview diagram depicting an optimization process in accordance with one embodiment of the present invention;
Figure 3 shows a multistep optimization process in accordance with one embodiment of the present invention;
Figure 4 shows a multistep optimization and instruction moving process in accordance with one embodiment of the present invention;
Figure 5 shows a flowchart of the steps of an exemplary hardware optimization process in accordance with one embodiment of the present invention;
Figure 6 shows a flowchart of the steps of an alternative exemplary hardware optimization process in accordance with one embodiment of the present invention;
Figure 7 shows a diagram showing the operation of the CAM matching hardware and the priority encoding hardware of the allocation/issue stage in accordance with one embodiment of the present invention;
Figure 8 shows a diagram depicting optimized scheduling ahead of a branch in accordance with one embodiment of the present invention;
Figure 9 shows a diagram depicting optimized scheduling ahead of a store in accordance with one embodiment of the present invention;
Figure 10 shows a diagram of an exemplary software optimization process in accordance with one embodiment of the present invention;
Figure 11 shows a flowchart of a SIMD software-based optimization process in accordance with one embodiment of the present invention;
Figure 12 shows a flowchart of an exemplary SIMD software-based optimization process in accordance with one embodiment of the present invention;
Figure 13 shows a software-based dependency broadcast process in accordance with one embodiment of the present invention.

Figure 14 shows an exemplary flow diagram showing how dependency groups of instructions are used to build variably bounded groups of dependent instructions in accordance with one embodiment of the present invention;
Figure 15 shows a flow diagram illustrating the hierarchical scheduling of instructions in accordance with one embodiment of the present invention;
Figure 16 shows a flow diagram illustrating the hierarchical scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention;
Figure 17 shows a flow diagram illustrating the hierarchical moving-window scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention;
Figure 18 shows how variably sized dependency chains of instructions (e.g., variably bounded groups) are allocated to a plurality of computing engines in accordance with one embodiment of the present invention;
Figure 19 shows a flow diagram illustrating block allocation to the scheduling queues and the hierarchical moving-window scheduling of three-slot dependency group instructions in accordance with one embodiment of the present invention;
Figure 20 shows how dependent code blocks (e.g., dependency groups or dependency chains) are executed on the engines in accordance with one embodiment of the present invention;
Figure 21 shows an overview diagram of a plurality of engines and their components, including a global front-end fetch & scheduler and register files, a global interconnect, and a fragmented memory subsystem of a multicore processor, in accordance with one embodiment of the present invention;
Figure 22 shows a plurality of segments, the plurality of segment common partition schedulers, and the interconnect and the ports into the segments in accordance with one embodiment of the present invention; and
Figure 23 shows a diagram of an exemplary microprocessor pipeline in accordance with one embodiment of the present invention.


Claims (21)

1. A method for accelerating code optimization in a microprocessor, comprising: fetching an incoming macroinstruction sequence using an instruction fetch component; transferring the fetched macroinstructions to a decoding component for decoding into microinstructions; performing optimization processing by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups; outputting the plurality of dependent code groups to a plurality of engines of the microprocessor for execution in parallel; and storing a copy of the optimized microinstruction sequence into a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

2. The method of claim 1, wherein a copy of the decoded microinstructions is stored in a microinstruction cache.

3. The method of claim 1, wherein the optimization processing is performed using an allocation and issue stage of the microprocessor.

4. The method of claim 3, wherein the allocation and issue stage further comprises an instruction scheduling and optimizer component that reorders the microinstruction sequence into the optimized microinstruction sequence.

5. The method of claim 1, wherein the optimization processing further comprises dynamically unrolling the microinstruction sequence.

6. The method of claim 1, wherein the optimization processing is performed through a plurality of iterations.

7. The method of claim 1, wherein the optimization processing is performed through a register renaming process to enable the reordering.

8. A microprocessor, comprising: an instruction fetch component for fetching an incoming macroinstruction sequence; a decoding component coupled to the instruction fetch component to receive the fetched macroinstruction sequence and decode it into a microinstruction sequence; an allocation and issue stage coupled to the decoding component to receive the microinstruction sequence and perform optimization processing by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups; a plurality of engines of the microprocessor coupled to the allocation and issue stage to receive the plurality of dependent code groups for execution in parallel; and a sequence cache coupled to the allocation and issue stage to receive and store a copy of the optimized microinstruction sequence for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

9. The microprocessor of claim 8, wherein a copy of the decoded microinstructions is stored in a microinstruction cache.

10. The microprocessor of claim 8, wherein the optimization processing is performed using the allocation and issue stage of the microprocessor.

11. The microprocessor of claim 10, wherein the allocation and issue stage further comprises an instruction scheduling and optimizer component that reorders the microinstruction sequence into the optimized microinstruction sequence.

12. The microprocessor of claim 8, wherein the optimization processing further comprises dynamically unrolling the microinstruction sequence.

13. The microprocessor of claim 8, wherein the optimization processing is performed through a plurality of iterations.

14. The microprocessor of claim 8, wherein the optimization processing is performed through a register renaming process to enable the reordering.

15. A method for accelerating code optimization in a microprocessor, comprising: accessing an input microinstruction sequence by using a software-based optimizer instantiated in memory; using SIMD instructions to populate a dependency matrix with dependency information extracted from the input microinstruction sequence; scanning a plurality of rows of the dependency matrix to perform optimization processing by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups; outputting the plurality of dependent code groups to a plurality of engines of the microprocessor for execution in parallel; and storing a copy of the optimized microinstruction sequence into a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.

16. The method of claim 15, wherein the optimization processing further comprises scanning the plurality of rows of the dependency matrix to identify matching instructions.

17. The method of claim 16, wherein the optimization processing further comprises analyzing the matching instructions to determine whether they comprise a blocking dependency, and wherein renaming is performed to remove the blocking dependency.

18. The method of claim 17, wherein instructions of the first match corresponding to each row of the dependency matrix are moved into a corresponding dependency group.

19. The method of claim 15, wherein a copy of the optimized microinstruction sequence is stored in a memory hierarchy of the microprocessor.

20. The method of claim 19, wherein the memory hierarchy comprises an L1 cache and an L2 cache.

21. The method of claim 20, wherein the memory hierarchy further comprises system memory.
TW100142887A 2011-11-23 2011-11-23 An accelerated code optimizer for a multiengine microprocessor TWI512613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100142887A TWI512613B (en) 2011-11-23 2011-11-23 An accelerated code optimizer for a multiengine microprocessor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100142887A TWI512613B (en) 2011-11-23 2011-11-23 An accelerated code optimizer for a multiengine microprocessor

Publications (2)

Publication Number Publication Date
TW201322120A true TW201322120A (en) 2013-06-01
TWI512613B TWI512613B (en) 2015-12-11

Family

ID=49032389

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100142887A TWI512613B (en) 2011-11-23 2011-11-23 An accelerated code optimizer for a multiengine microprocessor

Country Status (1)

Country Link
TW (1) TWI512613B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6075942A (en) * 1998-05-04 2000-06-13 Sun Microsystems, Inc. Encoding machine-specific optimization in generic byte code by using local variables as pseudo-registers
JP4412905B2 (en) * 2003-01-28 2010-02-10 パナソニック株式会社 Low power operation control device and program optimization device
US7404105B2 (en) * 2004-08-16 2008-07-22 International Business Machines Corporation High availability multi-processor system
US8412981B2 (en) * 2006-12-29 2013-04-02 Intel Corporation Core sparing on multi-core platforms

Also Published As

Publication number Publication date
TWI512613B (en) 2015-12-11

Similar Documents

Publication Publication Date Title
US10521239B2 (en) Microprocessor accelerated code optimizer
EP2783282B1 (en) A microprocessor accelerated code optimizer and dependency reordering method
KR101842550B1 (en) An accelerated code optimizer for a multiengine microprocessor
CN109215728B (en) Memory circuit and method for distributed memory hazard detection and error recovery
CN108376097B (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
US10013391B1 (en) Architecture emulation in a parallel processing environment
KR101966712B1 (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
KR101638225B1 (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
CN108027798A (en) The processor with expansible instruction set architecture of resource is performed for dynamic configuration
CN108885551A (en) memory copy instruction, processor, method and system
TWI512613B (en) An accelerated code optimizer for a multiengine microprocessor
TWI610224B (en) A microprocessor accelerated code optimizer
TWI506548B (en) A microprocessor accelerated code optimizer and dependency reordering method
US20240086202A1 (en) Issuing a sequence of instructions including a condition-dependent instruction
WO2022153026A1 (en) Memory copy size determining instruction and data transfer instruction

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees