TW200935303A

TW200935303A - Strand-based computing hardware and dynamically optimizing strandware for a high performance microprocessor system

Info

Publication number: TW200935303A
Application number: TW97148039A
Authority: TW
Inventors: Matt T Yourst
Original assignee: Strandera Corp
Priority date: 2007-12-10
Filing date: 2008-12-10
Publication date: 2009-08-16
Also published as: WO2009076324A2; WO2009076324A3

Abstract

Strand-based computing hardware and dynamically optimizing strandware are included in a high performance microprocessor system. The system operates in real time, automatically and unobservably, to parallelize single-threaded software into a plurality of parallel strands for execution by cores implemented in a multi-core and/or multi-threaded microprocessor of the system. The microprocessor executes a native instruction set tailored for speculative multithreading. The strandware directs hardware of the microprocessor to collect dynamic profiling information while executing the single-threaded software. The strandware analyzes the profiling information for the parallelization, and uses binary translation and dynamic optimization to produce native instructions that are stored in a translation cache and later accessed for execution in place of some of the single-threaded software. The system is capable of parallelizing a plurality of single-threaded software applications (e.g., application software, device drivers, operating system routines or kernels, and hypervisors).

Description

IX. Description of the Invention:

[Technical Field of the Invention]

Advancements in computer processing are needed to provide improvements in performance, efficiency, and utility of use.

[Prior Art]

Unless expressly identified as being publicly or well known, mention herein of techniques and concepts, including for context, definitions, or comparison purposes, should not be construed as an admission that such techniques and concepts are previously publicly known or otherwise part of the prior art. All references cited herein (if any), including patents, patent applications, and publications, are hereby incorporated by reference in their entireties, whether specifically incorporated or not, for all purposes.

[Summary of the Invention]

The invention may be implemented in numerous ways, including as a process, an article of manufacture, an apparatus, a system, and a computer readable medium (e.g., media in an optical and/or magnetic mass storage device such as a disk, or an integrated circuit having non-volatile storage such as flash storage). In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. The Detailed Description provides an exposition of one or more embodiments of the invention that enable improvements in performance, efficiency, and utility of use in the field identified above. As is discussed in more detail in the Conclusions, the invention encompasses all possible modifications and variations within the scope of the issued claims.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A detailed description of one or more embodiments of the invention is provided below, along with accompanying figures illustrating selected details of the invention. The invention is described in connection with the embodiments. The embodiments herein are merely exemplary; the invention is expressly not limited to or by any or all of the embodiments herein, and the invention encompasses numerous alternatives, modifications, and equivalents. To avoid monotony in the exposition, a variety of word labels (including but not limited to: first, last, certain, various, further, other, particular, select, some, and notable) may be applied to separate sets of embodiments; as used herein such labels are expressly not meant to convey quality, or any form of preference or prejudice, but merely to conveniently distinguish among the separate sets. The order of some operations of the disclosed processes is alterable within the scope of the invention. Wherever multiple embodiments serve to describe variations in process, method, and/or program instruction features, other embodiments are contemplated that, in accordance with a predetermined or dynamically determined criterion, perform static and/or dynamic selection of one of a plurality of modes of operation corresponding respectively to a plurality of the multiple embodiments. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. The details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of the details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail, so that the invention is not unnecessarily obscured.

Overview

Terms

Various terms are used throughout the descriptions herein. Examples of at least some of the terms follow.
An example of a thread is a software abstraction of a processor, e.g., a dynamic sequence of instructions that share and execute upon the same architectural machine state (e.g., software-visible state). Some (so-called single-threaded) processors are enabled to execute one sequence of instructions on one architectural machine state at a time. Some (so-called multithreaded) processors are enabled to execute N sequences of instructions on N architectural machine states at a time. In some systems, an operating system creates, destroys, and schedules threads on available hardware resources.

An example of a strand is an abstraction of processor hardware, e.g., a dynamic sequence of uops (e.g., micro-operations directly executable by the processor hardware) that share and execute upon the same machine state. For some strands the machine state is architectural machine state (e.g., architectural register state), and for some strands the machine state is not visible to software (e.g., renamed register state or performance analysis registers). In some embodiments, a strand is visible to an operating system if the machine state of the strand includes all architectural machine state of a thread (e.g., general-purpose registers, software-accessible machine state registers, and memory state). In some embodiments, a strand is not visible to the operating system even if the machine state of the strand includes all architectural machine state of a thread.
An example of an architectural strand is a strand that is visible to the operating system and corresponds to a thread. An example of a speculative strand (e.g., a successor strand) is a strand that is not visible to the operating system. Some strands contain only hidden machine state (e.g., prefetch or profiling strands).

In some embodiments, strandware and/or processor hardware create, destroy, and schedule strands. In some embodiments, forks create strands. Some forks are in response to a uop (of a parent strand) that specifies a target address (for the strand created by the fork) and optionally specifies other information (e.g., data to be inherited as machine state). When the uop (of the parent strand) is executed, a speculative successor strand is optionally created. In various embodiments and/or usage scenarios, strands are destroyed in response to one or more of a kill uop, an unrecoverable error, and completion of the strand (e.g., via a join). In some embodiments and/or usage scenarios, strands are joined in response to a join uop. In some embodiments and/or usage scenarios, strands are joined in response to a set of hardware-detected conditions (e.g., a current execution address matching the starting address of a successor strand). In various embodiments, strands are destroyed by any combination of strandware and/or hardware (e.g., in response to a processor uop, or autonomously in response to predetermined or programmatically specified conditions). In some usage scenarios, strands are joined by merging some machine state of a parent architectural strand with the machine state of a successor strand of the parent; the parent is then destroyed and the child strand optionally becomes an architectural strand.

An example of a Virtual Central Processing Unit (VCPU) is a software-visible execution context onto which an operating system is enabled to schedule one thread at any particular time.
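The fork, kill, and join behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's mechanism: the names `Strand` and `StrandQueue` are hypothetical, each strand records only the registers it has written, strands of one thread are assumed to be kept in program order, killing a strand is assumed to also discard strands forked after it, and a join is assumed to let successor values override inherited parent values.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class Strand:
    """Hypothetical strand record: start address plus the registers it has written."""
    start_addr: int
    regs: dict = field(default_factory=dict)
    speculative: bool = True


class StrandQueue:
    """Strands of one thread in program order; the head is the architectural strand."""

    def __init__(self, entry_addr, regs):
        self.strands = deque([Strand(entry_addr, dict(regs), speculative=False)])

    def fork(self, target_addr):
        """Create a speculative successor strand that starts at target_addr."""
        successor = Strand(target_addr)
        self.strands.append(successor)
        return successor

    def kill(self, strand):
        """Destroy a speculative strand and (by assumption) every later strand."""
        if not strand.speculative:
            raise ValueError("the architectural strand cannot be killed")
        index = self.strands.index(strand)
        while len(self.strands) > index:
            self.strands.pop()

    def join(self):
        """Merge the architectural strand into its successor, then destroy the parent."""
        if len(self.strands) < 2:
            raise RuntimeError("no successor strand to join")
        parent = self.strands.popleft()
        successor = self.strands[0]
        # Assumption: values written by the successor override inherited parent values.
        successor.regs = {**parent.regs, **successor.regs}
        successor.speculative = False
        return successor
```

After a join the queue head is the former successor, now marked architectural, which mirrors the statement that the child strand optionally becomes an architectural strand once the parent is destroyed.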
In some embodiments, a computer system presents one or more VCPUs to the operating system. Each VCPU implements a register portion of the architectural machine state, and in some embodiments architectural memory state is shared among one or more VCPUs. Conceptually, each VCPU comprises one or more strands dynamically created by strandware and/or hardware. For each VCPU, the strands are arranged in a first-in first-out (FIFO) queue, where the next strand to commit is the architectural strand of the VCPU and all other strands are speculative.

Multi-Core, Multithreaded, and Speculative Microprocessors

Multi-Core and Multithreading

The performance of microprocessors has grown since the first microprocessors were introduced in the 1970s. Some microprocessors have deep pipelines and/or operate at multi-GHz clock frequencies to extract performance from sequential programs with a single processor. Software engineers write some programs as a sequence of instructions and operations that a processor is to execute sequentially and/or in order. Various microprocessors attempt to increase the performance of such programs by operating at an increased clock frequency, executing instructions out-of-order (OOO), executing instructions speculatively, or various combinations thereof. Some instructions are independent of other instructions, thus providing instruction level parallelism (ILP), and are therefore executable in parallel or OOO. Some microprocessors attempt to exploit ILP to improve performance and/or to increase utilization of functional units of the microprocessor.

Some microprocessors (sometimes referred to as multi-core microprocessors) have more than one core (e.g., processing unit). Some single-chip implementations have an entire multi-core microprocessor, in some instances with shared cache memory and/or other hardware shared by the cores. In some circumstances, an agent (e.g., strandware) partitions a computing task into threads, and some multi-core microprocessors achieve higher performance by executing the threads in parallel on the cores of the microprocessor. Some microprocessors (e.g., some multi-core microprocessors) have cores enabled for simultaneous multithreading (SMT).

Some microprocessors compatible with the x86 instruction set (e.g., some microprocessors from Intel and AMD) have relatively few copies of (relatively complex) OOO cores. Some microprocessors (e.g., some microprocessors from Sun and IBM) have relatively many copies of (relatively simple) in-order cores. Some server and multimedia applications are multithreaded, and some microprocessors with relatively many cores perform relatively well on multithreaded software.

Some multi-core microprocessors perform relatively well on software having relatively high thread level parallelism (TLP).
However, in some circumstances, some resources of some multi-core microprocessors remain unused even when executing software having relatively high TLP. Software engineers striving to improve TLP use mechanisms that coordinate access to shared data to avoid collisions and/or incorrect behavior, mechanisms that keep parallel execution smooth and efficient by reducing or avoiding interlocking between threads, and mechanisms that aid debugging of errors that appear in multithreaded implementations.

With respect to some problem domains, some compilers automatically recognize seemingly sequential operations of a thread as divisible into parallel threads of operations. Some sequences of operations are indeterminate with respect to independence and potential for parallel execution (e.g., portions of code produced from some general-purpose programming languages such as C, C++, and Java). Software engineers sometimes use special-purpose programming languages (or parallel extensions to general-purpose programming languages) to express parallelism explicitly, and/or to program multi-core and/or multithreaded microprocessors or portions thereof (such as a graphics processing unit or GPU). Software engineers sometimes express parallelism explicitly for some scientific, floating-point, and media processing applications.

Speculative Multithreading Basic Principles

In some usage scenarios and/or embodiments, speculative multithreading, thread level speculation, or both enable more efficient automatic parallelization.
In a speculative multithreading microprocessor system, compiler software, strandware, firmware, microcode, or hardware units of the microprocessor, or any combination thereof, conceptually insert one or more instances of a selected one of a plurality of types of fork instructions into various locations of a program. Conceptually, the system begins executing a (new) successor strand at a target address within the program, and handles propagation of register values (and optionally stores) from the (parent) strand (from which the successor strand was forked) to the successor strand. The propagation is via stalling the successor strand until the values arrive, or via predicting the values and later comparing the predicted values with the values generated by the parent strand. The system creates the successor strand as a subset of a thread (e.g., the strand receives a subset of the architectural state of the thread and/or a subset of the instructions of the successor thread). The fork instruction specifies the target address (e.g., an instruction pointer register value, or RIP). In various embodiments, the system implements strand management functions (e.g., forks and joins) via various hardware elements (e.g., logic units, finite state machines, microcode engines, and other circuits), various software elements (e.g., instructions executable by a core, firmware, microcode, strandware, and other software agents), or various combinations thereof.

A speculative multithreading microprocessor system processes join operations in (original) program order. Consider a parent strand that forks a successor strand to a target address. A join occurs when the parent strand executes up to the target address (sometimes referred to as the intersection point). In some circumstances, the successor strand has already completed (in parallel with the parent strand) and is immediately ready to join.
At the join point, the system performs various consistency checks, e.g., ensuring that the (possibly predicted) live-out register values that the parent strand propagated to the successor strand match the actual values of the parent strand at the join point. The checks guarantee that execution results with the forked strand are identical to results without the forked strand. If any of the checks fail, the system takes appropriate action (e.g., by discarding the results of the forked strand). After the parent and successor strands join, the parent strand terminates. The system then makes the context of the parent strand available for reuse. The successor strand becomes the architecturally visible instance of the thread for which the system created the strand. The system makes the current architectural state (e.g., registers and memory) of the successor strand observable by other threads within the microprocessor (e.g., threads on another core), other agents of the microprocessor (e.g., DMA), and devices outside the microprocessor.

Some speculative multithreading systems implement a nested strand model. For example, a parent strand P forks a primary successor strand S, and in a similar manner forks one or more sub-strands. The system nests the sub-strands within the parent strand. The sub-strands execute independently of S and of each other. P conditionally joins with S as soon as all sub-strands of P complete.

Other speculative multithreading systems implement a strictly program-ordered, non-nested speculative multithreading model. For example, each parent strand P has at most one outstanding forked successor strand S at any time. P does not fork again until P intersects S (resulting in a join) or S is no longer executing.
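The join-time consistency check on predicted register values can be illustrated as follows. This is a sketch under assumptions rather than the disclosed hardware: `mispredicted_live_ins` and `try_join` are hypothetical names, register state is modeled as a plain dictionary, and a failed check is handled by simply discarding the successor's results, which is only one of the "appropriate actions" the text allows.

```python
def mispredicted_live_ins(predicted, parent_regs):
    """Return registers whose predicted value differs from the parent's actual value."""
    return sorted(reg for reg, value in predicted.items()
                  if parent_regs.get(reg) != value)


def try_join(parent_regs, successor_regs, predicted):
    """Join if every predicted live-in was correct; otherwise discard the successor.

    parent_regs:    register values of the parent strand at the join point
    successor_regs: registers written by the successor strand
    predicted:      values the successor assumed for registers inherited from the parent
    Returns (architectural_registers, joined).
    """
    if mispredicted_live_ins(predicted, parent_regs):
        # Assumption: on a failed check the successor's results are thrown away.
        return dict(parent_regs), False
    return {**parent_regs, **successor_regs}, True
```

For instance, `try_join({"rax": 5}, {"rcx": 9}, {"rax": 5})` joins because the prediction for `rax` held, whereas the same call with a parent value of 6 for `rax` returns the parent state unchanged and reports that no join took place.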
In some circumstances, implementing a non-nested model uses less and/or simpler hardware than implementing a nested model. Some usage scenarios employing unmodified sequential programs are suited for use with non-nested model implementations.

Some speculative multithreading systems use memory versioning. For example, a successor strand that (speculatively) stores to a particular memory location uses a private version of the location that is observable by strands later in program order than the successor strand, but is not observable by other strands (that are earlier in program order than the successor strand). On joining the successor and parent strands, the system makes the speculative stores observable (atomically) to other agents. The other agents include strands other than the successor (and later) strands, other strands or units of the microprocessor (e.g., DMA), devices external to the microprocessor, and any element of the system enabled to access memory. In some circumstances, the system accumulates thousands of bytes of speculative store data before a join.

Consider a situation where a parent strand is to write a memory location and a successor strand of the parent strand is to read the same memory location. If the successor strand reads the memory location before the parent strand writes it, the system aborts the successor strand. The disclosure sometimes refers to this situation as cross-strand memory aliasing. In some scenarios, the system reduces (or avoids) occurrences of cross-strand memory aliasing by choosing fork points that result in less (or no) cross-strand memory aliasing.

Conceptually, the system arranges the strands belonging to a particular thread in a program-ordered queue, similar to the way a reorder buffer (ROB) holds individual instructions in an out-of-order processor. The system processes strand forks and joins in program order.
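The memory versioning and cross-strand aliasing rules described in this passage can be sketched as a private write buffer plus a read set. The class and method names below are hypothetical, memory is modeled as a dictionary from address to value with unwritten locations reading as 0, and the aliasing check is assumed to be invoked at the moment the parent performs its stores.

```python
class SpeculativeMemoryView:
    """Private memory versions for one speculative strand (illustrative only)."""

    def __init__(self, committed):
        self.committed = committed   # shared dict: address -> value
        self.private = {}            # speculative stores, hidden from earlier strands
        self.read_set = set()        # addresses this strand loaded from committed memory

    def load(self, addr):
        """Prefer the strand's own private version, else read committed memory."""
        if addr in self.private:
            return self.private[addr]
        self.read_set.add(addr)
        return self.committed.get(addr, 0)

    def store(self, addr, value):
        """Speculative stores go to the private version only."""
        self.private[addr] = value

    def aliases_with(self, parent_writes):
        """True if the parent writes an address this strand has already read."""
        return not self.read_set.isdisjoint(parent_writes)

    def commit(self):
        """At join time, publish all speculative stores in one step."""
        self.committed.update(self.private)
        self.private.clear()
```

If `aliases_with` returns true, the successor consumed a stale value and would be aborted; otherwise `commit` models the join making the buffered stores visible to other agents. A real implementation would make that publication atomic in hardware, which a dictionary update only approximates.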
The string at the head of the queue is a skeleton string, and is enabled to perform a unique string of joint operations, and the subsequent string is a speculative string. In some schemes, the string contains complex control flows independent of other strings (eg Branches, Calls, and Loops. In some circumstances, the string performs thousands of sums between establishment (at the point and point) and termination (at the joint point). In some cases, even if there are relatively few unprocessed strings, the string parallelism is relatively large. It can be used by thousands of instructions. Some (4) use speculation based on a variety of purposes (such as pre-fetching). Multi-lighting, while some systems use speculative multi-threading only for pre-fetching. For example, a particular string encounters a cache miss when executing a load instruction that results in a relatively slow access to the cache or the master. The system self-loading instruction divides the t-string ' and stops the specific string. The system continues to stop the specific contention while waiting for the returned data to be (missing) loaded. Unlike some other types of strings, missing loads are not blocked. It is compensated for the provision of predictions or dummy values. In various usage scenarios, pre-fetching string b is pre-fetched for 136758.doc 200935303 with a location independent of the initial miss loading; for pre-fetching, tuning or pre-processing with the processed linked list Correct the branch predictor; or any combination thereof. Spring loading makes compensation (for example, due to pre-fetching the use of predictions or virtual ^ and is not suitable for union to another string), the suspension of the missed loading in response to the pre-fetching _ ^ people (in some environments) via Speculative improvement in performance achieved by multi-threading

取決於分又與聯合點之特定選擇。在某些具體實施例中’ 系統將分又點放置於抻在丨M 徑最終到達之點)處：(例所有可能執行路 .違之點）處。例如，相對於迴圈之目前反覆過程 lteratlon)’系統分又在緊緊跟隨目前反覆過程之反覆過程處開始之一串，因而實現兩個串全部或部分平行執行。㈣另一範例（例如當迴圈之反覆過程係相互相依時），系充刀又_以執行跟隨迴圈結束之程式碼，實現迴圈之反覆過程在一串中執行’而迴圈後面之程式碼在另一串中執丁對於另一範例，系統分又一串以開始執行跟隨一自已 Φ 呼”函式之返回之程式碼(視需要地預測已呼叫函式回 Ά、，til a ™ 、已呼叫函式與跟隨返回之程式碼經由兩個一^部或部分平行執行。在各種具體實施例中，藉由下列或多者插入分又點：藉由編譯器及/或串體自動（視需要二至少部分基於逐行分析執行、分析動態程式行為或兩 )藉由硬體自動；及藉由程式設計者手動。彳夕執行緒化之各種具體實施例係自動的及/或不可 2 $些自動及/或不可觀察推測多執行緒化具體實施 1 '、σ應用於所有類型之目標軟體（例如應用程式軟體、 136758.doc -16- 200935303 器件驅動程式、作業系統常式或核心及管理程序）而無需任何程式設計者介入。（應注意，說明有時將目標軟體稱為目標碼，且目標碼係包含目標指令。）某些自動及/或不可觀察推測多執行緒化具體實施例係與產業標準指令集 •r (例如x86指令集）、產業標準程式設計工具或語言（例如c、 C++及其他語言）及產業標準通用電腦系統（例如伺服器、工作台、桌上型電腦及筆記型電腦）相容。系統架構 φ 具備串能力之電腦之系統圖1A解說關於具備串能力之電腦之系統，各具備串能力之電腦具有一或多個具備串能力之微處理器，具備串能力之微處理器可存取串體影像、記憶體、非揮發性儲存器、輸入/輸出器件及網路。概念上系統執行串體以觀察（經由硬體協助）及分析目標軟體（例如應用程式、驅動程式、作業系統及管理程序軟體）之（例如χ86)指令之動態執行。串 Φ 體使用觀察來決定如何將χ86指令分割成適於平行執行於 . *備串能力之微處理器之VLIW核資源上之複數個串。串體將已分割指令轉譯為操作（例如微操作或微運算碼），然 ’ 後將該等操作配置於束中用於在VLIW核資源上高效執〃 °串體將該等束儲存於轉譯快取記憶體中用於稍後使用 (例如作為-或乡㈣轉轉譯視需要地包括增加有不直接對應於X86指令之額外操作（例如用以改善效能或用以實現串之平行執行)。系統其後針對已儲存束(例如串影像而非X86指令之部分)之執行配置且執行已儲存束以嘗試 136758.doc 200935303 改善效能。在某些具體實施例中，觀察、分析、分割及針對已儲存束之配置的一或多個及已儲存束之執行係相對於指令之追縱（trace)。該圖式解說具備串能力之電腦2000.1至2000.2，其經啟用以用於經由耦合2063、2064及網路2009而彼此之通信。具備串能力之電腦2000.1經由耦合2050耦合至儲存器 2010，經由耦合2055耦合至鍵盤/顯示器2005且經由耦合 205 6耦合至周邊設備2006。該網路係實現具備串能力之電腦之間之通信的任何通信基礎架構，例如區域網路（LAN)、都會區域網路（MAN)、廣域網路（WAN)及網際網路之任何組合。耦合2063係與（例如）乙太網路（例如l〇Base-T、100Base-T及1或10十億位元）、光學網路（例如同步光學網路或SONET)或針對叢集之節點互連機制（例如無限頻帶（Infiniband)、MyriNet、 QsNET或刀鋒型伺服器背板網路）相容。儲存元件係任何非揮發性大容量儲存元件、陣列或其網路（例如快閃記憶體、磁碟或光碟，以及經由網路附接儲存器或NAS及/或儲存器陣列網路或SAN技術所耦合之元件）。耦合2050係與 (例如）乙太網路或光學網路、光纖通道、先進技術附接或 ΑΤΑ、串列ΑΤΑ或SATA、外部SATA或eSATA以及小型電腦系統介面或SCSI相容。鍵盤/顯示器元件係概念代表字母數字、圖形或其他人輸入/輸出器件之一或多者之任何類型（例如QWERTY鍵盤、光學滑鼠及平板顯示器之組合）。耦合2055係概念代 136758.doc -18- 200935303 表實現具備串能力之電腦與鍵盤/顯示器之間之通信的一或多個麵合。在-範例中，麵合2055之一元件係與通用串列匯流排（USB)相容且另一元件係與視訊圖形配接器 (VGA)連接器相容。周邊設備元件係概念代表可與具備串此力之電腦結合使用之一或多個輸入/輸出器件之任何類，型（例如掃描器或印表機）。辆合2056係概念代表實現具備串能力之電腦與周邊設備之間之通信的一或多個耦合。在各種具體實施例（未解說）中，解說為在具備串能力之 
t腦外部之各種元件（例如儲存器2 Q i Q、鍵盤/顯示器2 〇〇 $ 及周邊設備2006)係包含於具備串能力之電腦中。在某些具體實施例中，具備串能力之微處理器2〇〇1丨至⑼⑴2之一或多者包括用以致使能夠耦合至在功能上與解說為在具備串能力之電腦外部之元件之任何者相同或類似之元件的硬體。在各種具體實施例中，所包含之硬體係與一或多個特定協定相容，例如周邊組件互連（pci)匯流排、pci延伸 ❿ （PCI_X)匯流排、PCI特快（PCI-E)®流排、超傳輸（Ητ)匯流排及快速路徑互連（QP1)匯流排之一或多者。在各種具體 . 冑施例中，所包含之硬體係與一用以與（中間）晶片組通信 - 之專屬協定相容，啟用該（中間）晶片組以經由特定協定之任一個或多個通信。在某些具體實施例中，具備串能力之電腦係彼此相同，且在其他具體實施例中具備串能力之電腦依據與市場及/ 或消費者需要相關之差異而變化。在某些具體實施例中，具備串月匕力之電腦作為祠服器、工作台、桌上型電腦、筆 136758.doc 19 200935303 δ己型電腦、個人或攜帶式電腦運作。乂1^1\¥核2013.1及異動記憶體2〇14 如圖所解說，具備串能力之電腦2〇〇〇1包括兩個且備串能力之微處理器讓^⑴觀.2，其分_合至動態隨機存取記憶體（DRAM)元件細2. i至2術.2。具料能力之微處理器分別地經由麵合205 i. i至· 2與快閃記憶體2003 通信且經由耦合2Q53彼此通信。具㈣能力之微處理器 2001.1包括逐行分析單元2〇111、串管理單元2〇121、在某些具體實施例_，具備串能力之微處理器同’且在其他具體實施例中具備争能力之微處理器依據與市場及/或消費者需要相關之差異而變化。在各種具體實施例中，具備串能力之微處理器係實施於單一積體電路晶粒、複數個積體電路晶粒、多晶粒模組及複數個封裝電路之任何者中。為簡潔起見，以下說明係相對於所解說之具備串能力之微處理器之-者。其他具備串體能力具備串能力之微處理器之操作係類似的。具備串體能力之微處理器2〇〇ΐι退出重設狀態（例如當實行冷開機時）且開始自包含於快閃記憶體2003中之串體影像2004之程式碼部分提取且執行串體之指令。該等指令之執行初始化各種串體資料結構（例如，解說為DRAM 2002.1之部分的串體資料2〇〇21A與轉譯快取記憶體2〇02.1B)。初始化包括將串體影像之程式碼部分之所有或任何子集複製至串體資料之一部分，且針對堆積'堆疊及私有資料儲存設定串體資料之側邊區域。 136758.doc -20- 200935303 接著具備串能力之微處理器開始處理χ86指令（例如在某些具體實施例中包含於快閃記憶體中之χ86開機韌體），其經党上述觀察（至少部分經由逐行分析單元2〇111)與分 • #。該處理係進-步經受上述分割成串用於平行執行、轉譯為操作且配置於對應於各種串影像之束中以及儲存於轉澤快取圮憶體（例如轉譯快取記憶體2〇〇2,1Β中）。該處理係進一步經党上述後續針對已儲存束之配置及已儲存束之執行（至少部分經由串管理單元2012.1、VLIW核2013.1及異看動記憶體2014」）。圖式中所解說的元件之分割僅為解說性因為存在採用其他分割的其他具體實施例。例如，各種具體實施例在具備串能力之微處理器中包括快閃記憶體及/或DRAm之全部或任何部分。對於另一範例’各種具體實施例在具備串能力之微處理器中（例如在積體電路晶粒上之一或多個靜態隨機存取記憶體或SRAM中）包括用於串體資料及/或轉譯 φ 快取S己憶體之全部或任何部分之儲存器。對於另一範例，在某些具體實施例中，串體資料2〇〇21八與轉譯快取記憶體2002.1B係包含於不同DRAMt (例如一者在第一雙直列。己憶體模組或DIMM中且另一者在第二DIMM中）。對於另範例，各種具體實施例將串體影像之全部或任何部分儲存於儲存器2010上。大容量多執行緒硬體與串體圖1B與1C共同解說與具備串能力之微處理器（例如圖… 之具備串能力之微處理器之任一個）相關的 136758.doc 21 200935303 概心硬體、串體(軟體)及目標軟體層(例如子季絲式在本質上係概念性的，且㈣：如：系統)。該圖控制及某㈣_合。4_起見’該圖式省略各種 1二::'9。包括一或多個獨立核(例如vliw核⑼」至 _τ)及/:丁個體)，各核實現依據適於同時多執行緒化二?I内容脈絡切換的一或多個硬體執行緒内容脈絡(例如儲存於暫存器槽案賴.…⑷及,或串内容脈π 194Β.1至ι94Β.4之執行個體中）處理。微處理器經啟用以依據6令集架構執行指令。微處理器包括推測多執行緒延伸與增肖，例如用以實現分又與聯合指令及/或操作之處理之硬體、執行緒間及核間暫存器傳播邏輯及/或電路 
(多核互連網路195)、實現記憶體版本管理及衝突偵測能力之異動δ己憶體183、逐行分析硬體181及實現推測多執行緒化處理之其他硬體元件。在所解說之具體實施例中，微處理器亦包括一多層快取記憶體階層（例如Ll D快取記憶體 193.1至193.4及L2/L3快取記憶體196之執行個體）、至在微處理器外部之大容量記憶體及/或硬體器件之一或多個介面（耦合至外部系統/串體DRAm 184A之DRAM控制器與北橋197)、（例如）在具有複數個微處理器（各微處理器視需要地包括複數個核）之電腦中有用的一插座至插座系統互連 (多插座系統互連198)及至外部硬體器件之介面/耦合（用於經由外部PCI特快、QPI、超傳輸199耦合之晶片組/PCIe匯流排介面186)。串體層110A與110B(有時統稱為串體層no)與（χ86)目標 136758.doc -22- 200935303 軟體層101係至少部分藉由包含於微處理器中及/或耦合至微處理器之一或多個核（例如圖1CiVLIW核191>1至191 4 之執行個體之任何者）之全部或任何部分加以執行。串體層對於目標軟體層之元件而言係概念不可見，概念上"在” 目標軟體層”下面”及/或”在"與目標軟體層"相同級處"透明運作。目標軟體層包括作業系統核心102及解說為係"在” 作業系統核心"上方"執行之程式（解說為應用程式103.1至 103.4之執行個體在各種具體實施例及/或使用方案中，目標軟體層包括-管理複數個作業系統執行個體之管理程序程式（例如類似於VMware或Xen) 在各種具體實施例中，串體層實現以下能力之一或多者 .用以將一或多個虛擬CPU(例如vcpu 1〇41至1〇46之執行個體）及關聯虛擬器件174呈現至目標軟體的微處理器硬體之虛擬化。VCPU看似執行—其中編碼目標軟趙層之目標指令集。將VCPU動態映射至微處理器之原生核（例如致使能夠執行原生指令集之儿_191」至 191.4之執行個體）及串内容脈絡（其保存⑽如）在暫存器樓案叫幻至歸.4及/或串内容脈絡194Βι^94Ββ 一或多個執行個體中）上。析丁 /軟體時目標軟體之工具裝備、逐行分析及分 Ζ行绪^部分用以識別將（循序）指令串流分成推測多的個別循成：例如，系統將由啊之―或多個所執行序指令串流分割成多個推測多執行緒之串。 136758.doc -23- 200935303 •基於分析將指令及/或程式碼序列插入至目標軟體中，以調用微處理器之各種推測多執行緒硬體單元以分又及聯合串、以預測及/或傳播駐内（live in)值至串、以管理記憶體版本及串間之衝突以及以分又預提取串。 •用以使推測多執行緒化效能加速的目標軟體之最佳化， * 例如重新排程指令以時間更早地產生關鍵串駐内值、延期及/或重新排序抑制平行性之操作以破壞或消除跨越串相依性及移除記憶體混淆、及移除預提取串内之冗餘 ❹ 操作。、 •維護已修改、已工具裝備及/或已最佳化程式碼之儲存庫（例如經由轉譯快取記憶體管理〗丨υ以便儲存庫中之程式碼對於目標碼而言係不可見且係可用以由串體調用而代替原始目標碼（例如加以修改、工具裝備或最佳化之前之目標碼之一部分）。 •處理任何内冑異常或錯豸’其係修?文、工具裝備及最佳藝化（例如推測多執行緒化）之任何者之結果，不然在執行目標軟體時不會出現。在某些環境中，内部異常或錯誤之處理包括重新最佳化及/或停用降低效能之最佳化。 •將一可選機制提供至目標碼用於為串體提供隱示，例如可能有益分又點、同步點、可能跨越串混淆點及其他最佳化資訊。二進制轉譯與動態最佳化在某些具體實施例中，微處理器硬體經啟用以執行—與目標軟體之指令集不同的内部指令[在各種具體實施例 136758.doc -24- 200935303 、’需要地與—或多個硬體加速機制之任何組合協力合作 : 實行動態一進制轉譯（例如經由χ86二進制轉譯1丨$) 以：：或多個目標指令集(例如χ86相容指令集，例如χ86_ . 
64♦"集）之目標*體轉譯成原生微操作（微運算碼）。軟體加速機制包括逐行分析硬體1 8 1、硬體加速單元i 82、異動 .^憶體183及硬體咖解媽器187之—或多個之全部或任何 #刀微處理器硬體（例如VLIW核191.1至191.4之執行個體）經啟用以直接執行微運算碼（且在各種具體實施例中，微處理器硬體未經啟用以直接執行目標指令集之一或多個之扣令）。至少在某些環境下，接著將轉譯儲存於儲存庫中（例如經由轉譯快取記憶體管理111)用於快速再呼叫及再使用（例如作為_影像），從而消除再次轉譯。在各種具體實施例中，微處理器經啟用以存取（例如藉由耦合或附接至）相對較大記憶體區。系統經由專用DRAM 模組（其在各種具體實施例中包含在微處理器中或在微處 Φ 理器外部）或替代地作為對於目標碼而言係不可見的外部 • 系統/串體DRAM 184A中之一已儲備區之部分實施該記憶體區。該記憶體區提供針對串體之各種元件（例如程式 . 碼、堆疊、堆積及資料之一或多個）之儲存器且在某些具體實施例中乂供轉譯快取記憶體（例如由轉譯快取記憶體 B理1 11所管理者）之全部或任何部分以及視需要地提供一或多個緩衝器（例如推測多執行緒化暫時狀態緩衝器）。當微處理器最初開機（例如藉由實行冷開機）時，將串體程式碼自快閃ROM複製至記憶體區中（例如至專用DRAM模組 136758.doc -25- 200935303 或外部系統/串體DRAM說之已儲備部分中），微處理器接著自該記憶體區提取原生微運算碼。串體初始化微處理 W例如經由硬體控制172)及串體之内部資料結構之後，串 .體開始使用二進制轉譯（例如經由χ86二進制轉譯115)執行 ’ 開機勒體及’或作業系統核心開機程式碼(編碼於目標指令集之-或多個中），其類似於無二進制轉譯層的習知以硬體為基礎的微處理器。在某些使时案中，與新增推測多執行緒化指令至目標指令集相比使用串體來實行二進制轉譯及/或動態最佳化提供優點。在某些環境中，二進制轉譯及/或動態最佳化 (例如）藉由移除及/或減少用於解碼目標指令集之硬體（例如硬體祕解碼器187)及用於無序執行之硬體實現簡化各祆之硬冑纟某些具體實施例中’概念上採用—或多個 VLIW(超長指令字）微處理器核（例如vli· 191」至⑼* 之執行個體）來取代已移除及/或已減少硬體。vuw核心 φ (例如)執行預排程微運算碼之束，其中-束之所有微運算 . 碼（例如在複數個功能單元（例如ALU 192幻至192尤4及 FPU 192B. 1至192B.4之執行個體）上）平行執行（或開始執 . 
depends on the specific choice of fork points and join points. In some embodiments, the system places fork points at points that all possible execution paths eventually reach. For example, relative to a current iteration of a loop, the system forks a strand that begins at an iteration following the current iteration, thus enabling the two strands to execute wholly or partially in parallel. For another example (such as when the iterations of a loop are interdependent), a fork is made to the code that follows the end of the loop, so that the iterations of the loop execute in one strand while the code following the loop executes in another strand. For another example, the system forks another strand to begin executing the code that follows the return from a called function (optionally with prediction relating to the called function), enabling the called function and the code following the return to execute wholly or partially in parallel.
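Fork-point candidates that every execution path eventually reaches can be found, for instance, by computing post-dominators over a control-flow graph. The sketch below is illustrative only: the graph encoding, the function name `post_dominators`, and the example loop are assumptions and are not taken from the disclosure.

```python
def post_dominators(cfg, exit_node):
    """Return {node: set of nodes that every path from node to exit passes through}."""
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    # Start from "everything post-dominates everything" and shrink to a fixed point.
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succs = cfg.get(n, [])
            if not succs:
                continue
            new = {n} | set.intersection(*(pdom[s] for s in succs))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


# Hypothetical control-flow graph: a loop whose body branches and re-converges.
cfg = {
    "entry": ["head"],
    "head": ["body", "after_loop"],
    "body": ["then", "else"],
    "then": ["latch"],
    "else": ["latch"],
    "latch": ["head"],
    "after_loop": ["exit"],
}
pdom = post_dominators(cfg, "exit")
# Points reached by every path that starts in the loop body.
candidates = sorted(pdom["body"] - {"body"})
```

In this toy graph the candidates reachable from the loop body include the start of the next iteration (`latch`, `head`) and the code after the loop (`after_loop`), which mirrors the loop examples in the text. A real system would also weigh profiling data before choosing among the candidates.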
In various embodiments, fork points are inserted by one or more of: a compiler and/or the strandware automatically (optionally based at least in part on profiling of execution, analysis of dynamic program behavior, or both); hardware automatically; and programmers manually.

Various embodiments of the system implement speculative multithreading that is automatic and/or unobservable. Some automatic and/or unobservable speculative multithreading embodiments are applicable to all types of target software (such as application software, device drivers, operating system routines or kernels, and hypervisors) without any programmer intervention. (Note that target software is sometimes called target code, and target code comprises target instructions.) Some automatic and/or unobservable speculative multithreading embodiments are compatible with industry-standard instruction sets (such as the x86 instruction set), industry-standard programming tools or languages (such as C, C++, and other languages), and industry-standard general-purpose computer systems (such as servers, workstations, desktop computers, and notebook computers).

System Architecture

System with Strand-Enabled Computers

Figure 1A illustrates a system having strand-enabled computers. Each strand-enabled computer has one or more strand-enabled microprocessors. A strand-enabled microprocessor has access to a strandware image, memory, non-volatile storage, input/output devices, and a network. Conceptually, the system executes the strandware to observe (via hardware assistance) and analyze the dynamic execution of (e.g. x86) instructions of target software (such as application, driver, operating system, and hypervisor software). The strandware determines how to partition the x86 instructions into a plurality of strands suitable for parallel execution on the VLIW core resources of the strand-enabled microprocessor.
The strandware translates the partitioned instructions into operations (e.g. micro-operations, or uops) and then arranges the operations in bundles for efficient execution on the VLIW core resources. The strandware stores the bundles in a translation cache for later use (e.g. as one or more strand images). The translation optionally includes additional operations that do not correspond directly to the x86 instructions (e.g. to improve performance or to enable parallel execution of the strands). The system subsequently arranges for execution of, and executes, the stored bundles (e.g. a strand image rather than a portion of the x86 instructions) to attempt to improve performance. In some embodiments, one or more of the observing, the analyzing, the partitioning, the arranging for execution of the stored bundles, and the executing of the stored bundles is performed with respect to traces of instructions.

The figure illustrates strand-enabled computers 2000.1 to 2000.2, enabled for communication with each other via couplings 2063 and 2064 and network 2009. Strand-enabled computer 2000.1 is coupled to storage 2010 via coupling 2050, to keyboard/display 2005 via coupling 2055, and to peripherals 2006 via coupling 2056. The network is any communication infrastructure that enables communication between the strand-enabled computers, such as any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), and the Internet. Coupling 2063 is compatible with, for example, Ethernet (such as 10Base-T, 100Base-T, and 1 or 10 Gigabit), optical networking (such as Synchronous Optical NETworking, or SONET), or a node interconnect mechanism for a cluster (such as Infiniband, MyriNet, QsNET, or a blade server backplane network).
The storage element is any non-volatile mass-storage element, array, or network thereof (such as flash memory, magnetic or optical disks, and elements coupled via network attached storage, or NAS, and/or storage area network, or SAN, techniques). Coupling 2050 is compatible with, for example, Ethernet or optical networking, Fibre Channel, Advanced Technology Attachment (ATA), Serial ATA (SATA), external SATA (eSATA), and Small Computer System Interface (SCSI). The keyboard/display element conceptually represents any type of one or more alphanumeric, graphical, or other human input/output devices (such as a combination of a QWERTY keyboard, an optical mouse, and a flat-panel display). Coupling 2055 conceptually represents one or more couplings enabling communication between the strand-enabled computer and the keyboard/display. In one example, one element of coupling 2055 is compatible with Universal Serial Bus (USB) and another element is compatible with a Video Graphics Adapter (VGA) connector. The peripherals element conceptually represents any type of one or more input/output devices usable in conjunction with the strand-enabled computer (such as a scanner or a printer). Coupling 2056 conceptually represents one or more couplings enabling communication between the strand-enabled computer and the peripherals.

In various embodiments (not illustrated), various elements illustrated as external to the strand-enabled computer (such as storage 2010, keyboard/display 2005, and peripherals 2006) are included in the strand-enabled computer. In some embodiments, one or more of strand-enabled microprocessors 2001.1 to 2001.2 include hardware enabling coupling to elements identical or similar in function to the elements illustrated as external to the strand-enabled computer.
In various embodiments, the included hardware is compatible with one or more particular protocols, such as one or more of a Peripheral Component Interconnect (PCI) bus, a PCI eXtended (PCI-X) bus, a PCI Express (PCI-E) bus, a HyperTransport (HT) bus, and a QuickPath Interconnect (QPI) bus. In various embodiments, the included hardware is compatible with a proprietary protocol used to communicate with an (intermediate) chipset that is enabled to communicate via any one or more of the particular protocols.

In some embodiments, the strand-enabled computers are identical to each other, and in other embodiments the strand-enabled computers vary according to market and/or customer requirements. In some embodiments, the strand-enabled computers operate as servers, workstations, desktop computers, notebook computers, and personal or portable computers.

As illustrated in the figure, strand-enabled computer 2000.1 includes two strand-enabled microprocessors 2001.1 to 2001.2, coupled respectively to dynamic random access memory (DRAM) elements 2002.1 to 2002.2. The strand-enabled microprocessors are in communication with flash memory 2003 via couplings 2051.1 to 2051.2, respectively, and with each other via coupling 2053. Strand-enabled microprocessor 2001.1 includes profiling unit 2011.1, strand management unit 2012.1, VLIW cores 2013.1, and transactional memory 2014. In some embodiments, the strand-enabled microprocessors are identical to each other, and in other embodiments the strand-enabled microprocessors vary according to market and/or customer requirements. In various embodiments, a strand-enabled microprocessor is implemented in any of a single integrated circuit die, a plurality of integrated circuit dies, a multi-die module, and a plurality of packaged circuits.
For brevity, the following description is with respect to strand-enabled microprocessor 2001.1 as illustrated; the other strand-enabled microprocessors and strand-enabled computers are similar. Strand-enabled microprocessor 2001.1 exits a reset state (such as when performing a cold boot) and begins fetching and executing instructions of the strandware from a code portion of strandware image 2004 contained in flash memory 2003. Execution of the instructions initializes various strandware data structures (such as strandware data 2002.1A and translation cache 2002.1B, illustrated as portions of DRAM 2002.1). The initialization includes copying all or any subset of the code portion of the strandware image to a portion of the strandware data, and setting aside regions of the strandware data for strandware heap, stack, and private data storage.

The strand-enabled microprocessor then begins processing x86 instructions (such as x86 boot firmware contained, in some embodiments, in the flash memory), subject to the aforementioned observing (at least in part via profiling unit 2011.1) and analyzing. The processing is further subject to the aforementioned partitioning into strands for parallel execution, translating into operations, arranging into bundles corresponding to various strand images, and storing in the translation cache (such as translation cache 2002.1B). The processing is further subject to the aforementioned arranging for execution of, and executing of, the stored bundles (at least in part via strand management unit 2012.1, VLIW cores 2013.1, and transactional memory 2014).

The partitioning of elements illustrated in the figure is illustrative only, as other embodiments use other partitionings. For example, various embodiments include all or any portion of the flash memory and/or the DRAM in a strand-enabled microprocessor.
For another example, various embodiments include, in a strand-enabled microprocessor (such as in one or more static random access memories, or SRAMs, on an integrated circuit die), storage for all or any portion of the strandware data and/or the translation cache. For another example, in some embodiments, strandware data 2002.1A and translation cache 2002.1B are contained in different DRAMs (such as one in a first dual in-line memory module, or DIMM, and the other in a second DIMM). For another example, various embodiments store all or any portion of the strandware image on storage 2010.

Massively Multithreaded Hardware and Strandware

Figures 1B and 1C illustrate hardware, strandware (software), and target software layers (e.g. subsystems) relating to a strand-enabled microprocessor (such as any of the strand-enabled microprocessors). The figures are conceptual in nature and omit various control and other details. The microprocessor includes one or more independent cores (such as instances of VLIW cores 191.1 to 191.4), each core supporting one or more hardware thread contexts via multithreading and/or context switching (with context held, for example, in instances of register files 194A.1 to 194A.4 and/or strand contexts 194B.1 to 194B.4). The microprocessor is enabled to execute instructions in accordance with an instruction set architecture.
The microprocessor includes speculative multithreading extensions and enhancements, such as hardware to implement fork and join instructions and/or operations, inter-thread and inter-core register propagation logic and/or circuitry (multi-core interconnect network 195), transactional memory 183 implementing memory versioning and conflict detection capabilities, profiling hardware 181, and other hardware elements that implement speculative multithreading. In the illustrated embodiment, the microprocessor also includes a multi-level cache hierarchy (such as instances of L1 D-caches 193.1 to 193.4 and L2/L3 caches 196), one or more interfaces to mass memory and/or hardware devices external to the microprocessor (DRAM controller and northbridge 197, coupled to external system/strandware DRAM 184A), a socket-to-socket system interconnect (multi-socket system interconnect 198) useful, for example, in a computer having a plurality of microprocessors (each microprocessor optionally including a plurality of cores), and interfaces/couplings to external hardware devices (chipset/PCIe bus interface 186, for coupling via external PCI Express, QPI, HyperTransport 199).

Strandware layers 110A and 110B (sometimes referred to collectively as strandware layer 110) and (x86) target software layer 101 are executed at least in part by all or any portion of one or more cores included in and/or coupled to the microprocessor (such as any of the instances of VLIW cores 191.1 to 191.4 of Figure 1C). The strandware layer is conceptually invisible to elements of the target software layer, operating transparently and conceptually "underneath" and/or "at the same level as" the target software layer.
The target software layer includes operating system kernel 102 and programs illustrated as executing "above" the operating system kernel (illustrated as instances of applications 103.1 to 103.4). In various embodiments and/or usage scenarios, the target software layer includes a hypervisor program (e.g. similar to VMware or Xen) that manages a plurality of operating system instances.

In various embodiments, the strandware layer enables one or more of the following capabilities:

• Virtualization of the microprocessor hardware, to present one or more virtual CPUs (such as instances of VCPUs 104.1 to 104.6) and associated virtual devices 174 to the target software. The VCPUs appear to execute the target instruction set in which the target software layer is encoded. The VCPUs are dynamically mapped onto native cores of the microprocessor (such as instances of VLIW cores 191.1 to 191.4, enabled to execute the native instruction set) and strand contexts (held, for example, in one or more instances of register files 194A.1 to 194A.4 and/or strand contexts 194B.1 to 194B.4).
• Instrumentation, profiling, and analysis of the target software as it executes, to identify regions where a (sequential) instruction stream can be split into individual speculatively multithreaded strands. For example, the system partitions a sequential instruction stream executed by one or more of the VCPUs into a plurality of speculatively multithreaded strands.
• Insertion, based on the analysis, of instructions and/or code sequences into the target software, to invoke various speculative multithreading hardware units of the microprocessor to fork and join strands, to predict and/or propagate live-in values to strands, to manage memory versions and conflicts between strands, and to fork prefetch strands.
• Optimization of the target software to accelerate speculative multithreading performance, such as rescheduling instructions to produce critical strand live-in values earlier in time, deferring and/or reordering operations that inhibit parallelism so as to break or eliminate cross-strand dependencies and remove memory aliasing, and removing redundant operations within prefetch strands.
• Maintenance of a repository of modified, instrumented, and/or optimized code (e.g. via translation cache management 111), such that the code in the repository is invisible to the target code and is available for invocation by the strandware in place of the original target code (e.g. a portion of the target code before being modified, instrumented, or optimized).
• Handling of any internal exceptions or errors that result from any of the modification, instrumentation, and optimization (such as speculative multithreading) and that would not otherwise occur when executing the target software. In some circumstances, handling of the internal exceptions or errors includes re-optimizing and/or disabling optimizations that reduce performance.
• Provision of an optional mechanism for the target code to provide hints to the strandware, such as potentially beneficial fork points, synchronization points, potential cross-strand aliasing points, and other optimization information.

Binary Translation and Dynamic Optimization

In some embodiments, the microprocessor hardware is enabled to execute an internal instruction set that differs from the instruction set of the target software. In various embodiments, the strandware, optionally in cooperation with any combination of one or more hardware acceleration mechanisms, performs dynamic binary translation (e.g. via x86 binary translation 115) to translate target software of one or more target instruction sets (such as an x86-compatible instruction set, e.g. the x86-64 instruction set) into native micro-operations (uops). The hardware acceleration mechanisms include all or any portion of one or more of profiling hardware 181, hardware acceleration unit 182, transactional memory 183, and hardware x86 decoder 187. The microprocessor hardware (such as instances of VLIW cores 191.1 to 191.4) is enabled to directly execute the uops (and in various embodiments, the microprocessor hardware is not enabled to directly execute instructions of one or more of the target instruction sets). In at least some circumstances, the translation is then stored in a repository (e.g. via translation cache management 111) for rapid recall and reuse (e.g. as a strand image), thus eliminating translating again.

In various embodiments, the microprocessor is enabled to access (such as by being coupled or attached to) a relatively large memory area. The system implements the memory area via a dedicated DRAM module (included in the microprocessor or external to the microprocessor, in various embodiments), or alternatively as part of a reserved area of external system/strandware DRAM 184A that is invisible to the target code. The memory area provides storage for various elements of the strandware (such as one or more of code, stack, heap, and data), in some embodiments provides all or any portion of the translation cache (such as is managed by translation cache management 111), and optionally provides one or more buffers (such as speculative multithreading temporary state buffers). When the microprocessor initially boots (such as by performing a cold boot), the strandware code is copied from a flash ROM into the memory area (such as into the dedicated DRAM module or into the reserved portion of external system/strandware DRAM 184A), and the microprocessor then fetches native uops from the memory area.
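The translate-once, reuse-many-times behaviour of the repository described in this section can be sketched as a simple lookup table. This is a minimal illustration under stated assumptions: the class name `TranslationCache`, the `toy_translate` callback, and the uop names are all hypothetical and are not defined by the disclosure.

```python
class TranslationCache:
    """Toy repository mapping a target-code address to its translated uops."""

    def __init__(self, translate):
        self._translate = translate   # hypothetical translator callback
        self._store = {}              # target address -> translated uops
        self.translations = 0         # how many times translation actually ran

    def lookup(self, target_address):
        # Translate only on a miss; later lookups reuse the stored result.
        if target_address not in self._store:
            self._store[target_address] = self._translate(target_address)
            self.translations += 1
        return self._store[target_address]


def toy_translate(target_address):
    # Stand-in for binary translation: returns a tuple of made-up uop names.
    return (f"uop_load@{target_address:#x}", f"uop_add@{target_address:#x}")


cache = TranslationCache(toy_translate)
first = cache.lookup(0x1000)
second = cache.lookup(0x1000)   # served from the repository, no re-translation
```

The second lookup returns the stored translation without invoking the translator again, which is the property the text relies on to amortize translation cost.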
After the strandware initializes the microprocessor (e.g. via hardware control 172) and the internal data structures of the strandware, the strandware begins executing boot firmware and/or operating system kernel boot code (encoded in one or more of the target instruction sets) using binary translation (e.g. via x86 binary translation 115), similar to a conventional hardware-based microprocessor without a binary translation layer.

In some usage scenarios, using the strandware to perform binary translation and/or dynamic optimization offers advantages compared to adding speculative multithreading instructions to the target instruction set. In some circumstances, the binary translation and/or dynamic optimization enable simplifying the hardware of each core, for example by removing and/or reducing hardware for decoding the target instruction set (such as hardware x86 decoder 187) and hardware for out-of-order execution. In some embodiments, the removed and/or reduced hardware is conceptually replaced by one or more VLIW (Very Long Instruction Word) microprocessor cores (such as instances of VLIW cores 191.1 to 191.4). A VLIW core executes, for example, bundles of pre-scheduled uops, in which all of the uops of a bundle execute (or begin executing) in parallel, for example on a plurality of functional units (such as instances of ALUs 192A.1 to 192A.4 and FPUs 192B.1 to 192B.4). In various embodiments, the VLIW cores lack one or more of relatively complex decoding, hardware-based dependency analysis, and dynamic out-of-order scheduling. The VLIW cores optionally include local storage (such as instances of L1 D-caches 193.1 to 193.4 and register files 194A.1 to 194A.4) and other per-core hardware structures for efficient processing of instructions.
In some usage scenarios and/or embodiments, the VLIW cores are small enough to enable one or more of: packing more cores into a given die area, powering more cores within a given power budget, and clocking the cores at a higher frequency than would otherwise be possible with complex out-of-order cores. In some usage scenarios and/or embodiments, semantically isolating the VLIW cores from the target instruction set via binary translation enables efficient encoding of the uop format, the registers, and various details of the VLIW cores that are relevant to efficient speculative multithreading, without modifying the target instruction set.

Role of the Strandware Dynamic Optimization Software

A trace construction subsystem of the strandware layer (such as trace profiling and capture 120) collects and/or organizes translated uops into traces (e.g. uops from a sequence of translated basic blocks having a common control flow path through the target code) when executed by the microprocessor. The strandware performs relatively extensive optimizations (e.g. via optimization 163) using a variety of techniques. Some of the techniques are similar in scope to those performed by an optimizing compiler with access to source code, but the strandware uses dynamically measured program behavior, collected during profiling (e.g. via one or more of physical page profiling 121, branch profiling 124, predictive optimization 125, and memory profiling 127), to guide at least some of the optimizations. For example, loads from and stores to memory are selectively reordered (e.g. according to information obtained via memory aliasing analysis 162) to initiate cache misses as early as possible.
In some embodiments, the selective reordering is based at least in part on measurements (e.g. made via memory profiling 127) of loads and stores that reference a same address. In some usage scenarios and/or embodiments, the selective reordering enables relatively aggressive optimizations over a scope of hundreds of instructions. Each uop is then scheduled (e.g. by inserting it into a schedule via schedule each uop 165) according to when its input operands will be available and when various hardware resources (such as functional units) will be free. In some embodiments (such as some embodiments having functionality as illustrated by encode VLIW-like bundles 167), the scheduling attempts to pack up to four uops into each bundle. Having a plurality of uops in a bundle enables a particular VLIW core (such as any of VLIW cores 191.1 to 191.4) to execute the uops in parallel when the scheduled trace is later executed.
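The scheduling step described above (place each uop once its inputs are available, and pack up to four uops per bundle) can be sketched with a simple greedy scheduler. This is a toy model under loud assumptions: unit latency for every uop, no distinction between functional-unit types, no memory-ordering constraints, and hypothetical uop names. It is not the scheduler of the disclosure.

```python
MAX_UOPS_PER_BUNDLE = 4  # the text mentions packing up to four uops per bundle


def pack_bundles(uops, deps, width=MAX_UOPS_PER_BUNDLE):
    """Greedily place each uop in the earliest bundle after all of its inputs.

    uops: list of uop names in original program order.
    deps: {uop: [uops whose results it consumes]} (unit latency assumed).
    """
    slot = {}      # uop -> index of the bundle it was placed in
    bundles = []   # list of bundles, each a list of uop names
    for u in uops:
        # Earliest bundle in which all inputs are already available.
        i = max((slot[d] + 1 for d in deps.get(u, [])), default=0)
        # Skip bundles that are already full.
        while i < len(bundles) and len(bundles[i]) >= width:
            i += 1
        while len(bundles) <= i:
            bundles.append([])
        bundles[i].append(u)
        slot[u] = i
    return bundles


# Hypothetical trace: two loads feed an add, a third load feeds a multiply.
trace = ["ld_a", "ld_b", "add", "ld_c", "mul", "st"]
deps = {"add": ["ld_a", "ld_b"], "mul": ["add", "ld_c"], "st": ["mul"]}
bundles = pack_bundles(trace, deps)
```

In this example the independent load `ld_c` is hoisted into the first bundle alongside the other loads, which loosely mirrors the goal of starting cache misses as early as possible.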

Finally, the optimized trace (having VLIW bundles each with one or more uops) is inserted into the repository as all or part of a strand image (e.g. via translation cache management 111). In some embodiments, the hardware executes only native uops from traces stored in the translation cache, thus enabling continuous reuse of the optimization work performed by the strandware. In some usage scenarios and/or embodiments, traces are successively re-optimized through a series of increasingly higher-performance optimization levels, each level being relatively more expensive to perform (e.g. via promote 130), depending, for example, on how frequently a trace is executed.

In some embodiments, the dynamic optimization software enables some relatively aggressive optimizations via use of atomic execution. In some circumstances, instances of the relatively aggressive optimizations would be "unsafe" without atomic execution, i.e. would result in incorrect modification of architectural state. An example of atomic execution treats a group of uops (termed a commit group) as an indivisible unit with respect to modifications to architectural state. A trace optionally comprises one or more commit groups. If all of the uops of a commit group complete correctly (such as without any exceptions or errors), then architectural state is modified according to the results of all of the uops of the commit group. In other circumstances, the results of all of the uops of the commit group are discarded, leaving architectural state unchanged with respect to the commit group. For example, in the event of an exception detected with respect to a uop of a commit group (such as a page fault, or a branch that follows a path different from the path the trace originally followed), a rollback occurs and all of the results produced by all of the uops of the commit group are discarded. In some embodiments and/or usage scenarios, after the rollback, the microprocessor and/or the strandware re-execute, in original program order (and optionally without one or more optimizations), the instructions corresponding to the uops of the commit group, to pinpoint the source of the exception. Co-pending U.S. Patent Application 10/994,774, entitled "Method and Apparatus for Incremental Commitment to Architectural State", discloses other information regarding dynamic optimization and commit groups.
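The all-or-nothing behaviour of a commit group can be illustrated with a scratch copy of the state that is published only when every uop succeeds. This is a sketch of the semantics only, under stated assumptions: the names `run_commit_group` and `UopFault`, the dictionary-based state, and the register names are hypothetical, and real hardware would not copy the whole architectural state.

```python
class UopFault(Exception):
    """Stand-in for an exception detected while executing a uop."""


def run_commit_group(arch_state, uops):
    """Apply all uops to a scratch copy; publish only if every uop succeeds."""
    speculative = dict(arch_state)      # scratch copy of architectural state
    try:
        for uop in uops:
            uop(speculative)
    except UopFault:
        return False                    # rollback: arch_state is untouched
    arch_state.clear()
    arch_state.update(speculative)      # commit all results at once
    return True


def inc_rax(state):
    state["rax"] += 1


def faulting_load(state):
    raise UopFault("page fault")


state = {"rax": 0, "rbx": 7}
ok = run_commit_group(state, [inc_rax, inc_rax])            # commits
failed = run_commit_group(state, [inc_rax, faulting_load])  # rolls back
```

After the second call the state still reflects only the first, successful group; a caller could then re-execute the corresponding instructions in original program order to locate the fault, as the text describes.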

Apparatus f〇r Incremental Commitment to Architectural State"之共同待審美國專利申請案1〇/994,774揭示關於動態最佳化及提交群組之其他資訊。在某些具體實施例及/或使用方案中，組合運作之硬體與軟體（例如）藉由經由相對較積極VLIW追蹤排程與最佳化擁取單一串内之精細平行性實現類似於無序動態排程微處理器之益處。在各種具體實施例中，類似於無序微處理器’硬體與軟體實行精細平行性擷取，同時相對高效地重新排序及使獨立串交錯以涵蓋記憶體延時停止。在某些環境中，硬體與軟體實現橫跨許多核及/執行緒之相對高效縮放，實現每時脈可能數百個微運算碼之有效發出寬度。多執行緒動態最佳化在具有大容量多核及/或多執行緒微處理器之某些具體實施例中，實施動態最佳化軟體以相對高效地使用複數個 136758.doc -29- 200935303The copending U.S. Patent Application Serial No. 1/994,774, the entire disclosure of which is incorporated herein by reference. In some embodiments and/or usage scenarios, the hardware and software of the combined operation are similar to none, for example, by tracking the scheduling through a relatively aggressive VLIW and optimizing the fine parallelism within a single string. The benefits of sequencing dynamic scheduling microprocessors. In various embodiments, fine parallelism is performed similar to the unordered microprocessor' hardware and software, while reordering relatively efficiently and interleaving the independent strings to cover memory latency stalls. In some environments, hardware and software implement relatively efficient scaling across many cores and/or threads, enabling an effective emission width of hundreds of micro-ops per clock. Multi-Thread Dynamic Optimization In some specific embodiments with large-capacity multi-core and/or multi-threaded microprocessors, dynamic optimization software is implemented to use a plurality of relatively efficient 136758.doc -29- 200935303

核及/或執行緒之諸。例如，追蹤逐行分析與捕獲！20、串建構14〇、排程與最佳化16〇及χ86二進制轉譯I} $之一或 ^個係在1多個級處遍佈式多執行緒化，致使能夠減夕/肖耗或有效隱藏與二進制轉譯及/或動態最佳化關聯之某些或全部額外負擔。微處理器以背景方式執行動態最佳化軟體以便不妨礙執行目標碼（例如透過來自轉譯快取記憶體之已最佳化程式碼）中之正向進展。各種具體實施例實施用以實現執行動態最佳化軟體之背景方式的一或多 «I例如’微處理器及/或串體將資源之部分（例如在 -多核微處理器具體實施例中之一或多個核)明確專用於執行動態最佳化軟體。該專用係永久的，或#代地為瞬變及/或動態的’例如當資源之部分係可用時(例如當目標碼明顯將未使用VCPU置於閒置狀態下時卜對於另一範例， -或多個核之優先權控制機制實現串體執行緒(其映射⑽ 如)至目標可見VCPU)在具有很少或不具有可觀察效能降級之情況下共用該等核及關聯快取記憶體（例如，藉由使用由依據目標指令集架構或ISA執行之已停止目標執行緒所建立之鬆弛循環; 硬體與串體實施方案在各種具體實施例中，圖1A所解說之元件對應於圖汨與1C所解說之功能性之全部或部分。例如，你呆些具體實施例中，圖1A之DRAM 2002.1對應於圖1C之外部系體DRAM 1 84A，且轉譯快取記憶體管理i營 s *里轉譯快取記憶體2002.1B。對於另一範例，在某些具 •貝苑例中， I36758.doc -30- 200935303 圖1A之VLIW核2013.1對應於圖1C之VLIW核191·1至191.4 之一或多個’圖1Α之異動記憶體2014.1對應於圖1C之異動記憶體183，且圖1Α之逐行分析單元2011.1對應於圖1C之逐行分析硬體181。對於另一範例’在某些具體實施例中圖1Α之串管理單元2012.1對應於耦合至圖ic之暫存器檔案 194A.1至194A.4及/或串内容脈絡194B.1至194B.4之一或多個的控制邏輯。對於圖ΙΑ、1B及1C之元件間之對應之另一範例，在某些具體實施例中，圖1A之串體影像2〇〇4具有圖1B與1C之串體層11 0A與110B之全部或任何部分之初始影像。對於另一範例，在某些具體實施例中，圖1A之具備串能力之微處理器2001.1實施藉由圖1C之硬體層19〇所例示之功能。在各種具體實施例中，圖1C之晶片組/pcie匯流排介面 186、多插座系統互連198及/或？。特快、Qpi、超傳輸^的之全部或任何部分實施與圖1A之耦合2050、2〇55、2〇56、 2063、205 1.1及2053關聯之介面之全部或任何部分。在各種具體實施例中，與圖ic之中斷、SMP及計時器175結合運作的晶片組/PCIe匯流排介面186及/或pci特快、Qpi、超傳輸199之全部或任何部分實施圖1A之鍵盤/顯示器2〇〇5 及/或周邊設備2006之全部或任何部分。在各種具體實施例中，圖1CiDRAM控制器與北橋197之全部或^何部分實施與圖1Α之耦合2052.1關聯之介面之全部或任何部分f 推測多執行緒化模型各種具體實施例之推測多執行緒化係用於使用在其中始 136758.doc 200935303 Π::確定性已程式排序執行之外觀的未修改目標碼程式排序執=執行緒化提供-已嚴格 ^”有至多後繼串。若上代串p分又第一下、、、後在與S1聯合之前及/或在S1終止之前嘗試分又 Μ二代串S2,則82之分又係無效的（例如，（例如）藉由將串嘗無操作或視為卿抑㈣之分又）。若上代 ❹ ⑩ 刀又且不存在Μ資源（例如不存在自由執行緒二絡)來完成分又，則視需要地取決於分又為什麼類 ^分又抑制分又或替代地阻隔已分又執行緒直到資付可用。缺在某些具體實施例中’微處理器經啟用以依據原生微運异碼指令集執行’原生微運算碼指令集包括可用以分又串、控制串間之互動、聯合串及中止(例如殺除)串之各式各樣微運算碼、特徵及内部暫存器。在某些具體實施例中’各式各樣微運算碼包括： nype target，inherit引導微處理器建立上代串ρ之新後繼串S。微處理器（經由硬體與軟體元件之任何组合）依據一或多個串體及/或硬體定義之政策將後繼串映射至微處理器之特定核與執行緒。執行上代串^。⑽運算碼的特定VCPU擁有後繼串（連同上代串一起）。後繼串之執行在由target參數所指定（根據亊體位址空間内之一原生微運算碼位址或作為—目標碼RIp)之目標位址處開始。inhe出參數係用作執行分叉操作之後將藉由上代申 I36758.doc •32· 200935303 修改哪些暫存n収應將哪㈣存器複製⑽承）至後繼串之一指示（參見定位於本文別處之小節”前跳越串"）。參數針對後繼串指定數個不同串類型之一(例如精細 . 
前跳越串、完全推測多執行緒串、預提取串或具有其他語意或用途之串）。fGrk微運算碼提供—為串ι〇之輸出 • 值。串1D係與後繼串關聯之識別項（其至少在同一 vcpu 内係全局唯—的），該識別項指定後繼串相對於與擁有上代與後繼串兩者之特定VCPU關聯之所有其他串之程 © 式順序。 • kill.cmptype· ce ra，rb，τ引導微處理器消除一或多個串。更明確言之，當在上代串p内加以執行時，k⑴以遞回方式中止P之後繼争8(若有的話）及8之所有後繼串（若有的話）。kiU微運算碼之執行經由指定ALU運算 cmptype(例如kU1· sub或kiu and)比較暫存器運算元^與树而產生—結果，然、後檢查該結果之指定條件碼cc(例 ❹如j於或等於）。若指定條件為真，串範疇識別項τ與關聯分叉微運算碼之串範疇識別項匹配且巢套式分又深度 . 為零’則殺除上代串p之後繼串。參見定位於本文別處 - 之小節”串範疇識別"與，，巢套式串"以瞭解另外揭示内容。 • waiuype [object]引導微處理器停止執懸而未決。更明確言之，當在"内加以執行時二造成串S之執行在進行之前等待一指定條件（且視需要地等待-指定物件’例如記憶體位址）。例如，在某些具 136758.doc -33- 200935303 體實施例令’微處理器經啟用以等待直至串為架構的 (例如非推測的），等待待寫入之一特定記憶體位置，等待直至後繼串完成及等待直至上代串達到某一狀態。 • join引導微處理器阻隔與上代串關聯之推測後繼串之執行，直到上代串與後繼串聯合。當一特定串在為推測性的同時不能夠正向進展時藉由串體執行聯合微運算碼。 •微運算碼視需要地包括-傳播位元，其指示硬體將微運算碼之結果（在上代串中）發送至上代串之後繼串。參見定位於本文別處之小節"前跳越串"以瞭解與傳播位元修改之另外揭示内容。在某些具體實施例中’藉由執行複數個其他微運算碼、實行寫入至内部機器狀態暫存器、以視需要地自動方式調用各別不同以非微運算碼為基礎的硬體機制或其任何板合來實施上述微運算碼之功能性之某些或全部。，在其中上代串分又—推測後繼串之各種使用方案中Nuclear and / or thread. For example, tracking line-by-line analysis and capture! 20, string construction 14 〇, scheduling and optimization 16 〇 and χ 86 binary translation I} $ one or ^ system at more than 1 level of ubiquitous multi-threading, resulting in reduced eve / Xiao consumption or effective Hide some or all of the additional burden associated with binary translation and/or dynamic optimization. The microprocessor executes the dynamic optimization software in a background manner so as not to interfere with the positive progression in the execution of the object code (e.g., through the optimized code from the translation cache). Various embodiments implement one or more of the "1" microprocessors and/or string portions of the resources used to implement the background of the dynamic optimization software (eg, in a multi-core microprocessor embodiment) One or more cores are explicitly dedicated to performing dynamic optimization software. 
The dedication is permanent, or alternatively transient and/or dynamic, for example applying when a portion of the resources is available (e.g., when the target code explicitly places an unused VCPU in an idle state). For another example, priority control mechanisms of one or more cores enable strandware threads (which map, for example, to target-visible VCPUs) to share the cores and associated caches with little or no observable performance degradation (e.g., by using slack cycles created by stalled target threads executing according to the target instruction set architecture, or ISA).

Hardware and Strandware Implementation

In various embodiments, the elements illustrated in FIG. 1A correspond to all or portions of the functionality illustrated in FIGS. 1B and 1C. For example, in some embodiments, DRAM 2002.1 of FIG. 1A corresponds to external system DRAM 184A of FIG. 1C, and translation cache management manages translation cache 2002.1B. For another example, in some embodiments, VLIW core 2013.1 of FIG. 1A corresponds to one or more of VLIW cores 191.1 to 191.4 of FIG. 1C, transactional memory 2014.1 of FIG. 1A corresponds to transactional memory 183 of FIG. 1C, and profiling unit 2011.1 of FIG. 1A corresponds to profiling hardware 181 of FIG. 1C. For another example, in some embodiments, strand management unit 2012.1 of FIG. 1A corresponds to control logic coupled to one or more of register files 194A.1 to 194A.4 and/or strand contexts 194B.1 to 194B.4 of FIG. 1C.

As another example of the correspondence between the elements of FIGS. 1A, 1B, and 1C, in some embodiments, strandware image 2004 of FIG. 1A holds an initial image of all or any portion of strandware layers 110A and 110B of FIGS. 1B and 1C. For another example, in some embodiments, strand-capable microprocessor 2001.1 of FIG. 1A implements the functions exemplified by hardware layer 190 of FIG. 1C. In various embodiments, all or any portion of chipset/PCIe bus interface 186, multi-socket system interconnect 198, and/or PCI Express, QPI, HyperTransport 199 of FIG. 1C implement all or any portion of the interfaces associated with couplings 2050, 2055, 2056, 2063, 2051.1, and 2053 of FIG. 1A. In various embodiments, all or any portion of chipset/PCIe bus interface 186 and/or PCI Express, QPI, HyperTransport 199, operating in conjunction with interrupts, SMP, and timers 175 of FIG. 1C, implement all or any portion of keyboard/display 2005 and/or peripherals 2006 of FIG. 1A. In various embodiments, all or any portion of DRAM controller and northbridge 197 of FIG. 1C implement all or any portion of the interface associated with coupling 2052.1 of FIG. 1A.

Speculative Multithreading Model

The speculative multithreading of various embodiments is for use with unmodified target code, while preserving the appearance of deterministic, program-ordered execution. The speculative multithreading is strictly program ordered, and each strand has at most one successor strand. If a parent strand P forks a first successor strand S1 and then attempts to fork a second successor strand S2 before joining with S1 and/or before S1 terminates, then the fork of S2 is ineffective (e.g., the fork of S2 is suppressed, such as by treating it as a no-operation, or NOP). If a parent strand attempts a fork and no resources are available (e.g., no free thread context exists) to complete the fork, then, optionally depending on the type of the fork, the fork is suppressed or alternatively the forking strand is blocked until resources become available.
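The fork rules just described can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware mechanism of the specification: the names `Strand`, `ForkResult`, and `try_fork` are hypothetical, and the `block_when_full` flag is an assumed stand-in for the fork-type-dependent choice between suppressing and blocking.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ForkResult(Enum):
    FORKED = "forked"          # a successor strand was created
    SUPPRESSED = "suppressed"  # the fork was treated as a NOP
    BLOCKED = "blocked"        # the forking strand waits for a free context


@dataclass
class Strand:
    name: str
    successor: Optional["Strand"] = None  # at most one successor per strand


def try_fork(parent: Strand, free_contexts: int, block_when_full: bool) -> ForkResult:
    """Model what happens to one fork attempt by `parent` (hypothetical helper)."""
    # A parent that still has an unjoined, unterminated successor cannot
    # fork again: the second fork is ineffective.
    if parent.successor is not None:
        return ForkResult.SUPPRESSED
    # No free thread context: suppress or block, depending on the fork type
    # (modeled here by the assumed block_when_full flag).
    if free_contexts == 0:
        return ForkResult.BLOCKED if block_when_full else ForkResult.SUPPRESSED
    parent.successor = Strand(name=parent.name + ".succ")
    return ForkResult.FORKED
```

In this model a strand's `successor` field would be cleared on join or termination, which is what re-enables forking; that bookkeeping is omitted for brevity.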
In some embodiments, the microprocessor is enabled to execute according to a native uop instruction set that includes a variety of uops, features, and internal registers usable to fork strands, control interactions between strands, join strands, and abort (e.g., kill) strands. In some embodiments, the variety of uops includes:

• fork.type target, inherit directs the microprocessor to create a new successor strand S of parent strand P. The microprocessor (via any combination of hardware and software elements) maps the successor strand onto a particular core and thread of the microprocessor according to one or more strandware-defined and/or hardware-defined policies. The particular VCPU that executes the fork uop of the parent strand owns the successor strand (along with the parent strand). Execution of the successor strand begins at the target address specified by the target parameter (either as a native uop address within the strandware address space or as a target code RIP). The inherit parameter indicates which registers the parent strand will modify after performing the fork operation and which registers are to be copied (inherited) to the successor strand (see the section "Skipahead Strands" located elsewhere herein). The type parameter specifies one of several strand types for the successor strand (e.g., a fine-grained skipahead strand, a fully speculative multithreaded strand, a prefetch strand, or a strand with other semantics or uses). The fork uop provides an output value that is a strand ID. The strand ID is an identifier associated with the successor strand (globally unique at least within the same VCPU) that specifies the program order of the successor strand relative to all other strands associated with the particular VCPU that owns both the parent and successor strands.
• kill.cmptype.cc ra, rb, T directs the microprocessor to eliminate one or more strands. More specifically, when executed within parent strand P, kill recursively aborts the successor strand S of P (if any) and all successor strands of S (if any). Execution of the kill uop compares register operands ra and rb via the specified ALU operation cmptype (e.g., kill.sub or kill.and) to produce a result, and then checks the specified condition code cc of the result (e.g., less than or equal). If the specified condition is true, the strand scope identifier T matches the strand scope identifier of the associated fork uop, and the nested fork depth is zero, then the successor strands of parent strand P are killed. See the sections "Strand Scope Identification" and "Nested Strands" located elsewhere herein for further disclosure.

• wait.type [object] directs the microprocessor to suspend execution pending a condition. More specifically, when executed within a strand S, wait causes execution of strand S to wait for a specified condition (and optionally a specified object, such as a memory address) before proceeding. For example, in some embodiments, the microprocessor is enabled to wait until the strand is architectural (e.g., non-speculative), to wait for a particular memory location to be written, to wait until a successor strand completes, and to wait until a parent strand reaches a certain state.

• join directs the microprocessor to block execution of a speculative successor strand associated with a parent strand until the parent strand joins with the successor strand. The strandware executes the join uop when a particular strand is unable to make forward progress while it is speculative.
• Uops optionally include a propagate bit that directs the hardware to send the result of the uop (in the parent strand) to the successor strand of the parent strand. See the section "Skipahead Strands" located elsewhere herein for further disclosure relating to the propagate bit.

In some embodiments, some or all of the functionality of the uops described above is implemented by executing a plurality of other uops, by performing writes to internal machine state registers, by invoking separate non-uop-based hardware mechanisms (optionally automatically), or by any combination thereof.
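As a rough illustration of the kill uop's three-part test (condition true, scope identifier match, nested fork depth zero), the following sketch models it in software. Everything here is an assumption made for illustration: the class and function names are hypothetical, the condition codes are evaluated on a plain signed result rather than on hardware flags, and the scope identifier of the fork uop is simply stored on the strand that the fork created.

```python
from typing import Optional


class Strand:
    def __init__(self, scope_id: int, successor: Optional["Strand"] = None):
        # scope_id: scope identifier of the fork uop that created this strand
        self.scope_id = scope_id
        self.successor = successor
        self.alive = True


# Simplified ALU operations and condition codes (assumption: real hardware
# would evaluate flags such as overflow and carry, which are ignored here).
ALU_OPS = {"sub": lambda a, b: a - b, "and": lambda a, b: a & b}
CONDITIONS = {
    "eq": lambda r: r == 0,
    "ne": lambda r: r != 0,
    "le": lambda r: r <= 0,
    "gt": lambda r: r > 0,
}


def kill(parent, cmptype, cc, ra, rb, scope_id, nested_depth):
    """Model kill.cmptype.cc ra, rb, T; returns the number of strands killed."""
    result = ALU_OPS[cmptype](ra, rb)
    first = parent.successor
    if first is None or not CONDITIONS[cc](result):
        return 0
    # The scope identifier must match the associated fork, and the kill
    # must not be inside a nested fork.
    if first.scope_id != scope_id or nested_depth != 0:
        return 0
    killed = 0
    strand = first
    while strand is not None:  # the successor, then all of its successors
        strand.alive = False
        killed += 1
        strand = strand.successor
    parent.successor = None
    return killed
```

Because each strand has at most one successor, the recursive abort reduces to walking a chain, which is why the sketch uses a loop.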

In various usage scenarios in which a parent strand forks a speculative successor strand, there are several reasons for the successor strand to wait or suspend execution (e.g., pause or temporarily halt) and wait for the parent strand. For example, if an exception occurs in a speculative strand, the exception may merely indicate a misspeculation, so the strand waits rather than handling the exception speculatively. For another example, a speculative strand attempts a particular operation that is restricted to use only in non-speculative (architectural) strands. Restricted operations optionally include accessing I/O (e.g., via PCI Express, QPI, or HyperTransport), reading or writing particular memory regions (e.g., uncacheable memory), entering a portion of the strandware that is restricted to non-speculative execution, or attempting to use the result of a deferred operation.

When a parent strand joins with a waiting successor strand, and if the parent strand verifies that all live-outs of the parent strand match the live-ins of the successor strand, then the exception of the successor strand is "genuine". The exception is genuine in the sense that it is not a side effect of incorrect speculation, and the microprocessor therefore handles the exception in an architecturally visible manner. In various circumstances, when execution of the successor strand resumes, the successor strand is immediately vectored on the basis of the exception (e.g., into the operating system kernel) to handle the exception (e.g., a page fault). In some circumstances, when execution of the successor strand resumes, execution continues without error, because the successor strand is now architectural (non-speculative).

The microprocessor joins strands in program order, and each VCPU owns one or more of the strands. The most current architectural strand represents the architectural state of the VCPU that owns the strand. The microprocessor makes the architectural state available for observation outside the owning VCPU (e.g., via stores committed to memory). The microprocessor is enabled to freely move the most current architectural strand between cores within the microprocessor while the owning VCPU appears to continue executing (e.g., as observed by an operating system kernel executing on the owning VCPU).
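The join-time check described above can be sketched as follows. This is a minimal model under stated assumptions: the function name `join_outcome` and the string results are hypothetical, live-ins and live-outs are modeled as dictionaries from a location name to a value, and the parent's dictionary is assumed to hold the final value of every location the successor read. What the hardware does on a mismatch is not described in this passage, so the sketch only reports it.

```python
from typing import Dict, Optional


def join_outcome(parent_live_outs: Dict[str, int],
                 successor_live_ins: Dict[str, int],
                 pending_exception: Optional[str]) -> str:
    """Classify what happens when a parent joins a waiting successor."""
    # Every value the successor consumed must equal what the parent produced.
    verified = all(
        loc in parent_live_outs and parent_live_outs[loc] == value
        for loc, value in successor_live_ins.items()
    )
    if not verified:
        return "misspeculated"  # the successor's exception is not genuine
    if pending_exception is not None:
        # Genuine exception: handled in an architecturally visible manner.
        return "handle:" + pending_exception
    return "resume"  # the successor is now architectural and continues
```

For instance, a successor that took a page fault after reading a correctly predicted `rax` would yield `handle:page_fault` in this model, while the same fault after reading a stale value would be classified as misspeculation.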
Speculative Multithreading Strategies

The microprocessor hardware and the microprocessor strandware (software) enable speculative multithreading at several levels of increasingly broad scope:

• A prefetch strand is optionally forked automatically when a strand stalls on a relatively long-latency cache operation (e.g., a cache miss satisfied from main memory); see the section "Prefetch Strands" located elsewhere herein. Before the data is used (e.g., before the data is accessed by the parent strand from which the prefetch strand was forked), the prefetch strand attempts to fetch data expected to be used into one or more caches and/or attempts to prime one or more branch predictors with appropriate data. In some circumstances, a prefetch strand is active for hundreds of cycles. In some embodiments, the system provides for any type of strand to fork a prefetch strand, as long as the forking strand has not forked another strand (thereby preventing scenarios in which a particular strand has more than one successor strand). In some embodiments, the system provides for any type of strand to fork a prefetch strand even when the forking strand has already forked another strand (resulting in scenarios in which a particular strand has more than one successor strand). In some embodiments, the hardware has logic to selectively enable or suppress creation of prefetch strands according to one or more software-controllable and/or strandware-controllable prefetch policies.

• A skipahead strand is forked by a parent strand when the strandware determines that the parent strand is relatively likely to stall on a particular instruction (e.g., a load that relatively frequently encounters cache misses); see the section "Skipahead Strands" located elsewhere herein. Alternatively, a skipahead strand is forked so that it begins executing after a relatively highly predictable final branch (e.g., a branch with a correct prediction rate greater than a predetermined and/or programmable threshold). The skipahead strand blocks until the parent strand provides the live-ins on which the skipahead strand depends; for example, when the parent strand produces a live-out, the value of the live-out is sent to the skipahead strand (where a subset of the live-outs of the parent strand are the live-ins of the skipahead strand). The live-outs that are sent optionally include registers and/or memory locations.

• Speculative strand threading (SST) strands are forked by the strandware on the basis of dynamically (and optionally statically) inferred control flow structures and idioms (see the section "Speculative Strand Threading (SST)" located elsewhere herein).

The structures and idioms include iterative structures (e.g., loops), calls and returns (e.g., calls to and returns from subroutines, functions, procedures, and libraries), and control flow joins (e.g., a common join point reached by both the "if" and "else" paths of a conditional block). An SST strand contains one or more instruction sequences (e.g., basic blocks, traces, commit groups, or other quantities of instructions). In some scenarios, dynamic control flow changes occur at the end of each instruction sequence to determine the next instruction sequence the strand is to execute. Control flow changes within an SST strand (unlike some other strand types) occur independently of the control flow within the successor strands of the SST strand. Control flow changes within an SST strand relatively rarely invalidate the successor strands.

In some circumstances, the system selectively converts an SST strand into a prefetch strand. In some circumstances, an SST strand is active for tens of thousands or hundreds of thousands of cycles.

• In some embodiments, profiling strands are used during construction of SST strands (see the section "Instrumentation for Profiling" located elsewhere herein) to gather data forwarded across strands. In contrast to other strands, a profiling strand executes serially (e.g., in program order) rather than in parallel with the parent strand of the profiling strand.

Prefetch Strands

In some circumstances of strand execution, execution encounters a stall event (e.g., a cache miss) that blocks progress. In response, the microprocessor optionally forks a prefetch strand while stalling the strand that encountered the stall event.

The microprocessor allocates the (new) prefetch strand (in some embodiments, on the same core as the parent strand, but in a different strand context). The prefetch strand starts with the architectural state (registers and memory) of the parent strand. The prefetch strand continues executing until information is delivered to the stalled (parent) strand that enables the stalled strand to resume processing (e.g., the data for the cache miss is delivered to the stalled strand). The microprocessor (e.g., elements of hardware layer 190) then automatically destroys the prefetch strand and unblocks the stalled strand. In some embodiments, the microprocessor has logic to selectively enable or suppress creation of prefetch strands according to one or more software-controllable and/or strandware-controllable prefetch policies. For example, the strandware configures the microprocessor to fork a prefetch strand when a miss encountered by a strand results in a main memory access, and to stall the strand when a miss results in an L2 or L3 hit.
In some circumstances while executing a load, the prefetch strand encounters a relatively long-latency cache miss (e.g., the miss that led to the fork of the prefetch strand). Such a load delivers (in the context of the prefetch strand) an "ambiguous" placeholder value that is distinguished from all other data values delivered (e.g., those available via cache hits). The prefetch strand uses the ambiguous placeholder value for the data value rather than blocking. A uop having an ambiguous input propagates the ambiguous value as the result of the uop (sometimes referred to as the uop "outputting an ambiguous value"). The microprocessor treats a branch having an ambiguous input as if the predicted destination of the branch matched the actual destination of the branch.

When a prefetch strand executes a store, the store is directed to a cache line or temporary memory buffer element that is visible (e.g., observable and controllable) only to the prefetch strand, preventing the parent strand from observing the store. In some embodiments, if a store writes an ambiguous value (e.g., into a cache), then the destination of the store receives the ambiguous value (e.g., the affected bytes in one or more cache lines of the cache are marked as ambiguous). A subsequent load from the destination receives the ambiguous value, thereby propagating the ambiguous value. In various usage scenarios, propagation of ambiguous values enables avoiding prefetching of data that is not needed (e.g., when loading a pointer) and/or avoiding what would otherwise be incorrect or inefficient updates of the branch predictors (e.g., when loading a branch condition).
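The propagation of ambiguous values can be sketched with a small software model. This is a simplified illustration, not the hardware described here: the names `AMBIGUOUS`, `execute_uop`, and `PrefetchStrand` are hypothetical, the cache is a plain dictionary, and two behaviors are assumptions made for the sketch (a load with an ambiguous address starts no fetch, and a store with an ambiguous address is dropped).

```python
AMBIGUOUS = object()  # placeholder distinct from every real data value


def execute_uop(op, *inputs):
    """A uop with any ambiguous input outputs an ambiguous value."""
    if any(x is AMBIGUOUS for x in inputs):
        return AMBIGUOUS
    return op(*inputs)


class PrefetchStrand:
    def __init__(self, cache):
        self.cache = cache         # address -> value already present in cache
        self.private_stores = {}   # stores visible only to this prefetch strand
        self.prefetched = []       # addresses for which a fill was started

    def load(self, addr):
        if addr is AMBIGUOUS:
            return AMBIGUOUS       # unknown address: nothing useful to fetch
        if addr in self.private_stores:
            return self.private_stores[addr]
        if addr in self.cache:
            return self.cache[addr]
        self.prefetched.append(addr)  # start the fill, but do not wait for it
        return AMBIGUOUS

    def store(self, addr, value):
        if addr is not AMBIGUOUS:
            # The stored value may itself be AMBIGUOUS; later loads see that.
            self.private_stores[addr] = value
```

In this model, chasing a pointer whose value missed in the cache issues one fill for the pointer's own location and then stops, rather than fetching from a meaningless address, which mirrors the "loading a pointer" case mentioned above.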

In some embodiments, the microprocessor has logic to configure the conditions and thresholds under which loads that encounter cache misses return ambiguous results (instead of stalling the prefetch strand). For example, the strandware configures the microprocessor to produce ambiguous values only for cache misses that result in main memory accesses, and to stall for other cache misses.

In various usage scenarios (e.g., integer and/or floating-point code), prefetch strands make data available before the parent strand uses it (reducing or eliminating cache misses) and/or prime the branch predictors (reducing or eliminating mispredictions). Various embodiments use prefetch strands instead of hardware prefetching, or in addition to hardware prefetching.

In some circumstances in which a prefetch strand is forked from a parent strand, the prefetch strand executes for hundreds of cycles while the parent strand is waiting on a cache miss (e.g., while the miss is being satisfied from main memory implemented, for example, as DRAM). In some usage scenarios and/or embodiments, the system enables a prefetch strand to make forward progress for a relatively large portion of the time the parent strand is waiting. For example, the strandware constructs one or more traces for use in prefetch strands, and the traces optionally exclude uops having certain properties:

• uops that make no contribution to the generation of memory addresses;
• uops that serve only to verify branches that are relatively easy to predict;
• with respect to a trace within a particular prefetch strand, uops that store to memory values that are not read (or are relatively unlikely to be read) within the prefetch strand;
• uops that load data that is already present (or is relatively likely to be present) in the cache before the uop executes;
• more generally, uops having properties that make the uop irrelevant to prefetching.

In some embodiments and/or usage scenarios, the microprocessor executes the prefetch strand far ahead of the (waiting) parent strand (time permitting). For example, the strandware attempts to minimize (by eliminating or reducing) the uops in a trace, leaving only the uops on one or more critical paths leading to the execution of particular loads. The particular loads are, for example, loads that relatively frequently result in cache misses, loads that result in cache misses with relatively long fill latency, or any combination thereof.
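One way to picture the pruning described above is as a backward slice over a trace: keep the particular loads, plus only those uops whose results feed (directly or indirectly) the addresses of those loads. The sketch below is an assumption-laden illustration, not the strandware's actual optimizer: the `Uop` record and `prune_for_prefetch` are hypothetical names, and only register dependences are tracked (dependences through memory are ignored).

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Uop(NamedTuple):
    dest: Optional[str]     # destination register, or None
    srcs: Tuple[str, ...]   # source registers
    kind: str               # "alu", "load", "store", or "branch"


def prune_for_prefetch(trace: List[Uop], delinquent: Set[int]) -> List[int]:
    """Return indices of the uops kept in the prefetch trace.

    `delinquent` holds the indices of the particular loads to prefetch for.
    """
    needed: Set[str] = set()   # registers still required by kept uops
    kept: List[int] = []
    for i in range(len(trace) - 1, -1, -1):   # walk the trace backwards
        u = trace[i]
        if i in delinquent or (u.dest is not None and u.dest in needed):
            kept.append(i)
            needed.discard(u.dest)   # this uop satisfies the need for dest
            needed.update(u.srcs)    # and in turn needs its own sources
    kept.reverse()
    return kept


trace = [
    Uop("r1", ("r0",), "load"),      # 0: r1 = [r0]
    Uop("r2", ("r1",), "alu"),       # 1: r2 = r1 + 8
    Uop("r5", ("r3", "r4"), "alu"),  # 2: unrelated arithmetic
    Uop(None, ("r5", "r6"), "store"),# 3: store not read by the slice
    Uop(None, ("r5",), "branch"),    # 4: branch check only
    Uop("r7", ("r2",), "load"),      # 5: the particular (delinquent) load
]
kept_indices = prune_for_prefetch(trace, {5})
```

Here the unrelated arithmetic, the store, and the branch check fall outside the slice, matching the kinds of uops the passage says are optionally excluded.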
In some embodiments, the strandware, in conjunction with the hardware (e.g., cache miss performance counters), collects and maintains profiling data structures used to determine the particular loads, for example by gathering information about delinquent loads. When optimizing prefetch traces, the strandware optionally operates to reduce the dataflow that produces the target addresses of the particular loads.

Skipahead Strands

Skipahead Multithreading Model

The profiling subsystem of the strandware layer (e.g., trace profiling and capture 120 of FIG. 1B) identifies selected traces as candidates for skipahead speculative multithreading when executed by the microprocessor.

In some embodiments and/or usage scenarios, the system uses skipahead strands for traces having relatively highly predictable terminal branches (e.g., unconditional branches, loop instruction branches, or branches the system has predicted relatively successfully). The system optionally selects candidates on the basis of one or more characteristics. One example characteristic is relatively low static instruction-level parallelism (ILP), for example due to a relatively large number of NOPs. Another example characteristic is relatively low dynamic ILP (e.g., having loads that stall relatively frequently, resulting in dynamic scheduling gaps that are relatively difficult to observe statically). Another example characteristic is parallel issue potential greater than a single core is able to provide.

Skipahead speculative multithreading is effective in some usage scenarios having traces that contain entire loop iterations and/or in which relatively few dependences exist between loop iterations. Skipahead speculative multithreading is also effective in some usage scenarios having calls and returns that are not candidates for inline expansion into a single trace. In some usage scenarios and/or embodiments, skipahead speculative multithreading yields performance levels similar to those of a ROB-based out-of-order core (but with relatively less hardware complexity). In some skipahead speculative multithreading circumstances, a successor strand skips ahead over hundreds of instructions before beginning its trace. In some circumstances, the performance improvement realized by skipahead speculative multithreading (e.g., achieved through relatively high or maximal overlap) depends on relatively accurate prediction of the starting address of the successor strand and on data independence.

FIG. 2 illustrates an example of hardware executing a skipahead strand (e.g., as synthesized by the strandware), plotted as time in cycles versus core or interconnect. In the illustration, the term "skipahead strand" refers to execution (as a strand) of target code (or a binary-translated version thereof), where the skipahead strand begins executing at the next instruction (or binary-translated equivalent) that is executed (in some circumstances) after the terminal trace of the parent strand.

For each skipahead strand, the code generator of the strandware layer (e.g., one or more elements of scheduling and optimization 160 and/or strand construction 140 of FIG. 1B) inserts a fork.skip uop into the terminal trace of the parent strand. The "terminal trace" of a strand is the last trace executed by the strand before the strand reaches its join point. When the system executes the fork.skip uop, the system forks a new (e.g., successor or child) strand as the skipahead strand. The skipahead strand begins executing at the next instruction (or binary-translated version thereof) executed in program order after reaching the end of the trace that contains the fork.skip uop. In some embodiments, for a terminal trace that ends in a conditional or indirect branch, the skipahead strand begins at the dynamically determined target of the branch. In some usage scenarios and/or embodiments, the system dynamically selects the fork target via a trace predictor and/or a branch predictor. In scenarios in which the terminal trace ends in an unconditional branch and/or the strandware ends the trace in the middle of a basic block, the starting point of the skipahead strand is determined when the terminal trace is generated.

In FIG. 2, fork.skip uop 211 in parent strand 200 has created successor strand 201, illustrated in the right column as executing on core 2 as strand ID 22. The successor strand starts after a delay (illustrated as three cycles) caused by inter-core communication latency. The successor strand then begins executing the trace corresponding to the fork target address.

The fork.skip uop encodes a propagation set (illustrated as the propagated_archreg_set field, dashed-box element 212) that specifies a bitmap of the architectural registers to be written by the terminal trace of the parent (the trace does not modify the other architectural registers). Execution of the successor strand stalls on the first read of an architectural register that is a member of the propagation set, unless the successor strand has previously written the register (in which case the successor strand thereafter reads its own private version of the register in place of the not-yet-propagated version from the parent).

With respect to the terminal trace of the parent strand, the uop format includes a mechanism to indicate that the result of a uop is to be propagated to the successor strand. In some embodiments, a VLIW bundle includes one or more "propagate" bits, each associated with one or more uops of the bundle.

When the strandware schedules and optimizes a terminal trace for skipahead, the strandware sets the propagate bit of a uop if and only if the uop is the last uop (with respect to the original program order of the uops of the trace) to write a particular architectural register A, thereby producing a live-out value. In some embodiments, the original program order differs from the execution order of the scheduled VLIW trace, and in other embodiments the orders are the same.
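The "last writer" rule for propagate bits can be expressed compactly. The sketch below is illustrative only, with assumed names and an assumed trace representation (a list of destination register names in original program order, with `None` for uops that write no architectural register). It also returns the set of written registers, corresponding in spirit to the propagation set encoded by the fork.skip uop.

```python
from typing import Dict, List, Optional, Set, Tuple


def assign_propagate_bits(dests: List[Optional[str]]) -> Tuple[List[bool], Set[str]]:
    """Return (propagate_bits, propagation_set) for a terminal trace.

    dests[i] is the architectural register written by uop i in original
    program order, or None if uop i writes no architectural register.
    """
    last_writer: Dict[str, int] = {}
    for i, dest in enumerate(dests):
        if dest is not None:
            last_writer[dest] = i   # later writers replace earlier ones
    bits = [False] * len(dests)
    for i in last_writer.values():
        bits[i] = True              # only the final write is a live-out
    return bits, set(last_writer)


bits, propagation_set = assign_propagate_bits(["rax", "rbx", None, "rax", "rsp"])
```

In this example the first write to `rax` is not propagated because a later uop overwrites it; only the final values leave the trace as live-outs.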

When a uop that targets architectural register A executes and the propagate bit of the uop is set, the output value V of the uop is sent to the successor strand S (of the current strand). Conceptually, the value is then written into the register file of strand S, so that any attempt to read architectural register A before a uop in strand S overwrites A with a new (locally generated) value receives value V. If an attempt to read live-in architectural register A has stalled successor strand S, then, because value V has arrived, strand S is unblocked and continues execution.

In some circumstances, the parent strand propagates a particular live-out architectural register before the successor strand reads that register. The register is written in the background into the register file of the successor strand (e.g. any of register files 194A.1 to 194A.4) and is not a source of stalls.

Architectural registers that are not members of the propagation set are not written by the terminal trace, and the successor strand therefore inherits the values those registers held at the beginning of the terminal trace. These values are propagated in the background into the register file associated with the successor strand.
If the successor strand accesses an inherited architectural register before the register has been propagated, the successor strand stalls.

Fig. 2 illustrates an example of this propagation. After fork uop 211 creates the successor strand, the first three bundles of the successor strand's first trace execute (in cycles 3, 4 and 5, respectively), because those bundles do not depend on any live-in registers (i.e. live-out registers from the parent strand's terminal trace). However, when bundle 283 attempts to execute during cycle 6, the bundle stalls, because it depends on live-in architectural registers that the parent strand's terminal trace has not yet produced. In cycle 9, bundle 269 of the parent strand's terminal trace computes the live-out values of those registers via uops 215 and 216, respectively, and propagates the values to the successor strand. The values arrive at the core executing successor strand 201 several cycles later (e.g. corresponding to the inter-core communication latency), and in cycle 12 the trace of successor strand 201 wakes up and executes bundles 284 and 285. When the next bundle of the successor strand attempts to read %rdi, the value is not available. The parent strand produces the live-out value in cycle 13 via uop 217 and propagates it to successor strand 201, where it arrives in cycle 16; bundle 286 then wakes up and executes. The figure also illustrates background propagation of some live-out architectural registers before the successor strand reads them (e.g. %rsp and %xmmh0, propagated by uops 213 and 214, respectively).

In some circumstances, the parent strand attempts to overwrite an architectural register before the value that the successor strand is to inherit has been sent to the successor strand.
In some embodiments, interlock hardware prevents the parent strand from overwriting the old value of a register until the old value is on its way to the successor. In some circumstances, the successor strand overwrites a live-in architectural register without reading it, before the parent strand has propagated the corresponding live-out value to the successor strand. In some embodiments, the successor strand notifies the parent strand that the successor is no longer waiting for the propagated register value, because the successor strand has a newer (locally generated) value.

Various mechanisms are used in various embodiments to propagate register values from the parent strand to the successor strand. Some embodiments use different propagation mechanisms and/or priorities for propagated live-out registers versus inherited registers. In some embodiments, register values are not copied. Instead, the successor strand uses a copy-on-write register caching mechanism to retrieve inherited and live-out values from the parent strand on demand. The mechanism uses copy-on-write functionality to prevent an inherited value from being overwritten by the parent strand before it is communicated to the successor strand, and to suppress propagation when the successor strand no longer depends on a value. In some embodiments, a register renaming mechanism is used to avoid copying actual values. The fork operation copies the rename table of the parent strand to the successor strand (rather than copying values), and the two strands share one or more physical registers until a strand overwrites one or more of those physical registers.

Speculative Strand Threading (SST)
SST Overview
The strandware partitions the target software into a plurality of independently executable strands to achieve increased parallelism, performance, or both. The strandware and the hardware operate together to dynamically profile the target software and detect relatively large regions of its control and data flow that have relatively few or no dependencies on one another. The strandware converts each region into a strand by inserting a fork point at the beginning and a join point/fork target at the end. The strands are program-ordered relative to each other and execute independently. In various embodiments, the hardware and the strandware continue to monitor and refine the selection of fork and join points based on real-time feedback from observing and profiling dynamic control flow and data dependencies, which in some usage scenarios yields improved performance, improved adaptability, and improved robustness.

Strand Scope Identification
In some speculative multithreading embodiments, a fork point produces two parallel strands: a new successor strand that begins executing at the fork target address in the target software, and the existing parent strand that continues executing (in the target software) after the fork point. The trace predictor and/or the branch predictor dynamically select the fork target. After a fork, the scope (e.g. lifetime) of the parent strand includes all code executed after the fork operation and before the execution path of the parent strand either reaches the initial start address of the successor strand or reaches some other limit. The strand profiling subsystem of the strandware derives the scope of each strand.
If the strandware identifies a loop for parallelization, both the fork point (where a fork operation is executed) and the fork target (where the successor strand begins executing) refer to the top of the loop, and the branch that terminates the loop bounds the scope of the parent strand. In scenarios with a conditional branch at the end of the loop (which jumps to the top of the loop for the next iteration), the terminating direction of the branch is not-taken. The strandware uses heuristics to identify terminating branches and directions based on the output of various compilers (e.g. GCC, Microsoft Visual Studio, Sun Studio, the PathScale compiler suite, and PGI). Compilers produce roughly equivalent control flow idioms for a given instruction set (e.g. x86). For example, the boundary of a loop is identified by finding any taken branch that skips to the basic block immediately following the basic block that jumps backward to the top of the loop for the next iteration.
Other terminating branches include return instructions and unconditional branches to addresses after the last basic block in the loop body.

Consider a call-return fork, in which the fork origin is immediately before a function call (e.g. before a CALL instruction) and the target address is immediately after the call instruction (i.e. at the return address). The scope of the parent strand is determined solely by the body of the called function, and is terminated by the intersection of the parent strand with the return address. Unless the program executes error code or exception handlers, function calls return to the call site relatively frequently.

Other, relatively more generalized types of forks exist, e.g. when the fork origin is at the start of a relatively large block of code and the fork target is after the end of the block. Internal branches within the block (i.e. within the scope of the parent strand) optionally exit the block and branch into the scope of the successor. The strandware identifies such internal branches and instruments them as terminating branches.

In various embodiments, the various structured programming cases (e.g. loops, calls, and returns) are handled as part of a more generalized control flow analysis technique. In some embodiments, terminating branches are found by performing a depth-first traversal of the basic blocks of the control flow graph, starting at the basic block that contains the fork origin and recursively following both the taken and the not-taken exits of each branch. In some usage scenarios, locating terminating branches is complicated by various conditions (e.g. branches not mapped into the address space, invalid or indeterminate branch targets, and other circumstances that make control flow changes difficult to determine). However, the strandware preserves the correctness of the target software even if it does not detect all terminating branches. Strandware operation is achieved without any knowledge of high-level program structure information (e.g. source code), and undetected terminating branches are tolerated.
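The depth-first search for terminating branches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the strandware's actual analysis: the control flow graph is a plain dictionary, the region that makes up the parent strand's scope is given as a set of block names, and the names `find_terminating_branches`, `cfg`, and `region` are all hypothetical.

```python
def find_terminating_branches(cfg, fork_origin, fork_target, region):
    """Depth-first traversal from the block containing the fork origin.

    cfg: dict mapping a block name to a list of (direction, successor) exits.
    region: set of block names assumed to lie inside the parent's scope.
    Returns a sorted list of (block, direction) exits that leave the scope
    without reaching the fork target.
    """
    terminating = []
    visited = set()
    stack = [fork_origin]
    while stack:
        block = stack.pop()
        if block in visited:
            continue
        visited.add(block)
        for direction, successor in cfg.get(block, []):
            if successor == fork_target:
                continue  # normal intersection with the successor's start
            if successor not in region:
                terminating.append((block, direction))
            else:
                stack.append(successor)  # follow taken and not-taken exits
    return sorted(terminating)


# A loop whose fork point and fork target are both the top of the loop.
cfg = {
    "top": [("fallthrough", "body")],
    "body": [("taken", "after_loop"), ("not_taken", "latch")],
    "latch": [("taken", "top"), ("not_taken", "after_loop")],
}
exits = find_terminating_branches(cfg, "top", "top", {"top", "body", "latch"})
```

For the loop in this example, the search reports the early exit from the loop body and the not-taken direction of the backward branch, which matches the loop case described earlier in this section.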
The strandware identifies and instruments terminating branches by injecting conditional kill uops into the traces that contain them. When the condition of a kill uop evaluates to true, the kill uop kills all successor strands of the strand that executes it. If the strand executing the kill uop is itself speculative, an alternative type of conditional kill uop kills the strand executing the kill uop as well as all of its successor strands (see the section "Bridge Traces and Live-In Register Prediction" located elsewhere herein). If a terminating basic block ends with a conditional branch uop (which compares registers R1 and R2 and is taken only if comparison condition cc is true), the strandware injects a matching kill uop, e.g. "kill.cc R1, R2, T". The kill uop specifies the same cc, R1, and R2 as the branch.

Nested Strands
To maintain fully deterministic execution of the target software, in some embodiments the strandware uses a strictly program-ordered, non-nested speculative multithreading model in which a parent strand P has at most one successor strand S1 (S1 optionally forks recursively to a successor strand S2, and so on). Some embodiments enable a strand to have a plurality of successor prefetch strands (optionally in addition to a single non-prefetch successor strand), because prefetch strands do not modify architectural state.

In some programs, P encounters another fork point before joining S1. To preserve deterministic behavior, fork points that P encounters while a successor strand exists are suppressed. To ensure that P eventually does join S1, the strandware and the hardware implement functions (e.g. timeouts) to detect and abort runaway strands, and the target software is re-analyzed for terminating branches to reduce or prevent future occurrences.
Each kill uop is tagged with a strand identifier, so that if the fork point for a strand is suppressed, any kill uops for that strand are suppressed as well.

To handle recursive functions, each strand maintains a private fork nesting counter (initialized to zero when the strand is created) that is incremented whenever a fork is suppressed. When the hardware processes a kill uop, the kill uop aborts strands only if the nesting counter of the strand is zero; otherwise the nesting counter is decremented and no strand is aborted.

Candidate Strand Selection
In some usage scenarios, some loops are good candidates for speculative multithreading (with one or a plurality of iterations per strand). In some embodiments, the hardware includes profiling logic units, and the strandware synthesizes instrumentation code (which interacts with the profiling logic units) to determine which loops are suitable for splitting into parallel strands. Each backward (loop) branch in the target software has a unique target physical address P that the strandware uses for identification and profiling.
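The fork nesting counter described at the start of this passage can be sketched with a small model. The class and method names are hypothetical, and the sketch assumes that a successful kill aborts the strand's successor; the text above does not spell out more than the counter rule itself.

```python
class Strand:
    """Toy model of the per-strand fork nesting counter (illustrative only)."""

    def __init__(self):
        self.nest_count = 0          # initialized to zero at strand creation
        self.has_successor = False

    def fork(self):
        """Return True if a successor is created, False if the fork is suppressed."""
        if self.has_successor:
            self.nest_count += 1     # suppressed fork: remember the nesting level
            return False
        self.has_successor = True
        return True

    def kill(self, condition):
        """Process a conditional kill uop; return True if a strand is aborted."""
        if not condition:
            return False
        if self.nest_count == 0:
            self.has_successor = False   # assumption: the successor is aborted
            return True
        self.nest_count -= 1             # kill belongs to a suppressed inner fork
        return False


strand = Strand()
results = [strand.fork(), strand.fork(), strand.kill(True), strand.kill(True)]
```

The second fork is suppressed and bumps the counter, so the first kill only decrements the counter; the second kill, seen at nesting level zero, is the one that aborts.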
The hardware filters out loops determined to be too small to optimize productively by tracking total cycles and iterations and applying strandware-tunable thresholds to both (e.g. the hardware filters out loops with fewer than 256 cycles per iteration). The hardware allocates to each relatively large loop a Loop Profile Counter (LPC) indexed by P. The LPC holds the total cycles, the iterations, confidence estimators, and other information relevant to determining whether the loop is a good candidate for optimization. The strandware periodically reviews the LPCs to identify strand candidates. The strandware manages the LPCs. In various embodiments, one or more of the LPCs are cached in hardware and/or stored in memory.

In some embodiments, similar techniques are used for other types of candidate strands (e.g. called functions). For calls, a set of Call Profile Counters (CPCs) is optionally used to record various statistics, such as the number of cycles spent in the called function, which registers are modified, the most likely return value, and other information that may be useful in determining whether the strand is a good candidate for optimization.

Strand Nesting Graph Construction
In some embodiments, the strandware dynamically constructs one or more data structures that represent the relationships among regions of the target code known to the strandware as strands or candidate strands. The strandware uses these structures to track the nesting of strands within one another.
For example, a plurality of nested loops (e.g. an inner loop and an outer loop) gives rise to nested strands. The body of a strand optionally contains a nested function call (the called function itself being a strand) or one or more loops. In some embodiments, the strandware adds instrumentation code to uops already present in the translation cache, so that the strand nesting data structures are updated at run time as the translated code executes. In some embodiments, the hardware includes logic that assists the strandware in discovering strand nesting relationships.

Based on the strand nesting hierarchy represented in the strand nesting data structures, the strandware uses heuristics to select relatively more effective regions of code to convert into strands, and instruments each selected strand for the further profiling described below. In some embodiments, the heuristics include one or more techniques for selecting an appropriate strand from among nested inner and outer loops.

Instrumentation for Profiling
Based on the fork origin, the fork target, and the set of terminating branches with their respective directions, the strandware injects instrumentation into the uop-based translation of the target software (e.g. as stored in the translation cache) to form a completely and correctly delimited strand. In some embodiments, the strandware injects a profiling fork into the trace that contains the basic block at the fork origin. The profiling fork directs the hardware to create a profiling strand, e.g. as described in the sections "Parent Strand Profiling" and "Successor Strand Profiling" located elsewhere herein. The strandware identifies and instruments the traces that contain each terminating branch, as described in the section "Strand Scope Identification" located elsewhere herein.
Parent Strand Profiling
After instrumentation for profiling, the next time the trace containing the fork point executes, a profiling strand is created as the successor of the parent strand. The profiling strand blocks until the parent strand intersects the start address of the profiling strand. The profiling strand then begins executing, while the parent strand blocks. When the profiling strand completes (e.g. via an intersection, a terminating branch, or another fork), the parent strand is unblocked and joins the profiling strand. The hardware invokes the strandware to complete strand construction, as described below.

After executing a profiling fork, the hardware enters a special profiling mode to execute the remainder of the parent strand. For each occurrence of certain events in the parent strand, a Strand Execution Profiling Record (SEPR) is generated and written to a memory buffer that the strandware allocates to hold the SEPRs produced by the parent strand. In some preferred embodiments, SEPRs are written when certain types of memory accesses (loads and stores) are performed. In some embodiments, additional SEPRs are written, e.g. by recording the execution of basic blocks, traces, control flow changes, or similar data, so that the strandware can later reconstruct the exact code sequence executed by the strand.

Successor Strand Profiling
The parent strand blocks upon completion, while the successor (profiling) strand executes and register and memory dependencies are identified.

With respect to register dependencies, as the successor strand executes, the hardware updates a per-strand bitmask whenever the successor strand first reads an architectural register before the successor strand has written that register. The bitmask represents the live-outs from the parent strand that are used as live-ins by the successor strand.

With respect to memory dependencies, in some embodiments a transactional memory version management system implements speculation within the data cache. As a speculative strand loads data, the hardware places reservations on the memory locations at cache line (or byte-level) granularity. The hardware tracks reservations by updating a bitmap of which bytes (or chunks of multiple bytes) the speculative strand has loaded. The hardware optionally tracks metadata, e.g. which specific future strands have loaded a memory location. The hardware stores the bitmap with the cache line and/or in a separate structure. The loaded data comes from the latest of all strands that wrote the location earlier (in program order) than the loading strand. In some circumstances, the earliest strand is the architectural strand (e.g. when the line is clean). In some circumstances, the earliest strand is a speculative strand earlier than the loading strand (e.g. when the line is dirty).

When a strand writes to a cache line, the hardware checks whether any future strand holds a reservation on the cache line. If so, the hardware has detected cross-strand aliasing, and the hardware aborts the future strand and any later strands. Alternatively, the hardware notifies the strandware of the aliasing, which enables the strandware to implement a flexible, software-defined policy for aborting strands.

Because the hardware serializes the profiling strand so that it begins executing only after the parent strand has completed, cross-strand aliasing does not occur. The hardware executes all loads and stores in program order (with respect to strand order and to the order of the uops within each strand), and the reservation hardware is therefore free to be used for other purposes.
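The register live-in bitmask described under successor strand profiling can be sketched as a read-before-write scan. This is an illustrative software model of behavior the text attributes to hardware; the function name `live_in_mask`, the access-list format, and the register-to-bit assignment are assumptions made for the example.

```python
def live_in_mask(accesses, reg_index):
    """Compute a bitmask of registers that are read before being written.

    accesses: list of ("read" | "write", register) in execution order.
    reg_index: dict mapping a register name to its bit position (assumed).
    """
    mask = 0
    written = set()
    for kind, reg in accesses:
        if kind == "write":
            written.add(reg)
        elif reg not in written:
            # First read of a register the successor has not yet written:
            # the value must come from the parent, so it is a live-in.
            mask |= 1 << reg_index[reg]
    return mask


reg_index = {"rax": 0, "rdi": 1, "rsp": 2}
accesses = [("read", "rdi"), ("write", "rax"), ("read", "rax"),
            ("read", "rsp"), ("write", "rdi"), ("read", "rdi")]
mask = live_in_mask(accesses, reg_index)
```

Here `rax` is written locally before it is read, so it does not appear in the mask, while `rdi` and `rsp` are read first and therefore count as live-ins.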
While the hardware is in profiling mode, in some embodiments the system uses the memory reservation hardware to analyze cross-strand data forwarding (e.g. strand memory forwarding) for the profiling strand. For a loop, profiling ends when execution of the profiling strand reaches the top of the loop. Other types of forks (e.g. call forks or generalized forks) have potentially unbounded scopes, and the system therefore uses heuristics to limit the profiling scope. When the profiling strand has completed execution, the parent strand is unblocked, and at the join the strandware constructs the instrumentation needed for a fully speculative strand.

Dataflow Graph Construction via SEPR Processing
Using the program-ordered SEPR data previously collected by the system, the strandware builds a dataflow graph (DFG) that begins with the live-outs of the parent strand as root nodes. As described above, while executing the parent strand the hardware maintains a program-ordered SEPR list as a record of which traces and/or basic blocks the hardware executed, and also holds the cache tags and index data of the relevant loads and stores. Using this record, the strandware decodes each basic block in each executed trace into a stream of program-ordered uops. To construct the DFG, uop operands are converted into pointers to earlier uops using a register rename table.

To track memory dependencies, the strandware maintains a memory rename table that maps each cache location to the latest store operation that wrote to that address. Loads and stores thus optionally specify a previous store as a source operand. The strandware uses the cache locations recorded in the SEPRs together with the memory rename table to include memory dependencies in the DFG.
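The two rename tables described above can be sketched as follows. The uop representation (dictionaries with `op`, `dest`, `srcs`, and an optional `addr`) and the function name `build_dfg` are hypothetical; real SEPR data and uop encodings are not specified in this passage, so this is only a model of the bookkeeping.

```python
def build_dfg(uops):
    """Build dependency edges for program-ordered uops.

    Returns (deps, reg_rename, mem_rename), where deps[i] lists the indices
    of the earlier uops that uop i depends on.
    """
    reg_rename = {}   # architectural register -> index of latest producer
    mem_rename = {}   # memory address -> index of latest store
    deps = []
    for index, uop in enumerate(uops):
        # Register operands become pointers to the earlier producing uops.
        edges = [reg_rename[reg] for reg in uop["srcs"] if reg in reg_rename]
        addr = uop.get("addr")
        if addr is not None and addr in mem_rename:
            edges.append(mem_rename[addr])  # previous store as a source operand
        deps.append(edges)
        if uop["op"] == "store":
            mem_rename[addr] = index
        if uop.get("dest") is not None:
            reg_rename[uop["dest"]] = index
    return deps, reg_rename, mem_rename


uops = [
    {"op": "mov", "dest": "rax", "srcs": []},
    {"op": "store", "dest": None, "srcs": ["rax"], "addr": 0x100},
    {"op": "load", "dest": "rbx", "srcs": [], "addr": 0x100},
    {"op": "add", "dest": "rax", "srcs": ["rax", "rbx"]},
]
deps, reg_rename, mem_rename = build_dfg(uops)
```

After the last uop is processed, the two tables point at the latest producer of each register and memory location, which is how the roots of the graph are found in this model.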
At the end of the process, all uops executed in the parent strand have been incorporated into the dataflow graph, and the current register rename table and memory rename table point to the root nodes (live-outs) of the graph.

Bridge Traces and Live-In Register Prediction
The live-in set of a speculative successor strand (e.g. the final live-outs of the parent strand) is predicted from the architectural register values that exist when the parent strand forks. The strandware searches the DFG depth-first from each live-out (both registers and memory) to obtain the subset of uops that produces that live-out. The collection of all such subsets, in program order, is the live-out generating set.

The strandware builds a bridge trace that starts from the architectural register and memory values at the fork point in the parent strand, and that includes only the part of the live-out generating set needed to predict the final live-outs (as indicated by the live-in bitmask of the speculative successor strand). The bridge trace also copies any live-out register predictions to a memory buffer; the system later uses these copies to detect mispredictions.

When a trace forks a speculative strand, the strandware sets up the new strand to begin executing at the bridge trace rather than at the first uop of the speculative strand. In addition to handling register dependencies, the bridge trace converts any terminating branches (and the related uops that compute the branch conditions) into uops that abort the speculative strand. Finally, the bridge trace sets up various registers for the strand, e.g. pointers to the predicted memory value list and the deferral list, together with an unconditional branch to the start of the speculative strand.

Bridge Trace Optimization
Once the strandware has constructed a bridge trace, the strandware attempts to reduce or minimize its length using various dynamic optimization techniques.
Some idioms (such as spilling and filling registers, or multiple calls and returns within a strand) sometimes leave behind loads from and stores to the stack that have not been eliminated. Similarly, the stack pointer or another register is sometimes repeatedly incremented or decremented, so that the dependency chain amounts to adding a constant. Strandware recognizes at least some of these idioms and patterns and optimizes the dependency chains down to relatively few operations. For example, strandware uses store-load-use short-circuiting, in which the data a load reads from a store is speculatively replaced by the value that the earlier store wrote (with the register and memory predictions at the join point validating the speculation). If strandware is unable to reduce the bridge trace to a predetermined or programmable length, strandware abandons the strand optimization. Abandonment occurs in a variety of circumstances, such as when there is a true cross-strand register dependency, or when a live-out is computed relatively late in the parent strand and consumed relatively early in the successor strand (thus resulting in a relatively long dependency chain).

Memory Value Prediction

For some strands, the bridge trace predicts memory values in addition to register values. Strandware uses the cross-strand store-to-load forwarding data gathered during profiling (for example, as explained elsewhere herein in the discussion of strand profiling) to determine which memory locations are affected. In some embodiments, strandware accesses the tags and metadata of the hardware data cache to construct a list of cache locations that are forwarded across strands.

Strandware looks up each affected memory location in the memory renaming table for the DFG. The table points to the most recent store micro-op (in program order) to write to that location. Strandware then constructs a subgraph of the micro-ops necessary to generate the value of the store micro-op (for example, using a depth-first search).
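The lookup and subgraph construction just described can be sketched as follows. This is an illustrative sketch only, not the implementation of any embodiment: the names `MicroOp`, `backward_slice`, and `bridge_ops_for_memory` are hypothetical, the data flow graph is assumed to be a dictionary keyed by program-order sequence number, and the memory renaming table is assumed to be a dictionary from address to the sequence number of the most recent store.

```python
from dataclasses import dataclass, field


@dataclass
class MicroOp:
    seq: int                                     # position in program order
    text: str                                    # mnemonic, for display only
    inputs: list = field(default_factory=list)   # seq numbers of producer micro-ops


def backward_slice(dfg, roots):
    """Depth-first walk over input edges; returns reachable micro-ops in program order."""
    visited = set()
    stack = list(roots)
    while stack:
        seq = stack.pop()                # LIFO order makes the walk depth-first
        if seq in visited:
            continue
        visited.add(seq)
        stack.extend(dfg[seq].inputs)
    return sorted(visited)


def bridge_ops_for_memory(dfg, memory_renaming_table, forwarded_addresses):
    """Collect the micro-ops needed to reproduce the last store to each forwarded address."""
    roots = [memory_renaming_table[addr]
             for addr in forwarded_addresses
             if addr in memory_renaming_table]
    return backward_slice(dfg, roots)


# Hypothetical parent-strand trace: two independent store chains.
dfg = {
    0: MicroOp(0, "mov r1, 16"),
    1: MicroOp(1, "add r2, r1, 8", inputs=[0]),
    2: MicroOp(2, "mov r3, 7"),
    3: MicroOp(3, "store [0x1000], r2", inputs=[1]),
    4: MicroOp(4, "store [0x2000], r3", inputs=[2]),
}
memory_renaming_table = {0x1000: 3, 0x2000: 4}

# Only the chain feeding the store to 0x1000 is selected.
selected = bridge_ops_for_memory(dfg, memory_renaming_table, [0x1000])
```

Sorting the visited set restores program order, which matches the statement that the subsets are combined in program order before being placed in the bridge trace.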
Strandware includes these micro-ops in the bridge trace along with any other micro-ops used to generate register value predictions. The store micro-ops in the bridge trace decouple stores in the parent strand from subsequent successor strands (a successor strand instead loads the prediction from the bridge trace). Finally, strandware copies information about each predicted store into a per-strand store prediction validation list, which is later compared with the actual store values to validate the speculation. In various embodiments, the information includes the physical address of the store, the value stored, and a mask of the bytes written by the store (or alternatively, the size of the store in bytes and its offset).

Join Handler Trace

Each speculative strand constructed by strandware has a matching bridge trace and join handler trace. The join handler trace validates exactly the register and/or memory value predictions that were actually used (for example, ignoring predictions made by the bridge trace that went unused). When the parent strand ends (for example, by reaching the intersection with the successor strand, or through another event), hardware redirects the successor strand to begin executing the join handler defined for the parent strand. For each register value prediction used, the join handler reads the predicted value from the memory buffer (for example, as described elsewhere herein in the section "Bridge Trace and Live-In Register Prediction") and compares the predicted value with the live-out value from the parent strand. The hardware includes "pass-through" register read and memory load functions, which enable the join trace to read both its own state (for example, registers and memory) and the corresponding state of the parent strand for comparison.
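The per-strand store prediction validation list described earlier in this section, and its later comparison against actual store values, can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any embodiment: the names `StorePrediction`, `mask_from_size_offset`, `expand_byte_mask`, `prediction_matches`, and `validate` are hypothetical, each entry is assumed to cover one aligned 8-byte word, bit i of the mask is assumed to select byte i counting from the least significant byte, and the parent strand's memory is modeled as a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorePrediction:
    phys_addr: int   # physical address of the aligned 8-byte word (assumed width)
    value: int       # predicted contents of that word
    byte_mask: int   # bit i set means byte i was written by the predicted store


def mask_from_size_offset(size, offset):
    """Alternative encoding from the text: store size in bytes plus byte offset."""
    return ((1 << size) - 1) << offset


def expand_byte_mask(byte_mask):
    """Turn an 8-bit byte mask into a 64-bit mask of the selected bits."""
    bits = 0
    for i in range(8):
        if byte_mask & (1 << i):
            bits |= 0xFF << (8 * i)
    return bits


def prediction_matches(pred, actual_word):
    """Compare only the bytes that the predicted store actually wrote."""
    bits = expand_byte_mask(pred.byte_mask)
    return (pred.value & bits) == (actual_word & bits)


def validate(predictions, parent_memory):
    """Return True only if every predicted store agrees with the parent strand's value."""
    for pred in predictions:
        if pred.phys_addr not in parent_memory:
            return False
        if not prediction_matches(pred, parent_memory[pred.phys_addr]):
            return False
    return True


# Hypothetical entry: a 4-byte store to the low half of the word at 0x1000.
pred = StorePrediction(0x1000, 0x1122334455667788, mask_from_size_offset(4, 0))
```

Masking before comparing means bytes that the predicted store did not write cannot cause a spurious mismatch, which is the reason a byte mask (or an equivalent size and offset) is kept with each entry.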
Some embodiments compare only the registers read by the successor strand. Similarly, to validate memory value predictions, the join trace iterates through the list of predicted stores that were used (in various embodiments, including one or more of a physical address, a value, and a byte mask for each entry), and compares each predicted store value with the locally produced live-out value of the parent strand at the same physical address. If the system detects any mismatch, the system aborts the successor strand and the parent strand continues past the join point as if the system had not forked the successor strand. If the join is successful, the system discards the parent strand and the successor strand becomes the new architectural strand for the corresponding VCPU.

Conclusion

Certain choices have been made in the description merely for convenience in preparing the text and drawings, and unless there is an indication to the contrary, the choices should not be construed per se as conveying additional information regarding the structure or operation of the embodiments described. Examples of the choices include: the particular organization or assignment of the designations used for the figure numbering, and the particular organization or assignment of the element identifiers (for example, the callouts or numerical designators) used to identify and reference the features and elements of the embodiments.

The words "includes" or "including" are specifically intended to be construed as abstractions describing logical sets of open-ended scope and are not meant to convey physical containment unless explicitly followed by the word "within".

Although the foregoing embodiments have been described in some detail for purposes of clarity of description and understanding, the invention is not limited to the details provided. There are many embodiments of the invention. The disclosed embodiments are exemplary and not restrictive.
It will be understood that many variations in construction, arrangement, and use are possible consistent with the description, and are within the scope of the claims of the issued patent. For example, interconnect and function-unit bit-widths, clock speeds, and the type of technology used are variable according to various embodiments in each component block. The names given to interconnect and logic are merely exemplary, and should not be construed as limiting the concepts described. The order and arrangement of flowchart and flow diagram process, action, and function elements are variable according to various embodiments. Also, unless specifically stated to the contrary, value ranges specified, maximum and minimum values used, or other particular specifications (such as the ISA, the number of cycles, and the number of entries or stages in registers) are merely those of the described embodiments, are expected to track improvements and changes in implementation technology, and should not be construed as limitations.

Functionally equivalent techniques known in the art are employable instead of those described to implement various components, sub-systems, functions, operations, in-line routines, sub-routines, routines, procedures, macros, or portions thereof. It is also understood that many functional aspects of embodiments are selectively realizable in either hardware (i.e., generally dedicated circuitry) or software (i.e., via some manner of programmed controller or processor), as a function of embodiment-dependent design constraints and technology trends of faster processing (facilitating migration of functions previously in hardware into software) and higher integration density (facilitating migration of functions previously in software into hardware).

Specific variations in various embodiments include, but are not limited to: differences in partitioning; different form factors and configurations; use of different operating systems and other system software; use of different interface standards, network protocols, or communication links; and other variations to be expected when implementing the concepts described herein in accordance with the unique engineering and business constraints of a particular application.

The embodiments have been described with detail and environmental context well beyond that required for a minimal implementation of many aspects of the embodiments described.
Those of ordinary skill in the art will recognize that some embodiments omit disclosed components or features without altering the basic cooperation among the remaining elements. It is thus understood that many of the details disclosed are not required to implement various aspects of the embodiments described. To the extent that the remaining elements are distinguishable from the prior art, components and features that are omitted do not limit the concepts described herein.

All such variations in design are insubstantial changes over the teachings conveyed by the described embodiments. It is also understood that the embodiments described herein have broad applicability to other applications, and are not limited to the particular application or industry of the described embodiments. The invention is thus to be construed as including all possible modifications and variations encompassed within the scope of the claims of the issued patent.

[Brief Description of the Drawings]

Fig. 1A illustrates a system with strand-enabled computers, each having one or more strand-enabled microprocessors with access to a strandware image, memory, non-volatile storage, and input/output devices.

Figs. 1B and 1C collectively illustrate conceptual hardware, strandware (software), and target software layers (for example, subsystems) relating to a strand-enabled microprocessor.

Figs. 2A, 2B, and 2C collectively illustrate an example of hardware executing a skipahead strand (for example, as synthesized by strandware), plotted against time (in cycles) and interconnect. The description sometimes refers to Figs. 2A, 2B, and 2C collectively as Fig. 2.
[Description of Main Reference Numerals]

101                            (x86) target software layer
102                            operating system kernel
103.1 to 103.4                 applications
104.1 to 104.6                 VCPUs
110, 110A, 110B                strandware layer
111                            translation cache management
115                            x86 binary translation
120                            trace profiling and capture
121                            physical page profiling
124                            branch profiling
125                            predictive optimizer
127                            memory profiling
130                            promotion
140                            strand construction
160                            scheduling and optimization
162                            memory alias analysis
163                            optimization
165                            schedule micro-ops
167                            encode VLIW-like bundles
172                            hardware control
174                            virtual devices
175                            interrupts, SMP, and timers
181                            profiling hardware
182                            hardware acceleration unit
183                            transactional memory
184A                           external system/strandware DRAM
186                            chipset/PCIe bus interface
187                            hardware x86 decoder
190                            hardware layer
191.1 to 191.4                 VLIW cores
192A.1 to 192A.4               ALUs
192B.1 to 192B.4               FPUs
193.1 to 193.4                 L1 D-cache
194A.1 to 194A.4               register files
194B.1 to 194B.4               strand contexts
195                            multicore interconnect network
196                            L2/L3 cache
197                            DRAM controllers and northbridge
198                            multi-socket system interconnect
199                            PCI Express, QPI, HyperTransport
200                            parent strand
201                            successor strand
211                            fork.skip micro-op
212                            propagate set
213, 214, 215, 216, 217        micro-ops
269, 280, 281, 282, 284, 285   bundles
2000.1, 2000.2                 strand-enabled computers
2001.1, 2001.2                 strand-enabled microprocessors
2002.1, 2002.2                 dynamic random access memory elements
2002.1A                        strandware data
2002.1B                        translation cache
2003                           flash memory
2004                           strandware image
2005                           keyboard/display
2006                           peripherals
2009                           network
2010                           storage
2011.1                         profiling unit
2012.1                         strand management unit
2013.1                         VLIW cores
2014.1                         transactional memory
2050                           coupling
2051.1, 2051.2                 couplings
2052.1                         coupling
2053                           coupling
2055                           coupling
2056                           coupling
2063                           coupling
2064                           coupling

Claims

X. Scope of patent application:

1. A system, comprising:
a strand construction component for dynamically constructing, in a profile-directed manner, a strand-partitioned thread for each of at least one processed thread of one or more threads, wherein the strand construction component is at least partly implemented [...], and wherein each strand-partitioned thread comprises a respective plurality of strand images;
an execution component for processing the one or more threads; and
wherein, for each processed strand-partitioned thread, the execution component is enabled to concurrently execute a plurality of strands within the thread corresponding to two or more of the respective plurality of strand images.

2. The system of claim 1, wherein, for each processed strand-partitioned thread:
the plurality of strands within the thread comprises an architectural strand of the strand-partitioned thread and at least one or more successor strands;
the architectural strand updates an architectural strand context containing the architectural state of the strand-partitioned thread; and
each successor strand is younger than the architectural strand of the strand-partitioned thread, and each successor strand updates a respective successor strand context containing a speculative version of the architectural state of the strand-partitioned thread.

3. The system of claim 1, wherein the execution component is further enabled to concurrently execute a plurality of strands comprising a respective architectural strand associated with each of the processed strand-partitioned threads, each architectural strand updating a respective architectural strand context containing a respective architectural state.
4. The system of claim 3, wherein:
for each processed strand-partitioned thread, the plurality of strands within the thread comprises the architectural strand of the strand-partitioned thread and at least one or more successor strands;
each successor strand is younger than the architectural strand of the strand-partitioned thread;
each successor strand updates a respective successor strand context containing a speculative version of the architectural state of the strand-partitioned thread; and
the contexts are stored in one or more dedicated context caches within a microprocessor.

5. The system of claim [...], wherein the concurrently executed plurality of strands within the thread are executed on a single core of a microprocessor.

6. The system of claim [...], wherein the plurality of strands within the thread are concurrently executed [...].

7. The system of claim [...], wherein the strand construction component is implemented as one or more of executable code and microcode.

8. The system of claim 7, wherein the one or more of the executable code and the microcode are at least partially stored in one or more non-volatile storage devices.

9. The system of claim 1, further comprising:
a memory comprising one or more DRAM devices;
wherein a portion of the memory is allocated to the strand construction component.

10. The system of claim 1, further comprising:
a component enabled to decode at least one type of [...] and at least one type of [...].

11. The system of claim 2, further comprising:
a context store; and
dedicated strand logic, coupled to the context store, and enabled to perform a plurality of hardware-assisted merges for at least one processed strand-partitioned thread of the one or more threads.

12. The system of claim 2, further comprising:
a context store; and
[...] a context corresponding to at least one of the strands of at least one processed strand-partitioned thread to the context store.

13. The system of claim 2, further comprising:
a transactional memory comprising dedicated transactional memory storage and dedicated transactional memory control logic, the transactional memory being enabled to implement, for at least one processed strand-partitioned thread of the one or more threads, hardware-assisted version management of memory data, wherein each of a plurality of memory locations holds a plurality of data versions, and wherein, for each of the plurality of memory locations, a first one of the versions corresponds to the architectural strand and at least one second one of the versions respectively corresponds to at least one of the successor strands.

14. The system of claim 1, further comprising:
an analysis component for identifying one or more latency dependencies that arise among the plurality of concurrently executed strands within the thread and that correspond to respective cross-strand operations aliasing one or more respective memory locations;
a component for removing the one or more latency dependencies by employing one or more respective deferred operations in place of the cross-strand operations;
a component for evaluating each of the deferred operations performed by the plurality of concurrently executed strands within the thread;
wherein the identifying and the replacing are enabled to operate dynamically during the processing of the processed strand-partitioned threads; and
wherein, for the processed strand-partitioned threads, the results achieved by concurrently executing the strands within the thread match the results specified for a strictly sequential processing architecture.
15. A method, comprising:
dynamically constructing a strand-partitioned thread for each of at least one processed thread of one or more threads, wherein the dynamic constructing is at least partly implemented [...], and wherein each strand-partitioned thread comprises a respective plurality of strand images; and
for each processed strand-partitioned thread, concurrently executing a plurality of strands within the thread corresponding to two or more of the respective plurality of strand images.

16. The method of claim 15, wherein:
for each processed strand-partitioned thread, the plurality of strands within the thread comprises an architectural strand of the strand-partitioned thread and at least one or more successor strands, each successor strand being younger than the architectural strand;
for each processed strand-partitioned thread, the architectural strand updates an architectural strand context containing the architectural state of the strand-partitioned thread; and
for each processed strand-partitioned thread, each successor strand updates a respective successor strand context containing a speculative version of the architectural state of the strand-partitioned thread.

17. The method of claim 15, further comprising concurrently executing a plurality of strands comprising a respective architectural strand associated with each of the processed strand-partitioned threads, each architectural strand updating a respective architectural strand context containing a respective architectural state.

18. The method of claim 15, wherein the one or more threads comprise all or any portion of an application executed in a user mode and/or an operating system kernel executed in a privileged mode.
19. The method of claim 15, wherein the one or more threads comprise all or any portion of a virtual machine monitor and/or one or more operating system kernels managed by the virtual machine monitor.

20. The method of claim 15, wherein the one or more threads are based on at least a first instruction set architecture and the pluralities of strand images are based on a second instruction set architecture.

21. The method of claim 15, wherein the dynamic constructing is automatic and is unobservable by the one or more threads.

22. The method of claim 15, further comprising executing each of the concurrently executed plurality of strands within the thread on a core of a microprocessor.

23. The method of claim 2, further comprising executing each of the concurrently executed plurality of strands within the thread on a respective core of the microprocessor.