TW200935303A - Strand-based computing hardware and dynamically optimizing strandware for a high performance microprocessor system - Google Patents

Strand-based computing hardware and dynamically optimizing strandware for a high performance microprocessor system

Info

Publication number
TW200935303A
Authority
TW
Taiwan
Prior art keywords
string
thread
strings
execution
threads
Application number
TW97148039A
Other languages
Chinese (zh)
Inventor
Matt T Yourst
Original Assignee
Strandera Corp
Application filed by Strandera Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Advance Control (AREA)

Abstract

Strand-based computing hardware and dynamically optimizing strandware are included in a high performance microprocessor system. The system operates in real time, automatically and unobservably, to parallelize single-threaded software into a plurality of parallel strands for execution by cores implemented in a multi-core and/or multi-threaded microprocessor of the system. The microprocessor executes a native instruction set tailored for speculative multithreading. The strandware directs hardware of the microprocessor to collect dynamic profiling information while executing the single-threaded software. The strandware analyzes the profiling information for the parallelization, and uses binary translation and dynamic optimization to produce native instructions that are stored in a translation cache. The translation cache is later accessed to execute the produced native instructions in place of some of the single-threaded software. The system is capable of parallelizing a plurality of single-threaded software applications (e.g., application software, device drivers, operating system routines or kernels, and hypervisors).
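
The translate-once, reuse-later role of the translation cache can be illustrated with a minimal Python sketch. This is not the patented implementation: the class name `TranslationCache`, the helper `toy_translate`, and the use of a list of strings to stand in for native instructions are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: native code is modelled as a list of strings.
NativeCode = List[str]


class TranslationCache:
    """Maps a target-code address to previously produced native code."""

    def __init__(self, translate: Callable[[int], NativeCode]) -> None:
        self._translate = translate            # stand-in for binary translation
        self._entries: Dict[int, NativeCode] = {}
        self.misses = 0                        # number of translations performed

    def lookup(self, target_address: int) -> NativeCode:
        # Translate only on the first visit; later visits reuse the stored entry.
        if target_address not in self._entries:
            self.misses += 1
            self._entries[target_address] = self._translate(target_address)
        return self._entries[target_address]


def toy_translate(target_address: int) -> NativeCode:
    # Placeholder "translation"; a real system would emit optimized native code.
    return [f"native-op-for-{target_address:#x}"]


cache = TranslationCache(toy_translate)
first = cache.lookup(0x1000)    # translated and stored
second = cache.lookup(0x1000)   # served from the cache, no second translation
```

In this sketch the second lookup returns the stored object without calling the translator again, which mirrors the abstract's statement that produced native instructions are stored and accessed later.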

Description

IX. Description of the Invention

[Technical Field]

Advancements in computer processing are needed to provide improvements in performance, efficiency, and utility of use.

[Prior Art]

Unless expressly identified as being publicly or well known, mention herein of techniques and concepts, including for context, definitions, or comparison purposes, should not be construed as an admission that such techniques and concepts are previously publicly known or otherwise part of the prior art. All references cited herein (if any), including patents, patent applications, and publications, are hereby incorporated by reference in their entireties, whether specifically incorporated or not, for all purposes.

[Summary of the Invention]

The invention may be implemented in numerous ways, including as a process, an article of manufacture, an apparatus, a system, and a computer readable medium (e.g., media in an optical and/or magnetic mass storage device such as a disk, or an integrated circuit having non-volatile storage such as flash storage). In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. The Detailed Description sets forth one or more embodiments of the invention that enable improvements in performance, efficiency, and utility of use in the field identified above. As is discussed in more detail in the Conclusions, the invention encompasses all possible modifications and variations within the scope of the issued claims.

[Detailed Description]

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures illustrating selected details of the invention. The invention is described in connection with the embodiments. The embodiments herein are understood to be merely exemplary; the invention is expressly not limited to or by any or all of the embodiments herein, and the invention encompasses numerous alternatives, modifications, and equivalents. To avoid monotony in the exposition, a variety of word labels (including but not limited to: first, last, certain, various, further, other, particular, select, some, and notable) may be applied to separate sets of embodiments; as used herein, such labels are expressly not meant to convey quality or any form of preference or prejudice, but merely to conveniently distinguish among the separate sets. The order of some operations of disclosed processes is alterable within the scope of the invention. Wherever multiple embodiments serve to describe variations in process, method, and/or program instruction features, other embodiments are contemplated that, in accordance with a predetermined or dynamically determined criterion, perform static and/or dynamic selection of one of a plurality of modes of operation corresponding respectively to a plurality of the multiple embodiments. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. The details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of the details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Overview

Terminology

Various terms are used in the disclosure herein. Examples of at least some of the terms follow.
An example of a thread is a software abstraction of a processor, e.g., a dynamic sequence of instructions that share and execute upon the same architectural machine state (e.g., software-visible state). Some (so-called single-threaded) processors are enabled to execute one sequence of instructions on one architectural machine state at a time. Some (so-called multithreaded) processors are enabled to execute N sequences of instructions on N architectural machine states at a time. In some systems, an operating system creates, destroys, and schedules threads on available hardware resources.

An example of a strand is an abstraction of processor hardware, e.g., a dynamic sequence of uops (e.g., micro-operations directly executable by the processor hardware) that share and execute upon the same machine state. For some strands, the machine state is architectural machine state (e.g., architectural register state), and for some strands the machine state is not visible to software (e.g., renamed register state or performance analysis registers). In some embodiments, a strand is visible to an operating system if the machine state of the strand includes all of the architectural machine state of a thread (e.g., general purpose registers, software accessible machine state registers, and memory state). In some embodiments, a strand is not visible to an operating system even if the machine state of the strand includes all of the architectural machine state of a thread.
An example of an architectural strand is a strand that is visible to the operating system and corresponds to a thread. An example of a speculative strand (e.g., a successor strand) is a strand that is not visible to the operating system. Some strands contain only hidden machine state (e.g., prefetch or profiling strands).

In some embodiments, strandware and/or processor hardware create, destroy, and schedule strands. In some embodiments, strands are created by forks. Some forks are in response to a uop (micro-operation) of a parent strand that specifies a target address (for the strand created by the fork) and optionally specifies other information (e.g., data to be inherited as machine state). When the uop of the (parent) strand is executed, a speculative successor strand is optionally created.

In various embodiments and/or usage scenarios, strands are destroyed in response to one or more of a kill uop, an unrecoverable error, and completion of the strand (e.g., via a join). In some embodiments and/or usage scenarios, strands are joined in response to a join uop. In some embodiments and/or usage scenarios, strands are joined in response to a set of hardware-detected conditions (e.g., a current execution address matching a starting address of a successor strand). In various embodiments, strands are destroyed by any combination of strandware and/or hardware (e.g., in response to processing a uop, or automatically in response to predetermined or programmatically specified conditions). In some usage scenarios, strands are joined by merging some machine state of a parent architectural strand with machine state of a successor strand of the parent; the parent is then destroyed and the child strand optionally becomes an architectural strand.

An example of a Virtual Central Processing Unit (VCPU) is a software-visible execution context on which an operating system is enabled to schedule one thread at any given time.
In some embodiments, a computer system presents one or more VCPUs to the operating system. Each VCPU implements a register portion of the architectural machine state, and in some embodiments, architectural memory state is shared between one or more VCPUs. Conceptually, each VCPU comprises one or more strands dynamically created by strandware and/or hardware. For each VCPU, the strands are arranged in a first-in, first-out (FIFO) queue, where the next strand to commit is the architectural strand of the VCPU, and all other strands are speculative.

Multi-Core, Multithreading, and Speculation

Microprocessors, Multi-Core, and Multithreading

Performance of microprocessors has grown since the introduction of the first microprocessor in the 1970s. Some microprocessors have deep pipelines and/or operate at multi-GHz clock frequencies to extract performance from sequential programs using a single processor. Software engineers write some programs as a sequence of instructions and operations that a microprocessor is to execute sequentially and/or in order. Various microprocessors attempt to increase performance of the programs by operating at an increased clock frequency, executing instructions out of order (OOO), executing instructions speculatively, or various combinations thereof. Some instructions are independent of other instructions, thus providing instruction-level parallelism (ILP), and are therefore executable in parallel or OOO. Some microprocessors attempt to exploit ILP to improve performance and/or increase utilization of functional units of the microprocessor.

Some microprocessors (sometimes referred to as multi-core microprocessors) have more than one "core" (e.g., processing unit). Some single-chip implementations have an entire multi-core microprocessor, in some instances with shared cache memory and/or other hardware shared by the cores. In some circumstances, an agent (e.g., strandware) partitions a computing task into threads, and some multi-core microprocessors achieve higher performance by executing the threads in parallel on the cores of the microprocessor. Some microprocessors (e.g., some multi-core microprocessors) have cores enabled for simultaneous multithreading (SMT).

Some microprocessors compatible with the x86 instruction set (e.g., some microprocessors from [...] and [...]) have relatively few copies of (relatively complex) OOO cores. Some microprocessors (e.g., some microprocessors from Sun and [...]) have relatively many copies of (relatively simple) in-order cores. Some server and multimedia applications are multithreaded, and some microprocessors with relatively many cores perform relatively well on the multithreaded software.

Some multi-core microprocessors perform relatively well on software that has relatively high thread-level parallelism (TLP).
However, in some circumstances, some resources of some multi-core microprocessors are unused, even when executing software that has relatively high TLP. Software engineers striving to improve TLP use mechanisms that coordinate access to shared data to avoid collisions and/or incorrect behavior, mechanisms that ensure smooth and efficient parallel interlocking by reducing or avoiding interlocking between threads, and mechanisms that aid debugging of errors that appear in multithreaded implementations.

With respect to some problem domains, some compilers automatically recognize seemingly sequential operations of a thread as divisible into parallel threads of operations. Some sequences of operations are indeterminate with respect to independence and potential for parallel execution (e.g., portions of code generated from some general-purpose programming languages such as C, C++, and Java). Software engineers sometimes use some special-purpose programming languages (or parallel extensions to general-purpose programming languages) to express parallelism explicitly, and/or to program multi-core and/or multithreaded microprocessors or portions thereof (such as a graphics processing unit or GPU). Software engineers sometimes express parallelism explicitly for some scientific, floating-point, and media processing applications.

Speculative Multithreading Fundamentals

In some usage scenarios and/or embodiments, speculative multithreading, thread-level speculation, or both enable more efficient automatic parallelization.
In a speculative multithreading microprocessor system, compiler software, strandware, firmware, microcode, hardware units of the microprocessor, or any combination thereof conceptually insert one or more instances of a selected one of a plurality of types of fork instructions into various locations of a program. Conceptually, the system begins executing a (new) successor strand at a target address in the program, with register values (and optionally stores) propagated from the (parent) strand, from which the successor strand was forked, to the successor strand. The propagation is by stalling the successor strand until the values arrive, or by predicting the values and later comparing the predicted values with the values produced by the parent strand. The system creates the successor strand as a subset of a thread (e.g., the successor strand receives a subset of architectural state from the thread and/or the successor strand executes a subset of instructions of the thread). The fork instruction specifies the target address and/or instruction pointer register (RIP). In various embodiments, the system implements strand management functions (e.g., forks and joins) via various hardware elements (e.g., logic units, finite state machines, microcode engines, and other circuitry), various software elements (e.g., instructions executable by a core, firmware, microcode, strandware, and other software agents), or various combinations thereof.

The speculative multithreading microprocessor system processes join operations in (original) program order. Consider a parent strand that forks a successor strand to a target address. A join occurs when the parent strand executes up to the target address (sometimes referred to as an intersection). In some circumstances, the successor strand has already completed (in parallel with the parent strand) and is immediately ready to join.
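
The fork and the join trigger described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the hardware mechanism: the `Strand` class, the `fork` and `reaches_join_point` helpers, the register name `rax`, and the addresses are all hypothetical, and live-in values are supplied by prediction rather than by stalling.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Strand:
    start_address: int                                   # where the strand begins executing
    registers: Dict[str, int] = field(default_factory=dict)
    speculative: bool = True
    successor: Optional["Strand"] = None


def fork(parent: Strand, target_address: int,
         predicted_live_ins: Dict[str, int]) -> Strand:
    """Create a speculative successor that starts at target_address."""
    successor = Strand(start_address=target_address,
                       registers=dict(predicted_live_ins))
    parent.successor = successor
    return successor


def reaches_join_point(parent: Strand, current_address: int) -> bool:
    """A join is due when the parent arrives at its successor's start address."""
    return (parent.successor is not None
            and current_address == parent.successor.start_address)


# The parent is the architectural (non-speculative) strand in this example.
parent = Strand(start_address=0x400000, speculative=False)
child = fork(parent, target_address=0x400080, predicted_live_ins={"rax": 7})
```

Here `reaches_join_point(parent, 0x400080)` is true while any other address is not, which corresponds to the intersection condition in the text.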
At the join point, the system performs various consistency checks, e.g., ensuring that (possibly predicted) live-out register values that the parent strand propagated to the successor strand match the actual values of the parent strand at the join point. The checks guarantee that the result of execution with the forked strand is identical to the result without the forked strand. If any of the checks fail, then the system takes appropriate action (e.g., by discarding the results of the forked strand). After the parent and successor strands are joined, the parent strand terminates. The system then makes the context of the parent strand available for reuse. The successor strand becomes the architecturally visible instance of the thread for which the system created the strand. The system makes the current architectural state of the successor strand (e.g., registers and memory) observable to other threads within the microprocessor (e.g., threads on another core), to other agents of the microprocessor (e.g., DMA), and to devices external to the microprocessor.

Some speculative multithreading systems implement a nested strand model. For example, a parent strand P forks a primary successor strand S, and recursively forks further child strands. The system nests the child strands within the parent strand. The child strands execute independently of S and of each other. P conditionally joins with S as soon as all of the child strands of P have completed. In contrast, other speculative multithreading systems implement a strictly program-ordered, non-nested speculative multithreading model. For example, each parent strand P has at most one forked successor strand S outstanding at any time, and P forks no further strands until P intersects with S (resulting in a join) or S is no longer executing.
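
The join-time consistency check can be sketched as a comparison of predicted values against the values the parent actually holds at the join point. The function name `join`, its parameters, and the dictionary representation of register state are assumptions for illustration; a real system would perform the check in hardware and/or strandware and would cover more than registers.

```python
from typing import Dict


def join(predicted_values: Dict[str, int],
         parent_registers: Dict[str, int],
         successor_results: Dict[str, int]) -> Dict[str, int]:
    """Return the register state to continue with after a join attempt.

    If every predicted value matches what the parent actually produced, the
    successor's results are kept. Otherwise they are discarded and execution
    continues from the parent's own state.
    """
    prediction_held = all(parent_registers.get(reg) == value
                          for reg, value in predicted_values.items())
    if prediction_held:
        merged = dict(parent_registers)
        merged.update(successor_results)   # successor results become architectural
        return merged
    return dict(parent_registers)          # speculative results are dropped


kept = join({"rax": 7}, {"rax": 7, "rbx": 1}, {"rcx": 42})
dropped = join({"rax": 7}, {"rax": 8, "rbx": 1}, {"rcx": 42})
```

In the first call the prediction for `rax` holds, so the successor's `rcx` result is merged. In the second call the parent produced a different `rax`, so the successor's result is discarded, matching the "discard the results of the forked strand" action in the text.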
In some circumstances, implementing a non-nested model uses less and/or simpler hardware than implementing a nested model. Some usage scenarios employing unmodified sequential programs are suited to use with non-nested model implementations.

Some speculative multithreading systems use memory versioning. For example, a successor strand that (speculatively) stores to a particular memory location uses a private version of the location that is observable to strands later than the successor in program order, but is not observable to other strands (which are earlier than the successor in program order). After the successor and parent strands are joined, the system makes the speculative stores observable (atomically) to other agents. The other agents include strands other than the successor (and later) strands, other strands or units of the microprocessor (e.g., DMA), devices external to the microprocessor, and any element of the system that is enabled to access memory. In some circumstances, the system accumulates thousands of bytes of speculative store data before a join. Consider a situation where a parent strand (earlier in program order) is to write a memory location and a successor strand of the parent is to read the memory location. If the successor strand reads the memory location before the parent strand writes it, then the system aborts the successor strand. The disclosure sometimes refers to this situation as cross-strand memory aliasing. In some scenarios, the system reduces (or avoids) occurrences of cross-strand memory aliasing by its choice of fork points, resulting in less (or no) cross-strand memory aliasing.

Conceptually, the system arranges the strands belonging to a particular thread in a program-ordered queue, analogous to individual instructions in a reorder buffer (ROB) of an out-of-order processor. The system processes strand forks and joins in program order.
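
The memory versioning and cross-strand aliasing behavior described above can be sketched for the simplest case of one parent and one speculative successor. The class `SpeculativeStrandMemory` and its method names are hypothetical, memory is modelled as a Python dictionary, and the single-threaded `dict.update` only stands in for the atomic publication a real system would need.

```python
from typing import Dict, Set


class SpeculativeStrandMemory:
    """Private store buffer for one speculative successor strand."""

    def __init__(self, shared: Dict[int, int]) -> None:
        self.shared = shared                   # architecturally visible memory
        self.private: Dict[int, int] = {}      # speculative stores, not yet visible
        self.read_from_shared: Set[int] = set()
        self.aborted = False

    def load(self, address: int) -> int:
        if address in self.private:            # the strand sees its own stores
            return self.private[address]
        self.read_from_shared.add(address)     # remember reads of older state
        return self.shared.get(address, 0)

    def store(self, address: int, value: int) -> None:
        self.private[address] = value          # kept private until the join

    def parent_store(self, address: int, value: int) -> None:
        # The earlier (parent) strand writes memory. If the successor already
        # read this location, it used a stale value and must be aborted.
        if address in self.read_from_shared:
            self.aborted = True
        self.shared[address] = value

    def join(self) -> bool:
        # On a successful join, publish all speculative stores together.
        if self.aborted:
            self.private.clear()
            return False
        self.shared.update(self.private)
        self.private.clear()
        return True


shared_memory = {0x10: 1}

ok = SpeculativeStrandMemory(shared_memory)
ok.store(0x20, 5)
hidden_before_join = 0x20 not in shared_memory
ok_joined = ok.join()

aliased = SpeculativeStrandMemory(shared_memory)
aliased.load(0x10)               # successor reads first
aliased.parent_store(0x10, 9)    # parent writes the same location afterwards
aliased_joined = aliased.join()
```

The first strand's store stays hidden until the join and then becomes visible. The second strand read a location before its parent wrote it, so its join fails, which is the cross-strand memory aliasing case.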
The string at the head of the queue is a skeleton string, and is enabled to perform a unique string of joint operations, and the subsequent string is a speculative string. In some schemes, the string contains complex control flows independent of other strings (eg Branches, Calls, and Loops. In some circumstances, the string performs thousands of sums between establishment (at the point and point) and termination (at the joint point). In some cases, even if there are relatively few unprocessed strings, the string parallelism is relatively large. It can be used by thousands of instructions. Some (4) use speculation based on a variety of purposes (such as pre-fetching). Multi-lighting, while some systems use speculative multi-threading only for pre-fetching. For example, a particular string encounters a cache miss when executing a load instruction that results in a relatively slow access to the cache or the master. The system self-loading instruction divides the t-string ' and stops the specific string. The system continues to stop the specific contention while waiting for the returned data to be (missing) loaded. Unlike some other types of strings, missing loads are not blocked. It is compensated for the provision of predictions or dummy values. In various usage scenarios, pre-fetching string b is pre-fetched for 136758.doc 200935303 with a location independent of the initial miss loading; for pre-fetching, tuning or pre-processing with the processed linked list Correct the branch predictor; or any combination thereof. Spring loading makes compensation (for example, due to pre-fetching the use of predictions or virtual ^ and is not suitable for union to another string), the suspension of the missed loading in response to the pre-fetching _ ^ people (in some environments) via Speculative improvement in performance achieved by multi-threading

取決於分又與聯合點之特定選擇。在某些具體實施例中’ 系統將分又點放置於抻在丨M 徑最終到達之點)處:(例所有可能執行路 .違之點)處。例如,相對於迴圈之目前反覆過程 lteratlon)’系統分又在緊緊跟隨目前反覆過程之反覆過 程處開始之一串,因而實現兩個串全部或部分平行執行。 ㈣另一範例(例如當迴圈之反覆過程係相互相依時),系 充刀又_以執行跟隨迴圈結束之程式碼,實現迴圈之反 覆過程在一串中執行’而迴圈後面之程式碼在另一串中執 丁對於另一範例,系統分又一串以開始執行跟隨一自已 Φ 呼”函式之返回之程式碼(視需要地預測已呼叫函式 回 Ά、,til a ™ 、 已呼叫函式與跟隨返回之程式碼經由兩個 一^部或部分平行執行。在各種具體實施例中,藉由下列 或多者插入分又點:藉由編譯器及/或串體自動(視需要 二至少部分基於逐行分析執行、分析動態程式行為或兩 )藉由硬體自動;及藉由程式設計者手動。 彳夕執行緒化之各種具體實施例係自動的及/或不可 2 $些自動及/或不可觀察推測多執行緒化具體實施 1 '、σ應用於所有類型之目標軟體(例如應用程式軟體、 136758.doc -16- 200935303 器件驅動程式、作業系統常式或核心及管理程序)而無需 任何程式設計者介入。(應注意,說明有時將目標軟體稱 為目標碼,且目標碼係包含目標指令。)某些自動及/或不 可觀察推測多執行緒化具體實施例係與產業標準指令集 •r (例如x86指令集)、產業標準程式設計工具或語言(例如c、 C++及其他語言)及產業標準通用電腦系統(例如伺服器、 工作台、桌上型電腦及筆記型電腦)相容。 系統架構 φ 具備串能力之電腦之系統 圖1A解說關於具備串能力之電腦之系統,各具備串能力 之電腦具有一或多個具備串能力之微處理器,具備串能力 之微處理器可存取串體影像、記憶體、非揮發性儲存器、 輸入/輸出器件及網路。概念上系統執行串體以觀察(經由 硬體協助)及分析目標軟體(例如應用程式、驅動程式、作 業系統及管理程序軟體)之(例如χ86)指令之動態執行。串 Φ 體使用觀察來決定如何將χ86指令分割成適於平行執行於 . *備串能力之微處理器之VLIW核資源上之複數個串。串 體將已分割指令轉譯為操作(例如微操作或微運算碼),然 ’ 後將該等操作配置於束中用於在VLIW核資源上高效執 〃 °串體將該等束儲存於轉譯快取記憶體中用於稍後使用 (例如作為-或乡㈣轉轉譯視需要地包括增加有 不直接對應於X86指令之額外操作(例如用以改善效能或用 以實現串之平行執行)。系統其後針對已儲存束(例如串影 像而非X86指令之部分)之執行配置且執行已儲存束以嘗試 136758.doc 200935303 改善效能。在某些具體實施例中,觀察、分析、分割及針 對已儲存束之配置的一或多個及已儲存束之執行係相對於 指令之追縱(trace)。 該圖式解說具備串能力之電腦2000.1至2000.2,其經啟 用以用於經由耦合2063、2064及網路2009而彼此之通信。 具備串能力之電腦2000.1經由耦合2050耦合至儲存器 2010,經由耦合2055耦合至鍵盤/顯示器2005且經由耦合 205 6耦合至周邊設備2006。 該網路係實現具備串能力之電腦之間之通信的任何通信 基礎架構,例如區域網路(LAN)、都會區域網路(MAN)、 廣域網路(WAN)及網際網路之任何組合。耦合2063係與(例 如)乙太網路(例如l〇Base-T、100Base-T及1或10十億位 元)、光學網路(例如同步光學網路或SONET)或針對叢集之 節點互連機制(例如無限頻帶(Infiniband)、MyriNet、 QsNET或刀鋒型伺服器背板網路)相容。儲存元件係任何 非揮發性大容量儲存元件、陣列或其網路(例如快閃記憶 體、磁碟或光碟,以及經由網路附接儲存器或NAS及/或儲 存器陣列網路或SAN技術所耦合之元件)。耦合2050係與 (例如)乙太網路或光學網路、光纖通道、先進技術附接或 ΑΤΑ、串列ΑΤΑ或SATA、外部SATA或eSATA以及小型電腦 系統介面或SCSI相容。 鍵盤/顯示器元件係概念代表字母數字、圖形或其他人 輸入/輸出器件之一或多者之任何類型(例如QWERTY鍵 盤、光學滑鼠及平板顯示器之組合)。耦合2055係概念代 136758.doc -18- 200935303 表實現具備串能力之電腦與鍵盤/顯示器之間之通信的一 或多個麵合。在-範例中,麵合2055之一元件係與通用串 列匯流排(USB)相容且另一元件係與視訊圖形配接器 (VGA)連接器相容。周邊設備元件係概念代表可與具備串 此力之電腦結合使用之一或多個輸入/輸出器件之任何類 , 型(例如掃描器或印表機)。辆合2056係概念代表實現具備 
depends on the specific choice of fork points and join points. In some embodiments, the system places fork points at points where control flow eventually converges (points that all possible execution paths eventually reach). For example, relative to the current iteration of a loop, the system forks a strand that begins the iteration following the current one, so that the two iterations execute wholly or partially in parallel. For another example (such as when the iterations of a loop depend on one another), the system forks a strand to execute the code that follows the end of the loop, so that the loop iterations execute in one strand and the code following the loop executes in another. For another example, the system forks another strand to begin executing the code that follows the return from a called function (optionally predicting a return value of the called function), so that the called function and the code following the return execute wholly or partially in parallel.
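The notion of a point that every execution path eventually reaches can be illustrated with a post-dominator computation over a control flow graph. The following is a minimal sketch only: the text does not say how the system finds such points, and the graph, the block names, and the function `post_dominators` are hypothetical names introduced here for illustration.

```python
def post_dominators(cfg, exit_block):
    """Return {block: set of blocks found on every path from block to exit}.

    cfg maps each block name to the list of its successor block names.
    This is a plain iterative data-flow solution (an assumption of this
    sketch, not a method stated in the text).
    """
    blocks = set(cfg) | {exit_block}
    pdom = {b: set(blocks) for b in blocks}
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for b in blocks - {exit_block}:
            succs = cfg.get(b, [])
            common = set.intersection(*(pdom[s] for s in succs)) if succs else set()
            new = {b} | common
            if new != pdom[b]:
                pdom[b] = new
                changed = True
    return pdom


# Hypothetical loop: the header either runs the body again or leaves the loop.
cfg = {
    "entry": ["loop_head"],
    "loop_head": ["loop_body", "after_loop"],
    "loop_body": ["loop_head"],
    "after_loop": ["exit"],
}
pdom = post_dominators(cfg, "exit")

# Blocks that every path from the loop header must eventually reach.
fork_target_candidates = pdom["loop_head"] - {"loop_head"}
```

In this toy graph the block after the loop is reached on every path from the loop header, which matches the example of forking a strand to run the code that follows the end of a loop.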
In various embodiments, fork points are inserted by one or more of: automatically by a compiler and/or the strandware (optionally based at least in part on profiling of execution, analysis of dynamic program behavior, or both); automatically by hardware; and manually by a programmer. Various embodiments of the system implement automatic and/or unobservable speculative multithreading. Some automatic and/or unobservable speculative multithreading embodiments are applicable to all types of target software (e.g., application software, device drivers, operating system routines or kernels, and hypervisors) without any programmer intervention. (Note that target software is sometimes called target code, and target code comprises target instructions.) Some automatic and/or unobservable speculative multithreading embodiments are compatible with industry-standard instruction sets (e.g., the x86 instruction set), industry-standard programming tools and languages (e.g., C, C++, and other languages), and industry-standard general-purpose computer systems (e.g., servers, workstations, desktop computers, and notebook computers).

System Architecture
System Having a Strand-Enabled Computer
FIG. 1A illustrates a system having strand-enabled computers. Each strand-enabled computer has one or more strand-enabled microprocessors. A strand-enabled microprocessor has access to a strandware image, memory, non-volatile memory, input/output devices, and networks. Conceptually, the system executes strandware to observe (via hardware assistance) and analyze the dynamic execution of (e.g., x86) instructions of target software (e.g., application, driver, operating system, and hypervisor software). The strandware determines how to partition the x86 instructions into a plurality of strands suitable for parallel execution on the VLIW core resources of the strand-enabled microprocessor.
The strandware translates the partitioned instructions into operations (e.g., micro-operations, or uops) and arranges the operations into bundles for efficient execution on the VLIW core resources. The strandware stores the bundles in a translation cache for later use (e.g., as one or more strand images). The translation optionally includes additional operations that do not correspond directly to the x86 instructions (e.g., to improve performance or to enable parallel execution of the strands). The system subsequently arranges for execution of, and executes, the stored bundles (e.g., a strand image rather than a portion of the x86 instructions) in an attempt to improve performance. In some embodiments, one or more of the observing, analyzing, partitioning, arranging for execution of the stored bundles, and executing the stored bundles is performed with respect to traces of instructions.

The figure illustrates strand-enabled computers 2000.1 to 2000.2, which are enabled to communicate with each other via couplings 2063 and 2064 and network 2009. Strand-enabled computer 2000.1 is coupled to storage 2010 via coupling 2050, to keyboard/display 2005 via coupling 2055, and to peripherals 2006 via coupling 2056. The network is any communication infrastructure that enables communication among the strand-enabled computers, such as any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), and the Internet. Coupling 2063 is compatible with, for example, Ethernet (e.g., 10Base-T, 100Base-T, and 1 or 10 Gigabit), optical networking (e.g., Synchronous Optical Network, or SONET), or a node interconnect mechanism for clusters (e.g., Infiniband, MyriNet, QsNET, or a blade server backplane network).
The storage element is any non-volatile mass-storage element, array, or network thereof (e.g., flash memory, magnetic or optical disks, and elements coupled via network attached storage, or NAS, and/or storage array network, or SAN, techniques). Coupling 2050 is compatible with, for example, Ethernet or optical networking, Fibre Channel, Advanced Technology Attachment (ATA), Serial ATA (SATA), external SATA (eSATA), and Small Computer System Interface (SCSI). The keyboard/display element conceptually represents any type of one or more alphanumeric, graphical, or other human input/output devices (e.g., a combination of a QWERTY keyboard, an optical mouse, and a flat-panel display). Coupling 2055 conceptually represents one or more couplings that enable communication between the strand-enabled computer and the keyboard/display. In one example, one element of coupling 2055 is compatible with Universal Serial Bus (USB) and another element is compatible with a Video Graphics Adapter (VGA) connector. The peripherals element conceptually represents any type of one or more input/output devices (e.g., a scanner or a printer) usable in conjunction with the strand-enabled computer. Coupling 2056 conceptually represents one or more couplings that enable communication between the strand-enabled computer and the peripherals.

In various embodiments (not illustrated), various elements illustrated as external to the strand-enabled computer (e.g., storage 2010, keyboard/display 2005, and peripherals 2006) are included in the strand-enabled computer. In some embodiments, one or more of strand-enabled microprocessors 2001.1 to 2001.2 include hardware that enables coupling to elements that are functionally identical or similar to any of the elements illustrated as external to the strand-enabled computer.
In various embodiments, the included hardware is compatible with one or more particular protocols, such as one or more of a Peripheral Component Interconnect (PCI) bus, a PCI eXtended (PCI-X) bus, a PCI Express (PCI-E) bus, a HyperTransport (HT) bus, and a QuickPath Interconnect (QPI) bus. In various embodiments, the included hardware is compatible with a proprietary protocol used to communicate with an (intermediate) chipset that is enabled to communicate via any one or more of the particular protocols.

In some embodiments, the strand-enabled computers are identical to each other, and in other embodiments the strand-enabled computers vary according to differences relating to market and/or customer requirements. In some embodiments, the strand-enabled computers operate as servers, workstations, desktop computers, notebook computers, or personal or portable computers.

As illustrated in the figure, strand-enabled computer 2000.1 includes two strand-enabled microprocessors 2001.1 to 2001.2, coupled respectively to dynamic random access memory (DRAM) elements 2002.1 to 2002.2. The strand-enabled microprocessors communicate with flash memory 2003 respectively via couplings 2051.1 to 2051.2 and communicate with each other via coupling 2053. Strand-enabled microprocessor 2001.1 includes profiling unit 2011.1, strand management unit 2012.1, VLIW cores 2013.1, and transactional memory 2014.1.

In some embodiments, the strand-enabled microprocessors are identical to each other, and in other embodiments the strand-enabled microprocessors vary according to differences relating to market and/or customer requirements. In various embodiments, a strand-enabled microprocessor is implemented in any of a single integrated circuit die, a plurality of integrated circuit dice, a multi-die module, and a plurality of packaged circuits.
For brevity, the following description is with respect to one of the illustrated strand-enabled microprocessors. Operation of the other strand-enabled microprocessors is similar. Strand-enabled microprocessor 2001.1 exits a reset state (e.g., when a cold boot is performed) and begins fetching and executing instructions of the strandware from the code portion of strandware image 2004 contained in flash memory 2003. Execution of the instructions initializes various strandware data structures (e.g., strandware data 2002.1A and translation cache 2002.1B, illustrated as portions of DRAM 2002.1). The initialization includes copying all or any subset of the code portion of the strandware image to a portion of the strandware data, and setting aside regions of the strandware data for heap, stack, and private data storage.

The strand-enabled microprocessor then begins processing x86 instructions (e.g., x86 boot firmware contained in the flash memory in some embodiments), subject to the aforementioned observing (at least in part via profiling unit 2011.1) and analyzing. The processing is further subject to the aforementioned partitioning into strands for parallel execution, translating into operations, arranging into bundles corresponding to various strand images, and storing in a translation cache (e.g., translation cache 2002.1B). The processing is further subject to the aforementioned subsequent arranging for execution of the stored bundles and executing of the stored bundles (at least in part via strand management unit 2012.1, VLIW cores 2013.1, and transactional memory 2014.1).

The partitioning of elements illustrated in the figure is illustrative only, as there are other embodiments with other partitionings. For example, various embodiments include all or any portion of the flash memory and/or the DRAM in a strand-enabled microprocessor.
For another example, various embodiments include storage for all or any portion of the strandware data and/or the translation cache in a strand-enabled microprocessor (e.g., in one or more static random access memories, or SRAMs, on an integrated circuit die). For another example, in some embodiments, strandware data 2002.1A and translation cache 2002.1B are contained in different DRAMs (e.g., one in a first dual in-line memory module, or DIMM, and the other in a second DIMM). For another example, various embodiments store all or any portion of the strandware image on storage 2010.

Massively Multithreaded Hardware and Strandware
FIGS. 1B and 1C together illustrate conceptual hardware, strandware (software), and target software layers (e.g., subsystems) relating to a strand-enabled microprocessor (e.g., any of the strand-enabled microprocessors). The figures are conceptual in nature and, for brevity, omit various control and certain other couplings.

Hardware layer 190 includes one or more independent cores (e.g., instances of VLIW cores 191.1 to 191.4), each core enabled to process according to one or more hardware thread contexts (e.g., stored in instances of register files 194A.1 to 194A.4 and/or strand contexts 194B.1 to 194B.4) suitable for simultaneous multithreading and/or hardware context switching. The microprocessor is enabled to execute instructions in accordance with an instruction set architecture.
The microprocessor includes speculative multithreading extensions and enhancements, such as hardware to enable processing of fork and join instructions and/or operations, inter-thread and inter-core register propagation logic and/or circuitry (multi-core interconnect network 195), transactional memory 183 that implements memory versioning and conflict detection capabilities, profiling hardware 181, and other hardware elements that enable speculative multithreaded processing. In the illustrated embodiment, the microprocessor also includes a multi-level cache hierarchy (e.g., instances of L1 D-caches 193.1 to 193.4 and L2/L3 caches 196), one or more interfaces to mass memory and/or hardware devices external to the microprocessor (DRAM controller and Northbridge 197, coupled to external system/strandware DRAM 184A), a socket-to-socket system interconnect (multi-socket system interconnect 198) useful, for example, in a computer having a plurality of microprocessors (each microprocessor optionally including a plurality of cores), and interfaces/couplings to external hardware devices (chipset/PCIe bus interface 186, for coupling via external PCI Express, QPI, HyperTransport 199).

Strandware layers 110A and 110B (sometimes referred to collectively as strandware layer 110) and (x86) target software layer 101 are executed at least in part by all or any portion of one or more cores included in and/or coupled to the microprocessor (e.g., any of the instances of VLIW cores 191.1 to 191.4 of FIG. 1C). The strandware layer is conceptually invisible to elements of the target software layer, conceptually operating transparently "beneath" and/or "at the same level as" the target software layer.
The target software layer includes operating system kernel 102 and programs illustrated as executing "above" the operating system kernel (illustrated as instances of applications 103.1 to 103.4). In various embodiments and/or usage scenarios, the target software layer includes a hypervisor program (e.g., similar to VMware or Xen) that manages a plurality of operating system instances.

In various embodiments, the strandware layer enables one or more of the following capabilities:
• Virtualization of the microprocessor hardware, to present one or more virtual CPUs, or VCPUs (e.g., instances of VCPUs 104.1 to 104.6), and associated virtual devices 174 to the target software. The VCPUs appear to execute a target instruction set in which the target software layer is encoded. The VCPUs are dynamically mapped onto the native cores of the microprocessor (e.g., instances of VLIW cores 191.1 to 191.4 that are enabled to execute a native instruction set) and onto strand contexts (held, e.g., in one or more instances of register files 194A.1 to 194A.4 and/or strand contexts 194B.1 to 194B.4).
• Instrumentation, profiling, and analysis of the target software while it executes, used in part to identify where to partition a (sequential) instruction stream into a plurality of speculatively multithreaded strands. For example, the system partitions the individual sequential instruction stream executed by one or more of the VCPUs into a plurality of speculatively multithreaded strands.
• Insertion of instructions and/or code sequences into the target software, based on the analysis, to invoke the various speculative multithreading hardware units of the microprocessor to fork and join strands, to predict and/or propagate live-in values to strands, to manage memory versions and conflicts between strands, and to fork prefetch strands.
• Optimization of the target software to accelerate speculative multithreading performance, e.g., rescheduling instructions to generate critical strand live-in values earlier in time, deferring and/or reordering operations that inhibit parallelism so as to break or eliminate cross-strand dependences, removing memory aliasing, and removing redundant operations within prefetch strands.
• Maintenance of a repository of modified, instrumented, and/or optimized code (e.g., via translation cache management 111), such that the code in the repository is invisible to the target code and is available to be invoked by the strandware in place of the original target code (e.g., a portion of the target code before being modified, instrumented, or optimized).
• Handling of any internal exceptions or errors that result from any of the modification, instrumentation, and optimization (e.g., speculative multithreading) and that otherwise would not occur when executing the target software. In some circumstances, the handling of internal exceptions or errors includes re-optimization and/or disabling of optimizations that degrade performance.
• Provision of an optional mechanism for the target code to provide hints to the strandware, such as potentially beneficial fork points, synchronization points, potential cross-strand aliasing points, and other optimization information.

Binary Translation and Dynamic Optimization
In some embodiments, the microprocessor hardware is enabled to execute an internal instruction set that differs from the instruction set of the target software. In various embodiments, the strandware, optionally in cooperation with any combination of one or more hardware acceleration mechanisms, performs dynamic binary translation (e.g., via x86 binary translation 115) to translate target software of one or more target instruction sets (e.g., an x86-compatible instruction set, such as the x86-64 instruction set) into native micro-operations (uops). The hardware acceleration mechanisms include all or any portion of one or more of profiling hardware 181, hardware acceleration unit 182, transactional memory 183, and hardware x86 decoder 187. The microprocessor hardware (e.g., instances of VLIW cores 191.1 to 191.4) is enabled to directly execute the uops (and, in various embodiments, the microprocessor hardware is not enabled to directly execute instructions of one or more of the target instruction sets). At least in some circumstances, the translations are then stored in the repository (e.g., via translation cache management 111) for fast recall and reuse (e.g., as strand images), thereby eliminating re-translation.

In various embodiments, the microprocessor is enabled to access (e.g., by being coupled or attached to) a relatively large memory region. The system implements the memory region via a dedicated DRAM module (included in the microprocessor or external to it, in various embodiments) or alternatively as part of a reserved region, invisible to the target code, of external system/strandware DRAM 184A. The memory region provides storage for various elements of the strandware (e.g., one or more of code, stack, heap, and data), and in some embodiments provides all or any portion of a translation cache (e.g., as managed by translation cache management 111) and optionally one or more buffers (e.g., speculative multithreading temporary state buffers). When the microprocessor is initially booted (e.g., by performing a cold boot), the strandware code is copied from flash ROM into the memory region (e.g., into the dedicated DRAM module or the reserved portion of external system/strandware DRAM 184A), and the microprocessor then fetches native uops from the memory region.
After the strandware initializes the microprocessor (e.g., via hardware control 172) and the internal data structures of the strandware, the strandware begins executing boot firmware and/or operating system kernel boot code (encoded in one or more of the target instruction sets) using binary translation (e.g., via x86 binary translation 115), similar to a conventional hardware-based microprocessor without a binary translation layer.

In some usage scenarios, using strandware to perform binary translation and/or dynamic optimization offers advantages over adding speculative multithreading instructions to the target instruction set. In some circumstances, binary translation and/or dynamic optimization enables simplification of the hardware of each core, for example by removing and/or reducing hardware for decoding the target instruction set (e.g., hardware x86 decoder 187) and hardware for out-of-order execution. In some embodiments, the removed and/or reduced hardware is conceptually replaced with one or more VLIW (Very Long Instruction Word) microprocessor cores (e.g., instances of VLIW cores 191.1 to 191.4). A VLIW core executes, for example, bundles of pre-scheduled uops, where all of the uops of a bundle execute (or begin execution) in parallel (e.g., on a plurality of functional units, such as instances of ALUs 192A.1 to 192A.4 and FPUs 192B.1 to 192B.4). In various embodiments, the VLIW cores lack one or more of relatively complex decoding, hardware-based dependency analysis, and dynamic out-of-order scheduling. The VLIW cores optionally include local storage (e.g., instances of L1 D-caches 193.1 to 193.4 and register files 194A.1 to 194A.4) and other per-core hardware structures for efficient instruction processing.
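The translate-once, reuse-many behavior of the repository described above can be sketched as a cache keyed by target code address. This is an illustration under stated assumptions only: `TranslationCache`, `translate_block`, and the strings standing in for micro-operations (uops) are hypothetical names introduced here, not the strandware's actual interfaces.

```python
class TranslationCache:
    """Toy repository mapping a target-code address to its translated uops."""

    def __init__(self, translate_block):
        # translate_block is a hypothetical callback standing in for
        # binary translation of the code found at a target address.
        self._translate_block = translate_block
        self._entries = {}
        self.translations_performed = 0

    def lookup(self, target_address):
        """Return the cached uops, translating only on the first request."""
        if target_address not in self._entries:
            self._entries[target_address] = self._translate_block(target_address)
            self.translations_performed += 1
        return self._entries[target_address]


def translate_block(target_address):
    # Stand-in translator: returns a made-up list of uop names.
    return [f"uop_for_{target_address:#x}_{i}" for i in range(3)]


cache = TranslationCache(translate_block)
first = cache.lookup(0x1000)   # miss: the block is translated and stored
second = cache.lookup(0x1000)  # hit: the stored translation is reused
```

The second lookup returns the stored object without calling the translator again, which is the re-translation saving the text attributes to the repository.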
In some usage scenarios and/or embodiments, the VLIW cores are small enough to enable one or more of: packing more cores into a given die area, powering more cores within a given power budget, and clocking the cores at a higher frequency than would otherwise be possible with complex out-of-order cores. In some usage scenarios and/or embodiments, isolating the VLIW cores from target instruction set semantics via binary translation enables efficient encoding of the uop format, registers, and various details of the VLIW cores relevant to efficient speculative multithreading, without modifying the target instruction set.

Role of Strandware Dynamic Optimization Software
The trace construction subsystem of the strandware layer (e.g., trace profiling and capture 120) collects and/or organizes translated uops into traces as they are executed by the microprocessor (e.g., uops from a sequence of translated basic blocks having a common control flow path through the target code). The strandware performs relatively extensive optimizations (e.g., via optimization 163) using a wide variety of techniques. Some of the techniques are similar in scope to those performed by an optimizing compiler with access to source code, but the strandware uses dynamically measured program behavior collected during profiling (e.g., via one or more of physical page profiling 121, branch profiling 124, predictive optimization 125, and memory profiling 127) to guide at least some of the optimizations. For example, loads from and stores to memory are selectively reordered (e.g., according to information obtained via memory aliasing analysis 162) to initiate cache misses as early as possible.
In some embodiments, the selective reordering is based at least in part on measurements (e.g., made via memory profiling 127) of loads and stores that reference the same address. In some usage scenarios and/or embodiments, the selective reordering enables relatively aggressive optimizations over the scope of hundreds of instructions. Each uop is then scheduled according to when its input operands will be available and when various hardware resources (e.g., functional units) will be free (e.g., by inserting each uop into the schedule, via schedule each uop 165). In some embodiments (e.g., some embodiments having functionality as illustrated by encode VLIW-like bundles 167), the scheduling attempts to pack up to four uops into each bundle. Having a plurality of uops in a bundle enables a particular VLIW core (e.g., any of VLIW cores 191.1 to 191.4) to execute the uops in parallel when the scheduled trace is later executed.
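Packing the micro-operations (uops) of a trace into bundles of up to four, subject to operand availability, can be sketched with a simple greedy list scheduler. The sketch assumes unit latency for every uop and ignores functional-unit types; the function `pack_bundles` and the uop names are hypothetical and are not taken from the described strandware.

```python
MAX_UOPS_PER_BUNDLE = 4  # the text mentions packing up to four uops per bundle


def pack_bundles(uops, deps):
    """Greedily pack uops into bundles.

    uops: list of uop names in original program order (producers first).
    deps: {uop: set of uops whose results it consumes}.
    Each uop goes into the earliest non-full bundle that comes after
    the bundles holding all of its producers.
    """
    bundle_of = {}
    bundles = []
    for uop in uops:
        earliest = 0
        for producer in deps.get(uop, ()):
            earliest = max(earliest, bundle_of[producer] + 1)
        index = earliest
        while index < len(bundles) and len(bundles[index]) >= MAX_UOPS_PER_BUNDLE:
            index += 1
        while len(bundles) <= index:
            bundles.append([])
        bundles[index].append(uop)
        bundle_of[uop] = index
    return bundles


# Made-up trace: five independent loads, two arithmetic uops, one store.
uops = ["ld_a", "ld_b", "ld_c", "ld_d", "ld_e", "add_ab", "mul_cd", "st_sum"]
deps = {
    "add_ab": {"ld_a", "ld_b"},
    "mul_cd": {"ld_c", "ld_d"},
    "st_sum": {"add_ab", "mul_cd"},
}
bundles = pack_bundles(uops, deps)
```

Here the fifth load spills into the second bundle because the first is full, and the store lands in a third bundle because it consumes results produced in the second.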

Finally, the optimized trace (having VLIW bundles each with one or more uops) is inserted into the repository as all or part of a strand image (e.g., via translation cache management 111). In some embodiments, the hardware executes native uops only from traces stored in the translation cache, enabling continual reuse of the optimization work performed by the strandware. In some usage scenarios and/or embodiments, traces are successively re-optimized through a series of increasingly higher-performance optimization levels, each level being relatively more expensive to perform, depending for example on how frequently a trace is executed (e.g., via promotion 130).

In some embodiments, the dynamic optimization software enables certain relatively aggressive optimizations via use of atomic execution. In some circumstances, instances of the relatively aggressive optimizations would be "unsafe" without atomic execution, that is, would result in incorrect modification of architectural state. One example of atomic execution treats a group of uops (called a commit group) as an indivisible unit with respect to modification of architectural state. A trace optionally contains one or more commit groups. If all of the uops of a commit group complete correctly (e.g., without any exception or error), then changes are made to the architectural state according to the results of all of the uops of the commit group. Under other circumstances, the results of all of the uops of the commit group are discarded, and no changes are made to the architectural state with respect to the results of the commit group. For example, in the event of an exception detected with respect to one of the uops of a commit group (e.g., a page fault, or a branch that follows a path different from the path along which the trace was originally formed), a rollback occurs, and all results produced by all of the uops of the commit group are discarded. In some embodiments and/or usage scenarios, after the rollback, the microprocessor and/or the strandware re-execute the instructions corresponding to the uops of the commit group in original program order (and optionally without one or more of the optimizations) to pinpoint the source of the exception. Co-pending U.S. patent application 10/994,774, entitled "Method and Apparatus for Incremental Commitment to Architectural State," discloses other information regarding dynamic optimization and commit groups.

In some embodiments and/or usage scenarios, the hardware and the software operating in combination enable benefits similar to those of an out-of-order dynamically scheduled microprocessor, for example by extracting fine-grained parallelism within a single strand via relatively aggressive VLIW trace scheduling and optimization. In various embodiments, similar to an out-of-order microprocessor, the hardware and the software perform the fine-grained parallelism extraction while relatively efficiently reordering and interleaving independent strands to cover memory latency stalls. In some circumstances, the hardware and the software enable relatively efficient scaling across many cores and/or threads, achieving an effective issue width of potentially hundreds of uops per clock.

Multithreaded Dynamic Optimization
In some embodiments having a massively multi-core and/or multithreaded microprocessor, the dynamic optimization software is implemented to make relatively efficient use of a plurality of cores and/or threads. For example, one or more of trace profiling and capture 120, strand construction 140, scheduling and optimization 160, and x86 binary translation 115 are pervasively multithreaded at one or more levels, enabling some or all of the overhead associated with binary translation and/or dynamic optimization to be reduced or effectively hidden. The microprocessor executes the dynamic optimization software in the background so as not to impede forward progress in executing the target code (e.g., via optimized code from the translation cache). Various embodiments implement one or more mechanisms to enable the background execution of the dynamic optimization software. For example, the microprocessor and/or the strandware explicitly dedicate a portion of the resources (e.g., one or more cores in a multi-core microprocessor embodiment) to executing the dynamic optimization software.
The dedicated system is permanent, or #代地为 transient and / or dynamic 'for example, when part of the resource is available (for example, when the target code obviously puts the unused VCPU in an idle state), for another example, Or multiple core priority control mechanisms that implement the string thread (which maps (10), for example) to the target visible VCPU) to share the core and associated cache memory with little or no observable performance degradation ( For example, by using a slack loop established by a stopped target thread executed in accordance with a target instruction set architecture or ISA; hardware and constellation implementations In various embodiments, the elements illustrated in FIG. 1A correspond to maps All or part of the functionality explained with 1C. For example, in some embodiments, the DRAM 2002.1 of Figure 1A corresponds to the external system DRAM 1 84A of Figure 1C, and the translation cache memory management i camp s * Translating the cache memory 2002.1B. For another example, in some cases, I36758.doc -30- 200935303 The VLIW core 2013.1 of Figure 1A corresponds to the VLIW cores 191.1 to 191.4 of Figure 1C. One or more of the changes in Figure 1 The memory 2014.1 corresponds to the transaction memory 183 of FIG. 1C, and the progressive analysis unit 2011.1 of FIG. 1 corresponds to the line-by-line analysis hardware 181 of FIG. 1C. For another example, in some embodiments, FIG. The management unit 2012.1 corresponds to control logic coupled to one or more of the scratchpad files 194A.1 to 194A.4 and/or the string contexts 194B.1 to 194B.4 of Figure ic. For Figures 1, 1B and 1C As another example of the correspondence between the components, in some embodiments, the string image 2〇〇4 of FIG. 1A has an initial image of all or any portion of the string layers 110A and 110B of FIGS. 1B and 1C. As another example, in some embodiments, the string capable microprocessor 2001.1 of FIG. 
1A implements the functions illustrated by the hardware layer 19 of FIG. 1C. In various embodiments, the chip set of FIG. 1C /pcie bus interface 186, multi-socket system interconnect 198 and/or ?. All or any part of express, Qpi, super-transmission ^ is implemented with coupling 2050, 2〇55, 2〇56, 2063, 205 of FIG. 1A 1.1 or 2053 all or any part of the associated interface. In various embodiments The chipset/PCIe bus interface 186 and/or the pci express, Qpi, and ultra-transmission 199 operating in conjunction with the interrupt of the ic, the SMP and the timer 175, and the keyboard/display 2〇〇5 of FIG. 1A are implemented. And/or all or any portion of the peripheral device 2006. In various embodiments, all or any portion of the interface associated with the coupling 2052.1 of Figure 1CiDRAM controller and Northbridge 197 is speculated to be more The threaded model of the various embodiments of the speculative multi-threading system is used to use the unmodified object code program in which the 136758.doc 200935303 Π:: deterministic programmed order execution is performed. Has been strict ^" there are at most successor strings. If the previous generation string p is first and then, before and after the association with S1 and/or before the termination of S1, the division of 82 is invalid (for example, by, for example) It will be considered as no operation or as a distinction between (4) and again. If the previous generation ❹ 10 knives and there are no Μ resources (for example, there is no free actor 2 network) to complete the points again, then depending on the points and why the class and the points are suppressed or alternatively blocked and executed again Until the payment is available. 
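
The single-successor rule lends itself to a small model. The following is a minimal sketch only, not the patented hardware mechanism: the names `Strand`, `StrandPool`, `fork`, and `join_or_terminate` are hypothetical, and the sketch assumes that a fork lacking a free thread context is suppressed (the text equally allows blocking until a context frees up).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Strand:
    """Hypothetical strand record; a strand has at most one successor."""
    name: str
    successor: Optional["Strand"] = None


class StrandPool:
    """Hypothetical pool of free hardware thread contexts."""

    def __init__(self, free_contexts: int) -> None:
        self.free_contexts = free_contexts

    def fork(self, parent: Strand, name: str) -> Optional[Strand]:
        """Return the new successor, or None when the fork is ineffective."""
        if parent.successor is not None:
            # A second fork before join/termination is treated as a no-op.
            return None
        if self.free_contexts == 0:
            # Assumption: suppress the fork rather than block the forker.
            return None
        self.free_contexts -= 1
        parent.successor = Strand(name)
        return parent.successor

    def join_or_terminate(self, parent: Strand) -> None:
        """Release the successor's context so a later fork can succeed."""
        if parent.successor is not None:
            parent.successor = None
            self.free_contexts += 1


pool = StrandPool(free_contexts=1)
p = Strand("P")
s1 = pool.fork(p, "S1")       # succeeds: P has no successor yet
s2 = pool.fork(p, "S2")       # ineffective: S1 is still outstanding
```

Returning `None` for both failure cases keeps the sketch short; a fuller model would distinguish suppression from blocking, since the text makes that choice depend on the fork type.
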
In some embodiments, the microprocessor is enabled to execute in accordance with a native micro-op instruction set that includes a variety of micro-ops, features, and internal registers usable to fork strands, control interaction between strands, join strands, and abort (e.g. kill) strands. In some embodiments, the micro-ops include:

• fork.type target, inherit directs the microprocessor to create a new successor strand S of parent strand P. The microprocessor (via any combination of hardware and software elements) maps the successor strand to a particular core and thread of the microprocessor in accordance with one or more strandware-defined and/or hardware-defined policies. The particular VCPU that executes the fork micro-op of the parent strand owns the successor strand (along with the parent strand). Execution of the successor strand begins at the target address specified by the target parameter (expressed either as a native micro-op address within the strandware address space or as a target code RIP). The inherit parameter indicates which registers the parent strand will modify after performing the fork operation and which registers are to be copied (inherited) to the successor strand (see the section "Skipahead Strands" located elsewhere herein). The type parameter specifies one of several different strand types for the successor strand (e.g. a fine-grained skipahead strand, a fully speculative multithreaded strand, a prefetch strand, or a strand having other semantics or purposes). The fork micro-op provides an output value that is a strand ID. The strand ID is an identifier associated with the successor strand (and is globally unique at least within the same VCPU). The identifier specifies the program order of the successor strand relative to all other strands associated with the particular VCPU that owns both the parent and successor strands.
• kill.cmptype.cc ra, rb, T directs the microprocessor to eliminate one or more strands. More specifically, when executed within parent strand P, kill recursively aborts the successor strand S of P (if any) and all successor strands of S (if any). Execution of the kill micro-op produces a result by comparing register operands ra and rb via the specified ALU operation cmptype (e.g. kill.sub or kill.and), and then checks the result against the specified condition code cc (e.g. less than or equal). If the specified condition is true, the strand scope identifier T matches the strand scope identifier of the associated fork micro-op, and the nested fork depth is zero, then the successor strand of parent strand P is killed. See the sections "Strand Scope Identification" and "Nested Strands" located elsewhere herein for further disclosure.
• wait.type [object] directs the microprocessor to stall execution pending a condition. More specifically, when executed within strand S, wait causes execution of strand S to wait for a specified condition (and optionally a specified object, e.g. a memory address) before proceeding. For example, in some embodiments, the microprocessor is enabled to wait until the strand is architectural (e.g. non-speculative), to wait for a particular memory location to be written, to wait until the successor strand completes, and to wait until the parent strand reaches a certain state.
• join directs the microprocessor to block execution of a speculative successor strand associated with a parent strand until the parent strand joins with the successor strand. Strandware executes the join micro-op when a particular strand is unable to make forward progress while speculative.
• Micro-ops optionally include a propagate bit that instructs hardware to send the result of the micro-op (in a parent strand) to the successor strand of the parent strand. See the section "Skipahead Strands" located elsewhere herein for further disclosure relating to the propagate bit modifier.

In some embodiments, some or all of the functionality of the foregoing micro-ops is implemented by executing a plurality of other micro-ops, by performing writes to internal machine state registers, by invoking (optionally automatically) separate hardware mechanisms that are not based on micro-ops, or by any combination thereof.

In various usage scenarios in which a parent strand forks a speculative successor strand,
there are several reasons for the successor strand to wait or stop executing (e.g. suspend or temporarily halt) and await the parent strand. For example, if an exception occurs in a speculative strand, then in some circumstances the exception indicates a misspeculation, or a situation in which the parent strand should not have forked the successor strand. For another example, a speculative strand is unable to perform a particular operation because the particular operation is restricted to use only in non-speculative (architectural) strands. Instances of restricted operations optionally include accessing I/O (e.g. via PCI Express, QPI, HyperTransport 199), reading or writing particular registers, accessing particular memory regions (e.g. uncacheable memory), entering a portion of strandware that is restricted to non-speculative execution, or attempting to use the result of a deferred operation.

When a parent strand joins with a waiting successor strand, and if the parent strand verifies that all live-outs of the parent strand match the live-ins of the successor strand, then the exception of the successor strand is "genuine". The exception is genuine in the sense that it is not a side effect of incorrect speculation, and thus the microprocessor handles the exception in an architecturally visible manner. In various circumstances, when execution of the successor strand resumes, the successor strand is immediately vectored based on the exception (e.g. into the operating system kernel) to handle the exception (e.g. a page fault). In some circumstances, when execution of the successor strand resumes, execution continues without error, since the successor strand is now architectural (non-speculative).

The microprocessor joins strands in program order, and each VCPU owns one or more of the strands. The most up-to-date architectural strand represents the architectural state of the VCPU that owns the strand. The microprocessor makes the architectural state available for observation outside the owning VCPU (e.g. via stores committed to memory). The microprocessor is enabled to move the most up-to-date architectural strand freely between cores within the microprocessor while the owning VCPU appears to continue executing (e.g. as observed by an operating system kernel executing with respect to the owning VCPU).

Speculative Multithreading Strategies

Microprocessor hardware and microprocessor strandware (software) enable speculative multithreading at several levels of increasingly broad scope:

• Prefetch strands are optionally forked automatically when a strand stalls on a relatively long-latency cache operation (e.g. a cache miss satisfied from main memory) (see the section "Prefetch Strands" located elsewhere herein). Before the data is used (e.g. before the data is accessed by the (parent) strand from which the prefetch strand was forked), the prefetch strand attempts to fetch data expected to be used into one or more caches and/or attempts to prime one or more branch predictors with appropriate information. In some circumstances, a prefetch strand is active for hundreds of cycles. In some embodiments, the system provides for any type of strand to fork a prefetch strand, as long as the forking strand has not forked another strand (thus preventing a scenario in which a particular strand has more than one successor strand). In some embodiments, the system provides for any type of strand to fork a prefetch strand even when the forking strand has already forked another strand (resulting in a scenario in which a particular strand has more than one successor strand). In some embodiments, hardware has logic to selectively enable or suppress the creation of prefetch strands in accordance with one or more software-controllable and/or strandware-controllable prefetch policies.
• Skipahead strands are forked by a parent strand when strandware determines that the parent strand is relatively likely to stall on a particular instruction (e.g. a load that relatively frequently encounters cache misses) (see the section "Skipahead Strands" located elsewhere herein). Alternatively, a skipahead strand is forked so that the skipahead strand begins executing after a relatively highly predictable final branch (e.g. a branch having a correct prediction rate greater than a predetermined and/or programmable threshold). A skipahead strand blocks until the parent strand provides the live-ins on which the skipahead strand depends. For example, as the parent strand produces live-outs, values for the live-outs are sent to the skipahead strand (where a subset of the live-outs of the parent strand are live-ins of the skipahead strand). The live-outs sent optionally include registers and/or memory locations.
• Strandware forks speculative strand threading (SST) strands based on dynamically (and optionally statically) inferred control flow structures and idioms (see the section "Speculative Strand Threading (SST)" located elsewhere herein). The structures and idioms include
iterative structures (e.g. loops), calls and returns (e.g. calls to and returns from subroutines, functions, procedures, and libraries), and control flow joins (e.g. a common join point reached by both the "if" and "else" paths of a conditional block). An SST strand contains one or more instruction sequences (e.g. basic blocks, traces, commit groups, or other quantities of instructions). In some scenarios, dynamic control flow changes occur at the end of each instruction sequence to determine the next instruction sequence for the strand to execute. Control flow changes within an SST strand (unlike in certain other strand types) occur independently of control flow within successor strands of the SST strand. Control flow changes within an SST strand relatively rarely invalidate successor
strands. In some circumstances, the system selectively changes an SST strand into a prefetch strand. In some circumstances, an SST strand is active for tens of thousands or hundreds of thousands of cycles.
• In some embodiments, profiling strands are used during construction of SST strands (see the section "Instrumentation for Profiling" located elsewhere herein) to gather cross-strand forwarding data. Unlike other strands, a profiling strand executes serially (e.g. in program order) rather than in parallel with the parent strand of the profiling strand.

Prefetch Strands

In some circumstances of strand execution, execution encounters a stall event that otherwise blocks progress (e.g. a cache miss). In response, the microprocessor optionally forks a prefetch strand while the strand that encountered the stall event is stalled.

The microprocessor allocates the (new) prefetch strand (in some embodiments, on the same core as the parent strand but in a different strand context), and the prefetch strand begins with the architectural state (registers and memory) of the parent strand. The prefetch strand continues executing until information is delivered to the stalled (parent) strand that enables the stalled strand to resume processing (e.g. data for the cache miss is delivered to the stalled strand). The microprocessor (e.g. elements of Hardware Layer 190) then automatically destroys the prefetch strand and unblocks the stalled strand. In some embodiments, the microprocessor has logic to selectively enable or suppress the creation of prefetch strands in accordance with one or more software-controllable and/or strandware-controllable prefetch policies. For example, strandware configures the microprocessor to fork a prefetch strand when a miss encountered by a strand results in a main memory access, and to stall the strand when a miss results in an L2 or L3 hit.

In some circumstances of executing a load, a prefetch strand encounters a relatively long-latency cache miss (e.g. a miss that would otherwise result in forking a prefetch strand). If so, then rather than blocking, the load delivers (in the context of the prefetch strand) an "ambiguous" placeholder value that is distinct from all other data values delivered (e.g. data values obtainable via cache hits). If a micro-op has an ambiguous input, then the micro-op propagates ambiguous as the result of the micro-op (sometimes referred to as "the micro-op outputs an ambiguous value"). The microprocessor treats a branch having an ambiguous condition as if the predicted destination of the branch matched the actual destination of the branch. When a prefetch strand executes a store, the prefetch strand allocates a cache line or temporary memory buffer element that is visible (e.g. observable and controllable) only to the prefetch strand, to prevent the parent strand from observing the store. In some embodiments, if a store writes an ambiguous value (e.g. into a cache), then the destination of the store receives the ambiguous value (e.g. the affected bytes in one or more cache lines of the cache are marked as ambiguous). Subsequent loads from the destination receive the ambiguous value, thus propagating the ambiguous value. In various usage scenarios, propagation of ambiguous values makes it possible to avoid prefetching data that is not needed (e.g. when loading pointers) and/or to avoid updating a branch predictor incorrectly or inefficiently (e.g. when loading a branch condition).
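
The placeholder propagation just described can be sketched in a few lines. This is an illustrative model under stated assumptions, not the hardware design: the sentinel `AMBIGUOUS` and the helpers `load`, `alu`, and `store` are hypothetical names, the cache is modeled as a dictionary in which a missing key stands for a long-latency miss, and the strand-private store buffer is a second dictionary.

```python
# Hypothetical sentinel, distinct from every ordinary data value.
AMBIGUOUS = object()


def load(private_buffer, cache, addr):
    """Prefetch-strand load: the strand's own stores are checked first, then
    the cache; a long-latency miss yields AMBIGUOUS instead of blocking."""
    if addr in private_buffer:
        return private_buffer[addr]
    return cache.get(addr, AMBIGUOUS)


def alu(op, a, b):
    """A micro-op with any ambiguous input outputs an ambiguous value."""
    if a is AMBIGUOUS or b is AMBIGUOUS:
        return AMBIGUOUS
    return op(a, b)


def store(private_buffer, addr, value):
    """Stores land in a strand-private buffer, never in the shared cache,
    so the parent strand cannot observe them. Storing AMBIGUOUS marks the
    destination so that later loads of it are ambiguous too."""
    private_buffer[addr] = value


cache = {0x10: 5}             # only address 0x10 hits
buffer = {}                   # private to the prefetch strand

hit = load(buffer, cache, 0x10)
miss = load(buffer, cache, 0x20)
total = alu(lambda x, y: x + y, hit, 1)
poisoned = alu(lambda x, y: x + y, miss, 1)
store(buffer, 0x30, poisoned)
reloaded = load(buffer, cache, 0x30)
```

Here `total` is an ordinary value, while `poisoned` and `reloaded` are both the sentinel, which mirrors how an ambiguous result flows through a store and a later load without touching shared state.
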

In some embodiments, the microprocessor has logic to configure the conditions and thresholds under which loads that encounter cache misses return ambiguous results (instead of stalling the prefetch strand). For example, strandware configures the microprocessor to produce ambiguous values only for cache misses that result in main memory accesses, and to stall for other cache misses.

In various usage scenarios (e.g. integer and/or floating-point code), prefetch strands make data available before use by the parent strand (reducing or eliminating cache misses) and/or prime branch predictors (reducing or eliminating mispredictions). Various embodiments use prefetch strands instead of (or in addition to) hardware prefetching.

In some circumstances in which a prefetch strand is forked from a parent strand, the prefetch strand executes for hundreds of cycles while the parent strand waits for a cache miss (e.g. while the miss is satisfied from main memory implemented, e.g., as DRAM). In some usage scenarios and/or embodiments, a system enables a prefetch strand to make forward progress for a relatively long portion of the time that a parent strand is waiting. For example, strandware constructs one or more traces for use in prefetch strands, and the traces optionally exclude micro-ops having certain properties. For example, strandware optionally excludes micro-ops that make no contribution to generating memory addresses. For example, strandware optionally excludes micro-ops that are used only to verify relatively easily predicted branches. For example, with respect to a trace within a particular prefetch strand, strandware optionally excludes micro-ops that store to memory values that are not read (or are relatively unlikely to be read) within the prefetch strand. For yet another example, strandware optionally excludes micro-ops that load data already present (or relatively likely to be present) in a cache before execution of the micro-op. For example, strandware optionally excludes micro-ops having properties that make the micro-op irrelevant to prefetching.

In some embodiments and/or usage scenarios, the microprocessor executes prefetch strands far ahead of (waiting) parent strands (assuming time is available). For example, strandware attempts to minimize (by eliminating or reducing) the micro-ops in a prefetch trace, leaving only micro-ops on one or more critical paths to the execution of particular loads. The particular loads are, e.g., loads that relatively frequently result in cache misses, loads that result in cache misses having relatively long fill latency, or any combination thereof.
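
One conventional way to keep only the micro-ops on the path to a particular load is a backward slice over register dependences. The sketch below is an assumption-laden illustration, not the strandware algorithm: the function name `prefetch_slice` and the tuple encoding of micro-ops are hypothetical, and only register dependences are tracked (memory dependences, branches, and scheduling are ignored).

```python
def prefetch_slice(trace, delinquent_loads):
    """Keep only micro-ops that a delinquent load's address depends on.

    trace: list of (dest, opcode, sources) tuples in program order, where
        dest is a register name or None (e.g. for stores).
    delinquent_loads: set of indices into trace marking loads that miss often.
    """
    needed = set()            # registers whose producers must be kept
    kept = []
    for index in range(len(trace) - 1, -1, -1):
        dest, _opcode, sources = trace[index]
        if index in delinquent_loads or dest in needed:
            kept.append(index)
            needed.discard(dest)      # this micro-op satisfies the need
            needed.update(sources)    # its inputs become needed instead
    kept.reverse()
    return [trace[i] for i in kept]


example_trace = [
    ("r1", "mov", ["r0"]),            # pointer base
    ("r2", "add", ["r5", "r6"]),      # unrelated arithmetic
    ("r3", "add", ["r1", "r4"]),      # address calculation
    (None, "store", ["r2", "r7"]),    # store not read by the slice
    ("r8", "load", ["r3"]),           # delinquent load
    ("r9", "mul", ["r8", "r2"]),      # consumer of the loaded value
]
sliced = prefetch_slice(example_trace, delinquent_loads={4})
```

Walking the trace backward means each kept micro-op replaces its destination in the needed set with its own sources, so unrelated arithmetic, unread stores, and consumers of the loaded value all fall away.
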
In some embodiments, strandware, in conjunction with hardware (e.g. cache miss performance counters), collects and maintains a profiling data structure used to determine the particular loads, e.g. by gathering information about delinquent loads. When optimizing a prefetch trace, strandware optionally operates to reduce the dataflow that produces the target addresses of the particular loads.

Skipahead Strands

Skipahead Multithreading Model

The profiling subsystem of the strandware layer (e.g. Trace Profiling and Capture 120 of Fig. 1B) identifies traces that, when executed by the microprocessor, are relatively good candidates for skipahead speculative multithreading. In some embodiments and/or
usage scenarios, the system uses skipahead strands for traces having relatively highly predictable terminal branches (e.g. unconditional branches, loop instruction branches, or branches that the system has predicted relatively successfully). The system optionally selects candidates based on one or more characteristics. An example characteristic is relatively low static instruction level parallelism (ILP), e.g. due to relatively many NOPs. Another example characteristic is relatively low dynamic ILP (e.g. having loads that stall relatively frequently, resulting in dynamic schedule gaps that are relatively difficult to observe statically). Another example characteristic is parallel issue potential greater than a single core is able to provide.

Skipahead speculative multithreading is effective in some usage scenarios having traces that contain entire loop iterations and/or in which relatively few dependences exist between loop iterations. Skipahead speculative multithreading is effective in some usage scenarios having calls and returns that are not candidates for inline expansion into a single trace. In some usage scenarios and/or embodiments, skipahead speculative multithreading yields performance levels similar to those of an ROB-based out-of-order core (but with relatively less hardware complexity). In some skipahead speculative multithreading circumstances, a successor strand skips hundreds of instructions ahead of the start of a trace. In some circumstances, the performance improvement realized by skipahead speculative multithreading (e.g. as achieved by relatively high or maximal overlap) depends on relatively accurate prediction of the start address of the successor and on data independence.

Fig. 2 illustrates an example of hardware executing a skipahead strand (e.g. as synthesized by strandware), plotted as time in cycles versus core or interconnect. In the illustration, the term "skipahead strand" refers to execution (as a strand) of target code (or a binary-translated version thereof), where the skipahead strand begins executing at the point following the terminal
trace of the parent strand, that is, at the next instruction (or binary-translated equivalent) executed (in some circumstances) after that trace. For each skipahead strand, a code generator of the strandware layer (e.g. one or more elements of Scheduling and Optimization 160 of Fig. 1B and/or Strand Construction 140 of Fig. 1B) inserts a fork.skip micro-op into the terminal trace of the parent strand. The "terminal trace" of a strand is the last trace executed by the strand before the strand reaches its join point. When the system executes the fork.skip micro-op, the system forks a new (e.g. successor or child) strand as a skipahead strand. The skipahead strand begins executing at the next instruction (or binary-translated version thereof) executed in program order after the end of the trace containing the fork.skip micro-op is reached. In some embodiments, for a terminal trace that ends in a conditional or indirect branch, the skipahead strand begins at the dynamically determined target of the branch. In some usage scenarios and/or embodiments, the system dynamically selects the fork target via a trace predictor and/or a branch predictor. In scenarios in which the terminal trace ends in an unconditional branch and/or strandware ends the trace in the middle of a basic block, the starting point of the skipahead strand is determined when the terminal trace is generated.

In Fig. 2, fork.skip micro-op 211 in Parent Strand 200 has created Successor Strand 201, illustrated in the right column as strand ID 22 executing on core 2. The successor strand begins after a delay (illustrated as three cycles) due to inter-core communication latency. The successor strand then begins executing the trace corresponding to the fork target address.

The fork.skip micro-op encodes a propagate set (illustrated as the propagated_archreg_set field, dashed-box element 212) that specifies a bitmap of the architectural registers to be written by the terminal trace of the parent (the trace does not modify other architectural registers). Execution of the successor strand stalls on the first read of an architectural register that is a member of the propagate set, unless the successor strand has previously written the register, in which case the successor subsequently reads its own private version of the register (instead of the not-yet-propagated version of the parent).

With respect to the terminal trace of the parent strand, the micro-op format includes a mechanism to indicate that the result of a micro-op is to be propagated to the successor strand. In some embodiments, a VLIW bundle includes one or more "propagate" bits, each associated with
之-或多個微運算碼關聯。體針對前跳越排程且最佳 化-終端追蹤時,在且只有在微運算瑪係欲寫人至特定架 構暫存器A之最後微運算碼(相對於追蹤之微運算碼之原始 程式順序)的條件下串體才設定各微運算碼之傳播位元, 從而產生駐外值。在某些具时施㈣,原始程式順序係 不同於已排程VLIW追縱之執行順序,且在其他具體實施 例中,該等順序係相同的。- or multiple micro-ops associated. For the forward skip scheduling and optimization - terminal tracking, and only in the micro-matrix to write the last micro-computing code of the specific architecture register A (relative to the original program of the tracking micro-computing code) Under the condition of the sequence), the string is set to propagate the bit of each micro-opcode, thereby generating an external value. In some cases (4), the original program order is different from the order in which the scheduled VLIW tracks, and in other embodiments, the order is the same.
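The last-writer rule for propagate bits can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `MicroOp` record, the function names, and the sample register names are assumptions introduced here, and the trace is modeled only as a list of destination registers in original program order.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class MicroOp:
    name: str              # label for the micro-op (hypothetical)
    dest: Optional[str]    # architectural register written, or None


def propagate_bits(trace: List[MicroOp]) -> List[bool]:
    """Return one flag per micro-op, in original program order.

    A flag is True only for the last micro-op that writes a given
    architectural register, i.e. the one producing the live-out value.
    """
    seen: Set[str] = set()
    bits = [False] * len(trace)
    # Walk backwards: the first writer met for a register is its last
    # writer in program order.
    for i in range(len(trace) - 1, -1, -1):
        dest = trace[i].dest
        if dest is not None and dest not in seen:
            seen.add(dest)
            bits[i] = True
    return bits


def propagation_set(trace: List[MicroOp]) -> Set[str]:
    """Registers the terminal trace writes (the set carried by fork.skip)."""
    return {op.dest for op in trace if op.dest is not None}


terminal_trace = [
    MicroOp("u0", "rax"),
    MicroOp("u1", "rbx"),
    MicroOp("u2", "rax"),   # last writer of rax
    MicroOp("u3", None),    # writes no architectural register
]
bits = propagate_bits(terminal_trace)
```

Here only `u1` and `u2` would carry a propagate bit, while the propagation set contains both `rax` and `rbx`.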

When a micro-op that targets architectural register A executes with its propagate bit set, the micro-op's output value V is sent to the successor strand S (of the current strand). Conceptually, V is then written into the register file of strand S, so that any attempt by S to read architectural register A, before a micro-op in S overwrites A with a new (locally produced) value, receives V. If successor strand S has stalled attempting to read the live-in architectural register A, then, because V has now arrived, S is unblocked and continues executing. In some circumstances the parent strand writes a particular live-out architectural register before the successor strand reads it. In that case the register is written in the background into the successor strand's register file (e.g., any of register files 194A.1 through 194A.4) and is not a source of stalls. Architectural registers that are not members of the propagation set are not written by the terminal trace, so the successor strand inherits the values those registers held at the start of the terminal trace.
Those values are propagated in the background into the register file associated with the successor strand. If the successor strand accesses an inherited architectural register before it has been propagated, the successor strand stalls.

Fig. 2 illustrates an example of this propagation. After fork micro-op 211 creates the successor strand, the first three bundles of the successor strand's first trace execute (in cycles 3, 4 and 5, respectively), because those bundles do not depend on any live-in registers (e.g., live-out registers from the parent strand's terminal trace). However, when bundle 283 attempts to execute during cycle 6, it stalls, because it depends on two live-in architectural registers that the parent strand's terminal trace has not yet produced. In cycle 9, bundle 269 of the parent strand's terminal trace computes the live-out values of those two registers via micro-ops 215 and 216, respectively, and propagates the values to the successor strand. The values reach the core executing successor strand 201 several cycles later (e.g., corresponding to the inter-core communication latency), and in cycle 12 the (successor strand) trace 201 wakes up and executes bundles 284 and 285. When the next bundle of the successor strand attempts to read %rdi, no value is available yet. The parent strand produces the live-out value of %rdi via micro-op 217 in cycle 13 and propagates it to successor strand 201 for arrival in cycle 16. Bundle 286 then wakes up and executes. The figure also illustrates background propagation of some live-out architectural registers (e.g., %rsp and %xmmh0, propagated by micro-ops 213 and 214, respectively) before the successor strand reads those registers.
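The cycle numbers in the Fig. 2 narrative are consistent with a fixed inter-core latency of three cycles. The short sketch below works through that arithmetic; the constant and both function names are assumptions made for illustration, and a real design need not have a fixed latency.

```python
INTER_CORE_LATENCY = 3  # cycles; assumed, chosen to match the Fig. 2 narrative


def arrival_cycle(produced_in: int, latency: int = INTER_CORE_LATENCY) -> int:
    """Cycle in which a propagated live-out value reaches the successor's core."""
    return produced_in + latency


def wake_cycle(stalled_since: int, produced_in: list) -> int:
    """Earliest cycle a stalled bundle can run: every live-in must have arrived."""
    return max([stalled_since] + [arrival_cycle(c) for c in produced_in])


# A bundle stalls in cycle 6 on two live-ins that the parent produces in cycle 9.
first_wake = wake_cycle(6, [9, 9])
# %rdi is produced by the parent in cycle 13.
rdi_arrival = arrival_cycle(13)
```

Under this assumption the stalled bundle resumes in cycle 12 and the %rdi value arrives in cycle 16, matching the figure description.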
In some circumstances, the parent strand attempts to overwrite an architectural register before the value that the successor strand is to inherit for that register has been sent to the successor. In some embodiments, interlock hardware prevents the parent strand from overwriting a register's old value until that old value is en route to the successor. In some circumstances, the successor strand overwrites a live-in architectural register, without having read it, before the parent strand has propagated the corresponding live-out value. In some embodiments, the successor strand notifies the parent strand that it is no longer waiting for the propagated register value, because the successor has a newer (locally produced) value.

Various embodiments use various mechanisms to propagate register values from the parent strand to the successor strand. Some embodiments use different propagation mechanisms and/or priorities for propagated live-out registers versus inherited registers. In some embodiments, register values are not copied. Instead, the successor strand uses a copy-on-write register caching mechanism to retrieve inherited and live-out values from the parent strand on demand. The mechanism uses copy-on-write functionality to prevent an inherited value from being overwritten by the parent strand before it has been communicated to the successor, and to suppress propagation when the successor no longer depends on a value. In some embodiments, a register renaming mechanism is used to avoid copying actual values. The fork operation copies the parent strand's rename table to the successor strand (rather than copying values), and the two strands share one or more physical registers until a strand overwrites one or more of those physical registers.
Speculative Strand Threading (SST)

SST Overview

Strandware partitions the target software into a plurality of independently executable strands to achieve increased parallelism, performance, or both. Strandware and hardware work together to dynamically profile the target software and detect relatively large regions of its control and data flow that have relatively few or no dependencies on one another. Strandware converts each region into a strand by inserting a fork point at its beginning and a join point/fork target at its end. Strands are program-ordered relative to one another and execute independently.

In various embodiments, hardware and strandware continue to monitor and refine the choice of fork and join points based on real-time feedback from observing and profiling dynamic control flow and data dependencies, achieving, in some usage scenarios, improved performance, improved adaptability, and improved robustness.

Strand Scope Identification

In some speculative multithreading embodiments, a fork point produces two parallel strands: a new successor strand that begins executing at the fork target address in the target software, and the existing parent strand, which continues executing (in the target software) after the fork point. A trace predictor and/or branch predictor dynamically selects the fork target.

After a fork, the scope (e.g., lifetime) of the parent strand comprises all code executed after the fork operation and before the parent strand's execution path reaches the successor strand's initial start address or reaches some other limit. The strandware strand-profiling subsystem derives the scope of each strand.
If strandware identifies a loop for parallelization, the fork point (where a fork operation is executed) and the fork target (where the successor strand begins executing) both refer to the top of the loop, and the branch that terminates the loop limits the scope of the parent strand. In the case of a conditional branch at the end of the loop (which jumps to the top of the loop for the next iteration), the terminating direction is the not-taken direction of the branch.

Strandware uses heuristics to identify terminating branches and directions based on the output of various compilers (e.g., GCC, Microsoft Visual Studio, Sun Studio, the PathScale compiler suite, and PGI). Compilers produce roughly equivalent control-flow idioms for a given instruction set (e.g., x86). For example, the boundary of a loop is identified by finding any taken branch that skips to the basic block immediately following the basic block that jumps backward to the top of the loop for the next iteration.
Other terminating branches include return instructions and unconditional branches to addresses after the last basic block in the loop body.

Consider a call-return fork, in which the fork origin immediately precedes a function call (e.g., before a CALL instruction) and the target address immediately follows the call instruction (i.e., at the return address). The scope of the parent strand is determined solely by the body of the called function and is terminated by the intersection of the parent strand with the return address. Dynamically, function calls return to the call site relatively frequently, unless the program executes error code or an exception handler.

Other, relatively more general, types of fork exist, for example when the fork point is at the start of a relatively large block of code and the fork target is after the end of that block. Internal branches within the block (e.g., within the scope of the parent strand) optionally exit the block and branch into the successor's scope. Strandware identifies such internal branches and instruments them as terminating branches. In various embodiments, the various structured programming cases (e.g., loops, calls, and returns) are handled as part of a more general control-flow analysis technique.

In some embodiments, terminating branches are found by performing a depth-first traversal of the basic blocks of the control-flow graph, starting at the basic block containing the fork origin and recursively following both the taken and not-taken exits of each branch. In usage scenarios, locating terminating branches is complicated by a variety of conditions (e.g., branches not mapped into the address space, invalid branch targets, and other conditions that make control-flow changes difficult to determine). However, strandware preserves the correctness of the target software even if it does not detect all terminating branches. Tolerating undetected terminating branches also enables strandware to operate without any knowledge of high-level program structure information (e.g., source code).
Strandware identifies and instruments traces containing terminating branches by injecting conditional kill micro-ops into those traces. A conditional kill micro-op specifies a condition; if the condition evaluates to true, all successor strands are aborted. If the executing strand is speculative, execution of an alternative type of conditional kill micro-op aborts the strand executing the kill micro-op together with all of its successor strands (see the section "Bridge Traces and Live-In Register Prediction" located elsewhere herein).

If a terminating basic block ends with a branch micro-op (e.g., "br.cc R1,R2,T", which compares registers R1 and R2 and takes the branch only if comparison condition cc is true), strandware injects a matching kill micro-op, e.g., "kill.cc R1,R2,T". The kill micro-op specifies the same cc, R1 and R2 as the branch.

Nested Strands

To maintain fully deterministic execution of the target software, in some embodiments strandware uses a strictly program-ordered, non-nested speculative multithreading model in which a parent strand P has at most one successor strand S1 (S1 optionally recursing to a successor strand S2, and so on). Some embodiments enable a strand to have a plurality of successor prefetch strands (optionally in addition to a single non-prefetch successor strand), because prefetch strands do not modify architectural state.

In some programs, P encounters another fork point before joining S1. To preserve deterministic behavior, nested fork points are suppressed while a successor strand exists. To ensure that P does eventually join S1, strandware and hardware implement functionality (e.g., timeouts) to detect and abort runaway strands, and re-analyze the target software for terminating branches to reduce or prevent future occurrences.
Each kill micro-op is tagged with a strand scope identifier, so that if the fork point for a strand scope is suppressed, any kill micro-ops for that strand scope are suppressed as well.

To support recursive functions, each strand maintains a private fork nesting counter (initialized to zero when the strand is created) that is incremented whenever a fork is suppressed. When hardware processes a kill micro-op, if the strand's nesting counter is zero the kill micro-op performs its abort; otherwise the nesting counter is decremented and no strand is aborted.

Candidate Strand Selection

In some usage scenarios, certain loops are good candidates for speculative multithreading (with one or a plurality of iterations per strand). In some embodiments, hardware includes profiling logic, and strandware synthesizes instrumentation code (which interacts with the profiling logic) for determining which loops are suitable for splitting into parallel strands.

Each backward (loop) branch in the target software has a unique target physical address P that strandware uses for identification and profiling.
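Returning to the fork nesting counter used for recursive functions, the rule can be sketched as a small state machine. This is only an illustration under stated assumptions: the class and method names are invented here, and clearing `has_successor` on an effective kill is an assumed simplification of what an abort does.

```python
class StrandForkState:
    """Per-strand fork bookkeeping (names are illustrative, not from the source)."""

    def __init__(self) -> None:
        self.has_successor = False
        self.nesting = 0  # private fork nesting counter, zero at strand creation

    def fork(self) -> bool:
        """Return True if a successor is created, False if the fork is suppressed."""
        if self.has_successor:
            self.nesting += 1   # nested fork: suppress it and count it
            return False
        self.has_successor = True
        return True

    def kill(self) -> bool:
        """Return True if the kill micro-op aborts, False if it is absorbed."""
        if self.nesting == 0:
            self.has_successor = False  # assumed effect of the abort
            return True
        self.nesting -= 1
        return False


s = StrandForkState()
# Outer fork, nested (suppressed) fork, then the two matching kill micro-ops.
events = [s.fork(), s.fork(), s.kill(), s.kill()]
```

In this sequence the first kill only unwinds the suppressed inner fork; the second kill is the one that takes effect.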
Hardware filters out loops determined to be too small to optimize productively by tracking total cycles and iterations and applying strandware-tunable thresholds to both (e.g., hardware filters out loops with fewer than 256 cycles per iteration). Hardware allocates a loop profile counter (LPC), indexed by P, to each relatively large loop. The LPC holds total cycles, iterations, confidence estimators, and other information relevant to determining whether the loop is a good candidate for optimization. Strandware periodically examines the LPCs to identify strand candidates. Strandware manages the LPCs. In various embodiments, one or more of the LPCs are cached in hardware and/or stored in memory.

In some embodiments, similar techniques are used for other types of candidate strands (e.g., called functions). For calls, a set of call profile counters (CPC) is optionally used to record various statistics, such as the number of cycles spent in the called function, which registers are modified, the most likely return values, and other information potentially useful in determining whether a strand is a good candidate for optimization.

Strand Nesting Graph Construction

In some embodiments, strandware dynamically constructs one or more data structures representing relationships between regions of the target code, as strands or candidate strands known to strandware. Strandware uses these structures to track the nesting of strands within one another. For example, the structures cover a plurality of nested loops (e.g., an inner loop and an outer loop), and a strand optionally contains a nested function call (the called function itself being a strand) or one or more loops.

In some embodiments, strandware adds instrumentation code to translated micro-ops (e.g., held in the translation cache) so that the nesting data structures are updated at run time as the translated code executes. In some embodiments, hardware includes logic to assist in discovering strand nesting relationships.

Based on the strand nesting hierarchy represented in the strand nesting data structures, strandware uses heuristics to select relatively more effective regions of code to convert into strands, and strandware instruments each selected strand for the further profiling described below. In some embodiments, the heuristics include one or more techniques for selecting an appropriate strand from among nested inner and outer loops.

Instrumentation for Profiling

Based on the fork origin, the fork target, and the set of terminating branches and their respective directions, strandware injects instrumentation into the micro-op-based translation of the target software (e.g., stored in the translation cache) to form a strand with a complete and correctly defined scope. In some embodiments, strandware injects a profiling fork into the trace containing the basic block at the fork origin.
The profiling fork instructs hardware to create a profiling strand, e.g., as described in the sections "Parent Strand Profiling" and "Successor Strand Profiling" located elsewhere herein. Strandware identifies and instruments the traces containing each terminating branch, as described in the section "Strand Scope Identification" located elsewhere herein.

Parent Strand Profiling

After instrumentation for profiling, the next time the trace containing the fork point executes, a profiling strand is created as the successor of the parent strand. The profiling strand blocks until the parent strand intersects the profiling strand's start address. The profiling strand then begins executing while the parent strand blocks. When the profiling strand completes (e.g., via an intersection, a terminating branch, or another fork), the parent strand unblocks and joins the profiling strand. As described below, hardware invokes strandware to complete strand construction.

After executing a profiling fork, hardware enters a special profiling mode to execute the remainder of the parent strand.

For each occurrence of certain events in the parent strand, a strand execution profiling record (SEPR) is created and written into a memory buffer that strandware allocates to hold the SEPRs produced by the parent strand. In some preferred embodiments, an SEPR is written when certain types of memory accesses (loads and stores) are performed. In some embodiments, additional SEPRs are written, e.g., to record the execution of basic blocks, traces, control-flow changes, or similar data, so that strandware can later reconstruct the exact sequence of code executed by the strand.
Successor Strand Profiling

When the parent strand completes, it blocks while the successor (profiling) strand executes and identifies register and memory dependencies. With respect to register dependencies, as the successor strand executes, hardware updates a per-strand bitmask whenever an architectural register is first read before it has been written by the successor strand. The bitmask indicates which of the parent strand's live-outs are used as live-ins by the successor strand.

With respect to memory dependencies, in some embodiments a transactional memory versioning system supports speculation within the data cache. When a strand loads data, hardware places a reservation on the memory location at cache-line (or byte-level) granularity. Hardware tracks reservations by updating a bitmap of which bytes (or chunks of multiple bytes) a speculative strand has loaded. Hardware optionally tracks metadata, such as which particular future strands have loaded a memory location. Hardware stores the bitmaps with the cache lines and/or in separate structures.

The loaded data comes from a strand that wrote the location earlier, in program order, than the loading strand (e.g., the latest of all such strands). In some circumstances, that earlier strand is the architectural strand (e.g., when the line is clean); in others, it is a speculative strand that precedes the loading strand (e.g., when the line is dirty).

When a strand writes to a cache line, hardware checks whether any future strand holds reservations on that line. If so, hardware has detected cross-strand aliasing and aborts the future strand and any later strands. Alternatively, hardware notifies strandware of the cross-strand aliasing, enabling strandware to implement a flexible software-defined policy for aborting strands.
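The reservation check performed on a store can be sketched as below. This is a simplified model under stated assumptions: strands are identified by integers that increase in program order, each strand's reservations on one cache line are a set of byte offsets, and the function name and parameters are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def strands_to_abort(
    reservations: Dict[int, Set[int]],
    writer: int,
    written: Set[int],
    all_strands: Iterable[int],
) -> List[int]:
    """Return the strands aborted when `writer` stores to bytes `written`.

    reservations maps a strand's program-order position to the byte offsets
    it has speculatively loaded from this cache line.
    """
    # Future strands whose loaded bytes overlap the bytes being written.
    aliasing = [s for s, loaded in reservations.items()
                if s > writer and loaded & written]
    if not aliasing:
        return []
    first = min(aliasing)
    # The earliest aliasing strand and every later strand are aborted.
    return sorted(s for s in all_strands if s >= first)


reservations = {1: {0, 1, 2, 3}, 2: {8, 9}, 3: {4}}
aborted = strands_to_abort(reservations, writer=0, written={8},
                           all_strands=[0, 1, 2, 3])
no_alias = strands_to_abort(reservations, writer=2, written={0},
                            all_strands=[0, 1, 2, 3])
```

In the first call strand 2 has loaded byte 8, so strands 2 and 3 are aborted; in the second call the only later strand has not loaded byte 0, so nothing is aborted. The alternative described above, notifying strandware instead of aborting, would replace the returned list with a software-defined decision.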
After the previous generation string has been completed, the Ding "crossing string confusion does not occur; the hardware depends on the program order (relative to the string order, the husband must A, the order (the order of the micro-opcodes of the relative input and the storage) performs all the loads... J' so the hard-wired system is free for other uses. Hard-spot::: when the second type is in some implementations In the example system (such as string memory transfer), using the memory reserve hardware to analyze the analysis of the traversing of the _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ At the end of the call, the end of the analysis. Other types of points (such as call or generalization) have possible unrestricted categories, and therefore the system uses heuristics to limit the progressive line-by-line analysis of the string has been completed, ±^= It is detected that the upper generation string is unblocked and the string body is heard: the line-construction is completely speculated _ the joint of the tools and equipment required = the data flow chart constructed by SEPR is used to construct the system. Data, string body Beginning to build a data flow table for the external nodes of the root node 136758.doc -53- 200935303 It is also explained in (4), when performing the previous generation string, what hardware and hardware are used as the hardware to perform tracking and/or recording of one of the basic blocks Save the program (4) heart list, and also save the cache memory tag and the related load and store index - data. Use the record 'string body to decode each basic block in each executed track to be programmed. The stream of micro-computing code. For the construction of the Qing, use the = device to rename the table 'convert the micro-operating code operand to the index of the earlier micro-computing code. 
To track memory dependencies, strandware maintains a memory rename table that maps cache locations to the latest store operation that wrote to an address. Loads and stores thus optionally designate a previous store as a source operand. Strandware uses the cache information recorded in the SEPRs in conjunction with the memory rename table to include memory dependencies in the DFG.

At the end of this process, all micro-ops executed in the parent strand have been incorporated into the data-flow graph, with the root nodes (live-outs) of the graph pointed to by the current register rename table and memory rename table.

Bridge Traces and Live-In Register Prediction

The live-in set of a speculative successor strand (e.g., the final live-outs of the parent strand) is predicted from the architectural register values that exist when the parent strand forks. Starting from each live-out (both register and memory), strandware performs a depth-first search of the dynamic DFG to produce the subset of micro-ops that generate that live-out. The union of all the subsets, in program order, is the live-out generating set.

Strandware then creates a bridge trace that starts from the architectural register and memory values at the fork point in the parent strand and includes only the live-out generating set needed to predict the final live-outs (as indicated by the live-in bitmask of the speculative successor strand). The bridge trace also copies any live-out register predictions to a memory buffer. The system later uses these copies to detect mispredictions.

When a trace forks to a speculative strand, strandware sets up the new strand to begin executing at the bridge trace rather than at the first micro-op of the speculative strand.
In addition to handling register dependencies, the bridge trace converts any terminating branches (and the associated micro-ops that compute the branch conditions) into micro-ops that abort the speculative strand. Finally, the bridge trace sets up various registers for the strand (e.g., pointers to the predicted memory value list and the deferral list) and an unconditional branch to the start of the speculative strand.

Bridge Trace Optimization

Once strandware has constructed a bridge trace, it attempts to reduce or minimize the trace's length using various dynamic optimization techniques. Certain idioms (e.g., spilling and filling registers, or many calls and returns within a strand) sometimes result in a register being repeatedly loaded from and stored to the stack without change. Similarly, the stack pointer or another register is sometimes repeatedly incremented or decremented, so that, taken together, the dependency chain is equivalent to adding a constant.

Strandware recognizes at least some of these idioms and patterns and optimizes the dependency chains into relatively few or fewer operations. For example, strandware short-circuits store-load-use sequences: a load that reads data from a store is speculatively replaced by the value of the earlier store (the speculation being verified at the join point along with the register and memory predictions).

If strandware cannot reduce the bridge trace to a predetermined or programmable length, it abandons optimization of the strand. Abandonment occurs in various circumstances, e.g., when a true cross-strand register dependency exists, or when a live-out is computed relatively late in the parent strand and consumed relatively early in the successor strand (resulting in a relatively long dependency chain).
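One of the idioms mentioned above, a register that is repeatedly incremented and decremented, can be collapsed as sketched below. The sketch assumes the chain consists only of add-immediate operations on independent registers; the function name and the `(register, delta)` encoding are illustrative choices, not the patent's representation.

```python
from typing import Dict, List, Tuple


def fold_constant_adds(chain: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Collapse a chain of `reg += delta` operations into one add per register.

    Registers whose net change is zero need no operation at all.
    """
    totals: Dict[str, int] = {}
    for reg, delta in chain:
        totals[reg] = totals.get(reg, 0) + delta
    return [(reg, total) for reg, total in totals.items() if total != 0]


# Two pushes and the matching stack adjustment, interleaved with a counter.
chain = [("rsp", -8), ("rsp", -8), ("rcx", 1), ("rsp", 16), ("rcx", 1)]
folded = fold_constant_adds(chain)
```

Here five operations shrink to a single add of 2 to `rcx`, since the stack pointer's net change is zero. A real optimizer would also have to confirm that no intermediate value of the register is observed by another micro-op, which this sketch does not check.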
Memory Value Prediction

For some strands, the bridge trace additionally predicts memory values. The strandware uses cross-strand store-to-load forwarding profile data (for example, as explained in the discussion of profiling elsewhere herein) to determine which memory locations are forwarded across strands. In some embodiments, the strandware accesses the tags and metadata of the hardware data cache to construct a list of cache locations that are forwarded across strands. The strandware looks up each affected cache location in the memory renaming table for the DFG. The table points to the most recent store micro-op (in program order) to write to that location. The strandware then builds the sub-graph of micro-ops necessary to generate the value of the store micro-op (e.g., using depth-first search). The strandware includes these micro-ops in the bridge trace, along with any other micro-ops used to generate register value predictions.

The store micro-ops in the bridge trace decouple stores in the parent strand from the subsequent successor strand (the successor strand instead loads the predictions from the bridge trace). Finally, the strandware copies information about each predicted store into a per-strand store prediction validation list, which is later compared against the actual store values to validate the speculation. In various embodiments, the information includes the physical address of the store, the value stored, and a mask of the bytes written by the store (or alternatively, the size in bytes and the offset of the store).

Join Handler Trace

Each speculative strand constructed by the strandware has a matching bridge trace and join handler trace. The join handler trace validates the register or memory value predictions made by the bridge trace that were actually used (e.g., unused predictions are ignored). When a parent strand ends (e.g., via an intersection with the successor strand or another event), the hardware redirects the successor strand to begin executing the join handler defined for the parent strand.

For each register value prediction used, the join handler reads the predicted value from the memory buffer (e.g., as described in the section "Bridge Trace and Live-In Register Prediction" elsewhere herein) and compares the predicted value with the live-out value from the parent strand. The hardware includes "see-through" register read and memory load functions that enable the join trace to read both its own state (e.g., registers and memory) and the corresponding state of the parent strand for comparison.
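As a rough illustration of the validation data described above, the sketch below models a store prediction validation list whose entries carry a physical address, a value, and a byte mask, and checks both register and store predictions against the parent strand's live-outs. The `PredictedStore` layout, the 8-byte granule, and the plain dictionaries standing in for the hardware "see-through" reads are all assumptions of this example, not details from the patent.

```python
from dataclasses import dataclass


@dataclass
class PredictedStore:
    phys_addr: int   # physical address of the predicted store (8-byte aligned here)
    value: int       # predicted 64-bit value
    byte_mask: int   # bit i set => byte i was written by the store


def expand_mask(byte_mask):
    """Turn an 8-bit byte mask into a 64-bit bit mask."""
    bits = 0
    for i in range(8):
        if byte_mask & (1 << i):
            bits |= 0xFF << (8 * i)
    return bits


def validate_join(register_predictions, parent_registers,
                  store_predictions, parent_memory):
    """Return True if every used prediction matches the parent's live-outs."""
    for reg, predicted in register_predictions.items():
        if parent_registers[reg] != predicted:
            return False
    for entry in store_predictions:
        mask = expand_mask(entry.byte_mask)
        actual = parent_memory.get(entry.phys_addr, 0)
        # Only the bytes the store actually wrote take part in the comparison.
        if (actual & mask) != (entry.value & mask):
            return False
    return True


used_registers = {"r2": 6}
stores = [PredictedStore(phys_addr=0x100, value=0x1234, byte_mask=0b00000011)]

# Parent state as it would be observed at the join point.
ok = validate_join(used_registers, {"r1": 5, "r2": 6},
                   stores, {0x100: 0xABCD000000001234})
bad = validate_join(used_registers, {"r1": 5, "r2": 6},
                    stores, {0x100: 0x1235})
print(ok, bad)
```

A mismatch in any single entry is enough to reject the join, which mirrors the all-or-nothing nature of the check.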
Some embodiments compare only the registers read by the successor strand.

Similarly, to validate memory value predictions, the join trace iterates over the list of predicted stores that were used (in various embodiments, including one or more of the physical address, the value, and the byte mask for each entry) and compares each predicted store value with the locally produced live-out value of the parent strand at the same physical address. If the system detects any mismatch, the system aborts the successor strand, and the parent strand continues past the join point as if the system had never forked the successor strand.

If the join succeeds, the system discards the parent strand, and the successor strand becomes the new architectural strand for the corresponding VCPU.

Conclusion

Certain choices have been made in the description merely for convenience in preparing the text and drawings, and unless there is an indication to the contrary, the choices should not be construed per se as conveying additional information regarding the structure or operation of the embodiments described. Examples of the choices include: the particular organization or assignment of the designations used for the figure numbering, and the particular organization or assignment of the element identifiers (e.g., callouts or numerical designators) used to identify and reference the features and elements of the embodiments.

The words "includes" and "including" are specifically intended to be construed as abstractions describing logical sets of open-ended scope and are not meant to convey physical containment unless explicitly followed by the word "within".

Although the foregoing embodiments have been described in some detail for purposes of clarity of description and understanding, the invention is not limited to the details provided. There are many embodiments of the invention. The disclosed embodiments are exemplary and not restrictive.

It will be understood that many variations in construction, arrangement, and use are possible consistent with the description and are within the scope of the claims of the issued patent. For example, interconnect and function-unit bit widths, clock speeds, and the type of technology used are variable according to various embodiments in each component block. The names given to interconnect and logic are merely exemplary and should not be construed as limiting the concepts described. The order and arrangement of flowchart and flow diagram process, action, and function elements are variable according to various embodiments. Also, unless specifically stated to the contrary, value ranges specified, maximum and minimum values used, or other particular specifications (such as the ISA, numbers of cycles, and numbers of entries or stages in registers) are merely those of the described embodiments, are expected to track improvements and changes in implementation technology, and should not be construed as limitations.

Functionally equivalent techniques known in the art are employable instead of those described to implement various components, subsystems, functions, operations, in-line routines, subroutines, routines, procedures, macros, or portions thereof. It is also understood that many functional aspects of embodiments are realizable selectively in either hardware (i.e., generally dedicated circuitry) or software (i.e., via some manner of programmed controller or processor), as a function of embodiment-dependent design constraints and the technology trends of faster processing (facilitating migration of functions previously in hardware into software) and higher integration density (facilitating migration of functions previously in software into hardware).

Specific variations in various embodiments include, but are not limited to: differences in partitioning; different form factors and configurations; use of different operating systems and other system software; use of different interface standards, network protocols, or communication links; and other variations to be expected when implementing the concepts described herein in accordance with the unique engineering and business constraints of a particular application.

The embodiments have been described with detail and environmental context well beyond that required for a minimal implementation of many aspects of the embodiments described. Those of ordinary skill in the art will recognize that some embodiments omit disclosed components or features without altering the basic cooperation among the remaining elements. It is thus understood that much of the detail disclosed is not required to implement various aspects of the embodiments described. To the extent that the remaining elements are distinguishable from the prior art, components and features that are omitted are not limiting on the concepts described herein.

All such variations in design are insubstantial changes over the teachings conveyed by the described embodiments. It is also understood that the embodiments described herein have broad applicability to other applications and are not limited to the particular application or industry of the described embodiments. The invention is thus to be construed as including all possible modifications and variations encompassed within the scope of the claims of the issued patent.

[Brief Description of the Drawings]

FIG. 1A illustrates a system of strand-enabled computers, each strand-enabled computer having one or more strand-enabled microprocessors with access to a strandware image, memory, non-volatile storage, and input/output devices.

FIGS. 1B and 1C collectively illustrate conceptual hardware, strandware (software), and target software layers (e.g., subsystems) relating to a strand-enabled microprocessor.

FIGS. 2A, 2B, and 2C collectively illustrate an example of hardware executing a skip-ahead strand (e.g., as synthesized by the strandware), plotted against time in cycles versus core or interconnect. The description sometimes refers to FIGS. 2A, 2B, and 2C collectively as FIG. 2.

[Description of Main Element Symbols]

101  (x86) target software layer
102  operating system kernel
103.1 to 103.4  applications
104.1 to 104.6  VCPU
110, 110A, 110B  strandware layer
111  translation cache management
115  x86 binary translation
120  trace profiling and capture
121  physical page profiling
124  branch profiling
125  predictive optimization
127  memory profiling
130  promotion
140  strand construction
160  scheduling and optimization
162  memory aliasing analysis
163  optimization
165  schedule each micro-op
167  encode VLIW-like bundles
172  hardware control
174  virtual devices
175  interrupts, SMP, and timers
181  profiling hardware
182  hardware acceleration unit
183  transactional memory
184A  external system/strandware DRAM
186  chipset/PCIe bus interface
187  hardware x86 decoder
190  hardware layer
191.1 to 191.4  VLIW core
192A.1 to 192A.4  ALU
192B.1 to 192B.4  FPU
193.1 to 193.4  L1 D cache
194A.1 to 194A.4  register file
194B.1 to 194B.4  strand context
195  multi-core interconnect network
196  L2/L3 cache
197  DRAM controller and northbridge
198  multi-socket system interconnect
199  PCI Express, QPI, HyperTransport
200  parent strand
201  successor strand
211  fork.skip micro-op
212  propagate set
213, 214, 215, 216, 217  micro-ops
269, 280, 281, 282, 284, 285  bundles
2000.1, 2000.2  strand-enabled computer
2001.1, 2001.2  strand-enabled microprocessor
2002.1, 2002.2  dynamic random access memory element
2002.1A  strandware data
2002.1B  translation cache
2003  flash memory
2004  strandware image
2005  keyboard/display
2006  peripherals
2009  network
2010  storage
2011.1  profiling unit
2012.1  strand management unit
2013.1  VLIW core
2014.1  transactional memory
2050  coupling
2051.1, 2051.2  coupling
2052.1  coupling
2053  coupling
2055  coupling
2056  coupling
2063  coupling
2064  coupling

Claims (1)

X. Claims:

1. A system comprising:
strand construction means for dynamically constructing a profile-directed strand-partitioned thread portion of at least one of one or more threads, wherein the strand construction means is implemented at least in part via at least one of the one or more threads, and wherein the strand-partitioned thread portion of each strand-partitioned thread comprises a respective plurality of strand images;
execution means for strand-based processing of the one or more threads; and
wherein, for each strand-partitioned thread processed, the execution means enables simultaneous execution of an intra-thread plurality of strands corresponding to two or more of the respective plurality of strand images of the strand-partitioned thread.

2. The system of claim 1, wherein, for each strand-partitioned thread processed, the intra-thread plurality of strands comprises an architectural strand of the strand-partitioned thread and at least one or more successor strands of the strand-partitioned thread, the architectural strand updating an architectural strand context comprising the architectural state of the strand-partitioned thread, each successor strand being younger than the architectural strand of the strand-partitioned thread, and each successor strand updating a respective successor strand context comprising a speculative version of the architectural state of the strand-partitioned thread.

3. The system of claim 1, wherein the execution means further enables simultaneous execution of an inter-thread plurality of strands comprising an architectural strand associated with each strand-partitioned thread processed, each architectural strand updating a respective architectural strand context comprising respective architectural state.

4. The system of claim 3,
wherein, for each strand-partitioned thread processed, the intra-thread plurality of strands comprises the architectural strand of the strand-partitioned thread and at least one or more successor strands of the strand-partitioned thread, each successor strand being younger than the architectural strand of the strand-partitioned thread, and each successor strand updating a respective successor strand context comprising a speculative version of the architectural state of the strand-partitioned thread; and
wherein the strand contexts are held in one or more dedicated context stores within a microprocessor.

5. The system of claim 1, wherein each of the simultaneously executed intra-thread plurality of strands is executed on a respective core of a microprocessor.

6. The system of claim 1, wherein the simultaneously executed intra-thread plurality of strands [...].

7. The system of claim [...], wherein the strand construction means is implemented using one or more of executable code and microcode.

8. The system of claim 7, wherein at least part of the one or more of executable code and microcode is held in one or more non-volatile storage devices.

9. The system of claim 1, further comprising:
a memory comprising one or more DRAM devices; and
wherein portions of the memory are allocated to the strand construction means in support of constructing each strand-partitioned thread portion.

10. The system of claim 1, further comprising:
logic enabled to decode at least one type of strand construction micro-op and at least one type of strand destruction micro-op.

11. The system of claim 2, further comprising:
context storage dedicated to the strand contexts; and
strand join logic coupled to the context storage and enabled to perform, for at least one processed strand-partitioned thread of the one or more threads, hardware-assisted merging of a plurality of the strand contexts.

12. The system of claim 2, further comprising:
context storage dedicated to the strand contexts; and
strand fork logic coupled to the context storage and enabled, for at least one processed strand-partitioned thread of the one or more threads, to [...] a corresponding portion of the context of at least one of the successor strands.

13. The system of claim 2, further comprising:
a transactional memory comprising dedicated transactional memory storage and dedicated transactional memory control logic, the transactional memory being enabled to perform, for at least one processed strand-partitioned thread of the one or more threads, hardware-assisted version management of memory data, wherein multiple data versions are held for each of a plurality of memory locations, and wherein, for each of the plurality of memory locations, a first one of the versions corresponds to the architectural strand and at least a second one of the versions respectively corresponds to at least one of the successor strands.

14. The system of claim 1, further comprising:
analysis means for identifying one or more latent dependencies corresponding to respective cross-strand operations that occur between the plurality of simultaneously executed intra-thread strands and that alias one or more respective memory locations;
means for removing the one or more latent dependencies by replacing the respective cross-strand operations with one or more respective deferred operations;
resolution means for evaluating each of the deferred operations performed by the plurality of simultaneously executed intra-thread strands;
wherein the identifying and the replacing are enabled to operate dynamically during the processing of each strand-partitioned thread processed; and
wherein, with respect to execution of each strand-partitioned thread processed, results achieved via the processing by the plurality of simultaneously executed intra-thread strands are identical to the architecturally specified results for strictly sequential processing.

15. A method comprising:
dynamically constructing a profile-directed strand-partitioned thread portion of at least one of one or more threads, wherein the dynamically constructing is implemented at least in part via at least one of the one or more threads, and wherein the strand-partitioned thread portion of each strand-partitioned thread comprises a respective plurality of strand images;
strand-based processing of the one or more threads; and
wherein, for each strand-partitioned thread processed, an intra-thread plurality of strands corresponding to two or more of the respective plurality of strand images of the strand-partitioned thread are executed simultaneously.

16. The method of claim 15,
wherein, for each strand-partitioned thread processed, the intra-thread plurality of strands comprises an architectural strand of the strand-partitioned thread and at least one or more successor strands of the strand-partitioned thread, each successor strand being younger than the architectural strand of the strand-partitioned thread;
wherein, for each strand-partitioned thread processed, the architectural strand updates an architectural strand context comprising the architectural state of the strand-partitioned thread; and
wherein, for each strand-partitioned thread processed, each successor strand updates a respective successor strand context comprising a speculative version of the architectural state of the strand-partitioned thread.

17. The method of claim 15, further comprising simultaneously executing an inter-thread plurality of strands comprising an architectural strand associated with each strand-partitioned thread processed, each architectural strand updating a respective architectural strand context comprising respective architectural state.

18. The method of claim 15, wherein the one or more threads comprise all or any portion of an application executing in user mode and/or an operating system kernel executing in privileged mode.

19. The method of claim 15, wherein the one or more threads comprise all or any portion of a virtual machine monitor and/or one or more operating system kernels managed by the virtual machine monitor.

20. The method of claim 15, wherein the one or more threads are in accordance with at least a first instruction set architecture and the respective plurality of strand images are in accordance with a second instruction set architecture.

21. The method of claim 15, wherein the dynamically constructing is automatic and is unobservable by the one or more threads.

22. The method of claim 15, further comprising executing each of the simultaneously executed intra-thread plurality of strands on a respective core of a microprocessor.

23. The method of claim 15, further comprising executing each of the simultaneously executed intra-thread plurality of strands on a respective functional unit of a microprocessor.
TW97148039A 2007-12-10 2008-12-10 Strand-based computing hardware and dynamically optimizing strandware for a high performance microprocessor system TW200935303A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US1274107P 2007-12-10 2007-12-10

Publications (1)

Publication Number Publication Date
TW200935303A true TW200935303A (en) 2009-08-16

Family

ID=40756092

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97148039A TW200935303A (en) 2007-12-10 2008-12-10 Strand-based computing hardware and dynamically optimizing strandware for a high performance microprocessor system

Country Status (2)

Country Link
TW (1) TW200935303A (en)
WO (1) WO2009076324A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI514262B (en) * 2011-12-30 2015-12-21 Intel Corp Method and system for identifying and prioritizing critical instructions within processor circuitry
US9292288B2 (en) 2013-04-11 2016-03-22 Intel Corporation Systems and methods for flag tracking in move elimination operations
TWI571799B (en) * 2010-09-25 2017-02-21 英特爾公司 Apparatus, method, and machine readable medium for dynamically optimizing code utilizing adjustable transaction sizes based on hardware limitations
TWI587146B (en) * 2010-09-24 2017-06-11 英特爾公司 Implementing quickpath interconnect protocol over a pcie interface
TWI617986B (en) * 2014-03-27 2018-03-11 萬國商業機器公司 Dispatching multiple threads in a computer

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8230410B2 (en) * 2009-10-26 2012-07-24 International Business Machines Corporation Utilizing a bidding model in a microparallel processor architecture to allocate additional registers and execution units for short to intermediate stretches of code identified as opportunities for microparallelization
US8495307B2 (en) 2010-05-11 2013-07-23 International Business Machines Corporation Target memory hierarchy specification in a multi-core computer processing system
US9405551B2 (en) * 2013-03-12 2016-08-02 Intel Corporation Creating an isolated execution environment in a co-designed processor
US9870226B2 (en) * 2014-07-03 2018-01-16 The Regents Of The University Of Michigan Control of switching between executed mechanisms

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3632635B2 (en) * 2001-07-18 2005-03-23 日本電気株式会社 Multi-thread execution method and parallel processor system
US20040216101A1 (en) * 2003-04-24 2004-10-28 International Business Machines Corporation Method and logical apparatus for managing resource redistribution in a simultaneous multi-threaded (SMT) processor
US7000048B2 (en) * 2003-12-18 2006-02-14 Intel Corporation Apparatus and method for parallel processing of network data on a single processing thread

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI587146B (en) * 2010-09-24 2017-06-11 英特爾公司 Implementing quickpath interconnect protocol over a pcie interface
TWI571799B (en) * 2010-09-25 2017-02-21 英特爾公司 Apparatus, method, and machine readable medium for dynamically optimizing code utilizing adjustable transaction sizes based on hardware limitations
TWI514262B (en) * 2011-12-30 2015-12-21 Intel Corp Method and system for identifying and prioritizing critical instructions within processor circuitry
US9323678B2 (en) 2011-12-30 2016-04-26 Intel Corporation Identifying and prioritizing critical instructions within processor circuitry
US9292288B2 (en) 2013-04-11 2016-03-22 Intel Corporation Systems and methods for flag tracking in move elimination operations
TWI617986B (en) * 2014-03-27 2018-03-11 萬國商業機器公司 Dispatching multiple threads in a computer

Also Published As

Publication number Publication date
WO2009076324A2 (en) 2009-06-18
WO2009076324A3 (en) 2009-08-13
